Hi,
I have enabled user-space PMU access on both cores of a Cortex-A9 by building a kernel module. I then follow the standard PMU counting procedure:
1. Disable performance counters
2. Set cycle counter tick rate
3. Reset performance counters
4. Enable performance counters
5. Call function to profile
6. Disable performance counters
7. Read out performance counters
8. Check that performance counters did not overflow
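A rough sketch of how the eight steps above could map onto the ARMv7-A PMU registers (PMCR, PMCNTENSET/PMCNTENCLR, PMOVSR, PMCCNTR). The helper names here (pmcr_bits, counter_overflowed, pmu_*, profile_cycles) are made up for illustration and are not the functions used in the code further down. The register accesses only build for 32-bit ARM and will trap unless user-mode access has already been enabled on the core that executes them.

```c
#include <stdint.h>

/* PMCR (Performance Monitor Control Register) bits, ARMv7-A. */
#define PMCR_E (1u << 0)  /* enable the counters globally */
#define PMCR_P (1u << 1)  /* reset all event counters */
#define PMCR_C (1u << 2)  /* reset the cycle counter */
#define PMCR_D (1u << 3)  /* cycle counter ticks once every 64 cycles */

/* The cycle counter's bit in PMCNTENSET, PMCNTENCLR and PMOVSR. */
#define PMU_CCNT_BIT (1u << 31)

/* Build a PMCR value covering steps 2 (tick rate), 3 (reset), 4 (enable). */
static uint32_t pmcr_bits(int enable, int reset_events, int reset_ccnt, int div64)
{
    uint32_t v = 0;
    if (enable)       v |= PMCR_E;
    if (reset_events) v |= PMCR_P;
    if (reset_ccnt)   v |= PMCR_C;
    if (div64)        v |= PMCR_D;
    return v;
}

/* Step 8: test one overflow flag in a PMOVSR value.
 * idx is 0..5 for the six Cortex-A9 event counters, 31 for the cycle counter. */
static int counter_overflowed(uint32_t pmovsr, unsigned idx)
{
    return (int)((pmovsr >> idx) & 1u);
}

#if defined(__arm__)
static inline void pmu_write_pmcr(uint32_t v)
{ __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v)); }
static inline void pmu_enable(uint32_t mask)      /* PMCNTENSET */
{ __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(mask)); }
static inline void pmu_disable(uint32_t mask)     /* PMCNTENCLR */
{ __asm__ volatile("mcr p15, 0, %0, c9, c12, 2" :: "r"(mask)); }
static inline void pmu_clear_ovsr(uint32_t mask)  /* PMOVSR, write 1 to clear */
{ __asm__ volatile("mcr p15, 0, %0, c9, c12, 3" :: "r"(mask)); }
static inline uint32_t pmu_read_ovsr(void)
{ uint32_t v; __asm__ volatile("mrc p15, 0, %0, c9, c12, 3" : "=r"(v)); return v; }
static inline uint32_t pmu_read_ccnt(void)        /* PMCCNTR */
{ uint32_t v; __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v)); return v; }

/* Steps 1-8 around one call to fn(), on the core that runs this function.
 * Returns 0 on success, -1 if the cycle counter overflowed. */
static int profile_cycles(void (*fn)(void), uint32_t *cycles)
{
    pmu_disable(PMU_CCNT_BIT);              /* 1. disable */
    pmu_write_pmcr(pmcr_bits(1, 1, 1, 0));  /* 2-3. tick every cycle, reset */
    pmu_clear_ovsr(PMU_CCNT_BIT);           /*      clear a stale overflow flag */
    pmu_enable(PMU_CCNT_BIT);               /* 4. enable */
    fn();                                   /* 5. code under test */
    pmu_disable(PMU_CCNT_BIT);              /* 6. disable */
    *cycles = pmu_read_ccnt();              /* 7. read */
    return counter_overflowed(pmu_read_ovsr(), 31) ? -1 : 0;  /* 8. check */
}
#endif
```

Every one of these registers is banked per core, so the sequence only affects the core it runs on.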
The program reads the PMU counter values successfully and without overflow. The problem is this:
When a single thread runs on one core, the cycle counter and the programmable counters report the correct values (they match the PAPI results). However, when two separate threads run on the two cores, the cycle counter always returns the same value no matter which code is being profiled (PAPI likewise reports an unchanged cycle count for different code). I create the two threads with the pthread library and pin one to each core using processor affinity; the platform is a Zynq-7000 SoC.
So this happens with PAPI as well as with direct PMU access. Did I miss an important step for reading the PMU correctly when multiple threads run on the two cores?
Thanks.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The example profiling code looks like this:
matrix_multply1(){
    cpu_set_t cpuset;
    cpu_set_t cpuget;
    CPU_ZERO(&cpuset);
    CPU_SET(1, &cpuset);
    /* pthread_*affinity_np() return 0 on success and a positive error number on failure */
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) != 0) {
        fprintf(stderr, "set thread affinity failed\n");
    }
    if (pthread_getaffinity_np(pthread_self(), sizeof(cpuget), &cpuget) != 0) {
        printf("can not get thread affinity!\n");
    }
    if (CPU_ISSET(1, &cpuget)) {
        printf("i am running on processor 1\n");
    }
    ......
}
matrix_multply2(){
    ......
}
main(){
    /* disable all counters */
    disable_ccnt();
    disable_pmn(0);
    disable_pmn(1);
    disable_pmn(2);
    disable_pmn(3);
    disable_pmn(4);
    disable_pmn(5);
    /* reset counters */
    reset_ccnt();
    reset_pmn();
    /* select the event for each programmable counter */
    pmn_config(0, 0x11); // total cycles
    pmn_config(1, 0x68); // total instructions
    pmn_config(2, 0x04); // data cache accesses
    pmn_config(3, 0x03); // data cache misses
    pmn_config(4, 0x10); // mispredicted branches
    pmn_config(5, 0x12); // predictable branches
    /* enable counters */
    enable_ccnt();
    enable_pmn(0);
    enable_pmn(1);
    enable_pmn(2);
    enable_pmn(3);
    enable_pmn(4);
    enable_pmn(5);
    // motion estimation dual
    for (i = 0; i < len; i++)
    {
        pthread_create(&thread1, NULL, (void *) &matrix_multply1, (void *) &medata1);
        pthread_create(&thread2, NULL, (void *) &matrix_multply2, (void *) &medata2);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
    }
    /* read the counters */
    time_end = rdtsc32();
    time_end1 = read_pmn(0);
    time_end2 = read_pmn(1);
    time_end3 = read_pmn(2);
    time_end4 = read_pmn(3);
    time_end5 = read_pmn(4);
    time_end6 = read_pmn(5);
    printf("cycle=%d\n instruction=%d\n cache access=%d\n"
           "cache miss=%d\n branch miss-predicted=%d\n"
           "Predictable branches=%d\n",
           (time_end1 - time_start1)/len,
           (time_end2 - time_start2)/len,
           (time_end3 - time_start3)/len,
           (time_end4 - time_start4)/len,
           (time_end5 - time_start5)/len,
           (time_end6 - time_start6)/len);
}
For example, when I change the matrix size or the loop length, the cycle count stays the same, whereas the time measured with clock() does change.
Your code seems to be doing something different from what your question describes. Are you benchmarking one main() program, run in user space with different 'len' settings, or do you want to measure the cycles of each thread? Have you considered using the kernel profiling tool perf?
Also, you need to make sure that user-space access to the PMU is enabled. The following write to PMUSERENR has to be executed in privileged mode (for example from a kernel module) on each core:
asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));
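If the aim is per-thread cycle counts, one possible arrangement is for each thread to read the counter itself after pinning. The PMU registers are private to each Cortex-A9 core, so a CP15 read returns the counter of whichever core executes it. The following is only a sketch under that assumption: read_ticks, enable_ccnt_this_core, struct job, worker and run_two_threads are hypothetical names, the loop is a placeholder workload, and on anything other than 32-bit ARM a monotonic clock stands in for PMCCNTR so that the example still builds.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Enable and reset the cycle counter on the core running this code.
 * ARM only; assumes user-mode PMU access was already granted on this core. */
static void enable_ccnt_this_core(void)
{
#if defined(__arm__)
    uint32_t pmcr;
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);  /* PMCR.E (enable), PMCR.C (reset cycles) */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31)); /* PMCNTENSET */
#endif
}

/* Read a tick source local to the calling thread's core. */
static uint32_t read_ticks(void)
{
#if defined(__arm__)
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v)); /* PMCCNTR */
    return v;
#else
    struct timespec ts;  /* stand-in so the sketch builds on other targets */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint32_t)(ts.tv_sec * 1000000000ull + ts.tv_nsec);
#endif
}

struct job {               /* hypothetical per-thread record */
    int cpu;               /* core to pin to */
    int pinned;            /* 1 if pinning succeeded */
    uint32_t ticks;        /* ticks spent in the workload */
    volatile unsigned sink;/* keeps the placeholder loop from being optimized out */
};

static void *worker(void *arg)
{
    struct job *j = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(j->cpu, &set);
    /* returns 0 on success, a positive error number on failure */
    j->pinned = (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

    enable_ccnt_this_core();            /* after pinning, so it hits the right core */
    uint32_t start = read_ticks();
    for (unsigned i = 0; i < 100000u; i++)
        j->sink += i;                   /* placeholder workload */
    j->ticks = read_ticks() - start;    /* unsigned math tolerates one wrap-around */
    return NULL;
}

/* Start both workers, wait for them, and leave the results in jobs[]. */
static int run_two_threads(struct job jobs[2])
{
    pthread_t t[2];
    int started = 0, rc = 0;
    for (int i = 0; i < 2; i++) {
        if (pthread_create(&t[i], NULL, worker, &jobs[i]) != 0) { rc = -1; break; }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    return rc;
}
```

With this layout main() never touches the PMU; it only collects jobs[0].ticks and jobs[1].ticks after the joins, one value per core.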
Thanks for the reply.
1. Yes, this code is trying to run two identical threads, one on each core.
2. Using a different 'len' is just the simpler case. I have also tried different applications such as FFT and motion estimation. They all show the issue I described.
3. PAPI is built on the perf_event interface, so the two give identical results. In fact, I first noticed this problem when using the perf_event approach. At first I suspected that perf_event could not profile the code correctly, but accessing the PMU directly through ARM instructions confirmed the same behavior.
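For comparison, a minimal sketch of counting cycles for the calling thread through the perf_event_open(2) system call, following the pattern in its man page. The function names open_thread_cycle_counter, count_cycles and spin are hypothetical, and the call may fail with -1 where the kernel or the perf_event_paranoid setting does not allow it.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a user-space cycle counter for the calling thread (pid = 0) on
 * whichever CPU it runs (cpu = -1). Returns a file descriptor or -1.
 * Threads created by the caller are not included unless attr.inherit is set. */
static int open_thread_cycle_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* count user-space cycles only */
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Count the cycles the calling thread spends in fn(arg).
 * Returns 0 and fills *cycles on success, -1 on any failure. */
static int count_cycles(void (*fn)(void *), void *arg, uint64_t *cycles)
{
    int fd = open_thread_cycle_counter();
    if (fd < 0)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    int ok = read(fd, cycles, sizeof(*cycles)) == (ssize_t)sizeof(*cycles);
    close(fd);
    return ok ? 0 : -1;
}

/* Placeholder workload: adds 0..n-1 into a volatile accumulator. */
static void spin(void *arg)
{
    unsigned n = *(unsigned *)arg;
    volatile unsigned acc = 0;
    for (unsigned i = 0; i < n; i++)
        acc += i;
}
```

Calling count_cycles(spin, &n, &cycles) from inside each worker thread, rather than once from main(), would give one count per thread.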
Thanks for the reply.
1. Yes. I built a kernel module that enables user-level PMU access on each core of the Cortex-A9. With a single core, the profiling results match those of the perf library and the PAPI tool set.
2. The instruction codes are taken from the example instructions in DS-5.