How to read PMU correctly for multi-thread on dual core Cortex-A9 under userspace

yunwu over 11 years ago

Hi,

I have enabled user-space PMU access on both cores of a Cortex-A9 by building a kernel module. I then follow the standard procedure for PMU counting:

1. Disable performance counters

2. Set cycle counter tick rate

3. Reset performance counters

4. Enable performance counters

5. Call function to profile

6. Disable performance counters

7. Read out performance counters

8. Check that performance counters did not overflow
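
The eight steps above could be sketched in C roughly as follows. This is only a sketch, assuming the ARMv7-A CP15 encodings for the PMU registers (PMCR, PMCNTENSET, PMCNTENCLR, PMOVSR, PMCCNTR) and that user-mode access has already been enabled on the core the code runs on. The names `pmu_profile`, `pmu_sample` and `busy_loop` are hypothetical and are not the helpers used in this thread. On non-ARM builds the register accessors fall back to a small stand-in based on `clock()`, purely so the control flow can be compiled off-target. Only the cycle counter is handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* PMCR control bits (ARMv7-A PMU). */
#define PMCR_E   (1u << 0)   /* global enable for all counters */
#define PMCR_P   (1u << 1)   /* reset the event counters */
#define PMCR_C   (1u << 2)   /* reset the cycle counter */
#define PMCR_D   (1u << 3)   /* cycle counter ticks once every 64 cycles */
#define CCNT_BIT (1u << 31)  /* cycle counter bit in enable/overflow registers */

#if defined(__arm__)
#define CP15_RD(name, enc) static inline uint32_t name(void) \
    { uint32_t v; __asm__ volatile("mrc p15, 0, %0, " enc : "=r"(v)); return v; }
#define CP15_WR(name, enc) static inline void name(uint32_t v) \
    { __asm__ volatile("mcr p15, 0, %0, " enc :: "r"(v)); }
CP15_RD(rd_pmcr,     "c9, c12, 0")   /* PMCR */
CP15_WR(wr_pmcr,     "c9, c12, 0")
CP15_WR(wr_cntenset, "c9, c12, 1")   /* PMCNTENSET */
CP15_WR(wr_cntenclr, "c9, c12, 2")   /* PMCNTENCLR */
CP15_RD(rd_ovsr,     "c9, c12, 3")   /* PMOVSR */
CP15_WR(wr_ovsr,     "c9, c12, 3")
CP15_RD(rd_ccnt,     "c9, c13, 0")   /* PMCCNTR */
#else
/* Stand-in for off-target builds only: clock() plays the cycle counter. */
static uint32_t sim_base;
static inline uint32_t rd_pmcr(void)          { return 0; }
static inline void wr_pmcr(uint32_t v)        { if (v & PMCR_C) sim_base = (uint32_t)clock(); }
static inline void wr_cntenset(uint32_t v)    { (void)v; }
static inline void wr_cntenclr(uint32_t v)    { (void)v; }
static inline uint32_t rd_ovsr(void)          { return 0; }
static inline void wr_ovsr(uint32_t v)        { (void)v; }
static inline uint32_t rd_ccnt(void)          { return (uint32_t)clock() - sim_base; }
#endif

typedef struct {
    uint32_t cycles;      /* cycle counter value after the run */
    int      overflowed;  /* non-zero if the cycle counter wrapped */
} pmu_sample;

/* Example workload (hypothetical); any void(void) function can be profiled. */
static void busy_loop(void)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 100000u; i++)
        sink += i;
}

/* Runs steps 1-8 around work() on the core that executes this function. */
static pmu_sample pmu_profile(void (*work)(void), int div64)
{
    pmu_sample s;
    uint32_t pmcr;

    assert(work != NULL);
    wr_cntenclr(CCNT_BIT);                            /* 1. disable */
    pmcr = rd_pmcr() & ~(PMCR_D | PMCR_P | PMCR_C);
    if (div64)
        pmcr |= PMCR_D;                               /* 2. tick rate */
    wr_pmcr(pmcr | PMCR_E | PMCR_P | PMCR_C);         /* 3. reset, global enable */
    wr_ovsr(CCNT_BIT);                                /* clear a stale overflow flag */
    wr_cntenset(CCNT_BIT);                            /* 4. enable */
    work();                                           /* 5. code under test */
    wr_cntenclr(CCNT_BIT);                            /* 6. disable */
    s.cycles = rd_ccnt();                             /* 7. read */
    s.overflowed = (rd_ovsr() & CCNT_BIT) != 0;       /* 8. overflow check */
    return s;
}
```

A call such as `pmu_sample s = pmu_profile(busy_loop, 0);` measures `busy_loop` on the calling core only. Each Cortex-A9 core has its own PMU, so the whole sequence has to run on the core whose activity is being measured.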

The program can successfully read the pmu counter values without overflow. The problem is:

When a single thread runs on one core, the cycle counter and the programmable counters give the right numbers (they match the PAPI results). However, when two separate threads run on the two cores, the value from the cycle counter is always the same, no matter which code is profiled (the cycle count reported by PAPI also stays the same across different profiled codes). For reference, I use the pthread library to create two threads and bind them to the two cores through processor affinity, and the platform is a Zynq-7000 SoC.

This happens not only when reading the PMU directly but also with PAPI. Did I miss some important step needed to read the PMU correctly when multiple threads run on a dual-core system?

Thanks.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The example profiling code is like this:

matrix_multply1(){
    cpu_set_t cpuset;
    cpu_set_t cpuget;

    CPU_ZERO(&cpuset);
    CPU_SET(1, &cpuset);    // pin this thread to CPU 1

    // pthread_setaffinity_np() and pthread_getaffinity_np() return 0 on
    // success and a positive error number on failure, never a negative value
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) != 0) {
        fprintf(stderr, "set thread affinity failed\n");
    }
    if (pthread_getaffinity_np(pthread_self(), sizeof(cpuget), &cpuget) != 0) {
        printf("can not get thread affinity!\n");
    }
    if (CPU_ISSET(1, &cpuget)) {
        printf("i am running on processor 1\n");
    }
    ......
}

matrix_multply2(){
    cpu_set_t cpuset;
    cpu_set_t cpuget;

    CPU_ZERO(&cpuset);
    CPU_SET(1, &cpuset);    // as posted: the same CPU index as in matrix_multply1

    // the affinity calls return 0 on success and a positive error number on failure
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) != 0) {
        fprintf(stderr, "set thread affinity failed\n");
    }
    if (pthread_getaffinity_np(pthread_self(), sizeof(cpuget), &cpuget) != 0) {
        printf("can not get thread affinity!\n");
    }
    if (CPU_ISSET(1, &cpuget)) {
        printf("i am running on processor 1\n");
    }
    ......
}

main(){
    ......
    disable_ccnt();
    disable_pmn(0);
    disable_pmn(1);
    disable_pmn(2);
    disable_pmn(3);
    disable_pmn(4);
    disable_pmn(5);

    reset_ccnt();
    reset_pmn();

    pmn_config(0,0x11); // total cycles
    pmn_config(1,0x68); // total instructions
    pmn_config(2,0x04); // data cache accesses
    pmn_config(3,0x03); // data cache misses
    pmn_config(4,0x10); // mispredicted branches
    pmn_config(5,0x12); // predictable branches

    enable_ccnt();
    enable_pmn(0);
    enable_pmn(1);
    enable_pmn(2);
    enable_pmn(3);
    enable_pmn(4);
    enable_pmn(5);

    // motion estimation dual
    for(i=0;i<len;i++)
    {
        pthread_create (&thread1, NULL, (void *) &matrix_multply1, (void *) &medata1);
        pthread_create (&thread2, NULL, (void *) &matrix_multply2, (void *) &medata2);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
    }

    disable_ccnt();
    disable_pmn(0);
    disable_pmn(1);
    disable_pmn(2);
    disable_pmn(3);
    disable_pmn(4);
    disable_pmn(5);

    time_end = rdtsc32();
    time_end1 = read_pmn(0);
    time_end2 = read_pmn(1);
    time_end3 = read_pmn(2);
    time_end4 = read_pmn(3);
    time_end5 = read_pmn(4);
    time_end6 = read_pmn(5);

    printf("cycle=%d\n instruction=%d\n cache access=%d\n"
           "cache miss=%d\n,branch miss-predicted=%d\n"
           "Predictable branches=%d\n",
           (time_end1 - time_start1)/len,
           (time_end2 - time_start2)/len,
           (time_end3 - time_start3)/len,
           (time_end4 - time_start4)/len,
           (time_end5 - time_start5)/len,
           (time_end6 - time_start6)/len);
}

For example, when I change the matrix size or the loop length, the cycle count stays the same, although the time measured with clock() does change.
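
One possible reading of the code above, not confirmed anywhere in this thread: each Cortex-A9 core has its own PMU, so the counters that main() enables and reads belong to whichever core main() happens to run on, not necessarily to the core the worker threads are pinned to. A minimal sketch of the alternative, in which each thread pins itself and samples a counter around its own work, might look like the following. The names `read_cycles`, `job`, `worker` and `run_pinned` are hypothetical. On ARM the sketch assumes user-mode PMU access and an already running cycle counter on every target core; on other builds `read_cycles()` falls back to `clock_gettime()` so the code can be compiled off-target.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: per-core cycle counter on ARMv7, monotonic clock elsewhere. */
static inline uint32_t read_cycles(void)
{
#if defined(__arm__)
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));  /* PMCCNTR */
    return v;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint32_t)((uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec);
#endif
}

typedef struct {
    int      cpu;      /* core this thread asks to run on */
    unsigned n;        /* workload size */
    int      pin_err;  /* 0 if pinning succeeded, otherwise an error number */
    uint32_t elapsed;  /* counter delta measured inside the thread */
    double   result;   /* workload output, keeps the loop from being removed */
} job;

static void *worker(void *arg)
{
    job *j = arg;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(j->cpu, &set);
    /* Returns 0 on success and a positive error number on failure. */
    j->pin_err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    uint32_t start = read_cycles();      /* sampled by the worker itself */
    double acc = 0.0;
    for (unsigned i = 0; i < j->n; i++)
        for (unsigned k = 0; k < j->n; k++)
            acc += (double)i * (double)k;
    j->result = acc;
    j->elapsed = read_cycles() - start;  /* unsigned subtraction tolerates one wrap */
    return NULL;
}

/* Runs two jobs in parallel and waits for both; returns 0 on success. */
static int run_pinned(job *a, job *b)
{
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, worker, a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, worker, b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```

With two jobs such as `{ .cpu = 0, .n = 200 }` and `{ .cpu = 1, .n = 400 }`, each `elapsed` field is measured by the thread that did the work, so it should change with the workload size, while main() only waits in `pthread_join`.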

0 Juan Gao over 11 years ago

Your code seems to be trying to achieve something different from your question. Are you benchmarking one main() program, run in user space with different 'len' settings, or do you want to measure the cycles of each thread? Have you considered using the kernel profiling tool perf?

0 Juan Gao over 11 years ago in reply to Juan Gao

Also, you need to ensure that access to the PMU from user space is allowed:
asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));

0 yunwu over 11 years ago in reply to Juan Gao

Hi,
Thanks for the reply.
1. Yes, this code tries to run two identical threads, one on each core.
2. Using a different 'len' is the simpler case. I also use different applications such as FFT and motion estimation. They all show the issue I described.
3. PAPI is based on the perf_event library, so the two behave identically. Actually, I found this problem while using the perf_event approach. At first I suspected that perf_event could not profile the code correctly. Reading the PMU directly through ARM instructions confirmed the same behavior.

+1 yunwu over 11 years ago in reply to Juan Gao

Hi,
Thanks for the reply.
1. Yes. I built a kernel module which enables PMU access from user level for each core on the Cortex-A9. The profiling results are confirmed to match those of the perf library and the PAPI tool-set when using a single core.
2. The instruction codes are taken from the example instructions in DS-5.