Hi,
I have enabled userspace PMU access by building a kernel module for both cores of a Cortex-A9. I then follow the standard PMU counting procedure:
1. Disable performance counters
2. Set cycle counter tick rate
3. Reset performance counters
4. Enable performance counters
5. Call function to profile
6. Disable performance counters
7. Read out performance counters
8. Check that performance counters did not overflow
The program successfully reads the PMU counter values without overflow. The problem is this:
When only a single thread runs on one core, the cycle counter and the programmable counters report the correct cycle counts (they match the PAPI results). However, when two separate threads run on both cores, the cycle-counter value is always the same no matter which code is being profiled (the cycle count also stays the same across different profiling codes with PAPI). For context, I am using the pthread library to create two threads, pinned to the two cores via processor affinity, and the platform is a ZYNQ 7000 SoC.
This happens not only with direct PMU access but also with PAPI. Am I missing an important step for reading the PMU correctly when multiple threads run on the dual core?
Thanks.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The example profiling code looks like this:
void *matrix_multply1(void *arg){
    cpu_set_t cpuset;
    cpu_set_t cpuget;
    CPU_ZERO(&cpuset);
    CPU_SET(1, &cpuset);
    /* pthread_setaffinity_np returns an error number, not -1 */
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) != 0) {
        fprintf(stderr, "set thread affinity failed\n");
    }
    if (pthread_getaffinity_np(pthread_self(), sizeof(cpuget), &cpuget) != 0) {
        fprintf(stderr, "can not get thread affinity!\n");
    }
    if (CPU_ISSET(1, &cpuget)) {
        printf("i am running on processor 1\n");
    }
    ......
}

void *matrix_multply2(void *arg){
    ......
}
int main(void){
    pthread_t thread1, thread2;
    int i;
    /* declarations of len, medata1/medata2, time_start*, time_end* elided */

    /* disable, configure, reset, enable (steps 1-4) */
    disable_ccnt();
    disable_pmn(0);
    disable_pmn(1);
    disable_pmn(2);
    disable_pmn(3);
    disable_pmn(4);
    disable_pmn(5);
    reset_ccnt();
    reset_pmn();
    pmn_config(0, 0x11); // total cycles
    pmn_config(1, 0x68); // total instructions
    pmn_config(2, 0x04); // data cache accesses
    pmn_config(3, 0x03); // data cache misses
    pmn_config(4, 0x10); // mis-predicted branches
    pmn_config(5, 0x12); // predictable branches
    enable_ccnt();
    enable_pmn(0);
    enable_pmn(1);
    enable_pmn(2);
    enable_pmn(3);
    enable_pmn(4);
    enable_pmn(5);

    // motion estimation, dual core
    for (i = 0; i < len; i++) {
        pthread_create(&thread1, NULL, (void *(*)(void *))matrix_multply1, &medata1);
        pthread_create(&thread2, NULL, (void *(*)(void *))matrix_multply2, &medata2);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
    }

    time_end  = rdtsc32();
    time_end1 = read_pmn(0);
    time_end2 = read_pmn(1);
    time_end3 = read_pmn(2);
    time_end4 = read_pmn(3);
    time_end5 = read_pmn(4);
    time_end6 = read_pmn(5);
    printf("cycle=%u\n instruction=%u\n cache access=%u\n"
           "cache miss=%u\n branch mis-predicted=%u\n"
           "predictable branches=%u\n",
           (time_end1 - time_start1) / len,
           (time_end2 - time_start2) / len,
           (time_end3 - time_start3) / len,
           (time_end4 - time_start4) / len,
           (time_end5 - time_start5) / len,
           (time_end6 - time_start6) / len);
}
For example, when I change the matrix size or the loop length, the reported cycle count stays the same, yet the timing measured with clock() changes.