Hi,
I have enabled userspace PMU access by building a kernel module for both cores of a Cortex-A9. I then follow the standard PMU counting procedure:
1. Disable performance counters
2. Set cycle counter tick rate
3. Reset performance counters
4. Enable performance counters
5. Call function to profile
6. Disable performance counters
7. Read out performance counters
8. Check that performance counters did not overflow
The program successfully reads the PMU counter values without overflow. The problem is this:
When only a single thread runs on one core, the cycle counter and the programmable counters report the correct cycle counts (they match the PAPI results). However, when two separate threads run on both cores, the cycle-counter value is always the same no matter which code is being profiled (the cycle count also stays the same across different profiling codes with PAPI). For context, I am using the pthread library to create two threads, pinned to the two cores via processor affinity, and the platform is a ZYNQ 7000 SoC.
This happens not only with direct PMU access but also with PAPI. Am I missing an important step for reading the PMU correctly when multiple threads run on the dual core?
Thanks.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The example profiling code looks like this:
void *matrix_multply1(void *arg){
    cpu_set_t cpuset;
    cpu_set_t cpuget;
    CPU_ZERO(&cpuset);
    CPU_SET(1, &cpuset);
    /* pthread_setaffinity_np returns an error number, not -1 */
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) != 0) {
        fprintf(stderr, "set thread affinity failed\n");
    }
    if (pthread_getaffinity_np(pthread_self(), sizeof(cpuget), &cpuget) != 0) {
        fprintf(stderr, "can not get thread affinity!\n");
    }
    if (CPU_ISSET(1, &cpuget)) {
        printf("i am running on processor 1\n");
    }
    ......
}

void *matrix_multply2(void *arg){
    ......
}
int main(void){
    pthread_t thread1, thread2;
    int i;
    /* declarations of len, medata1/medata2, time_start*, time_end* elided */

    /* disable, configure, reset, enable (steps 1-4) */
    disable_ccnt();
    disable_pmn(0);
    disable_pmn(1);
    disable_pmn(2);
    disable_pmn(3);
    disable_pmn(4);
    disable_pmn(5);
    reset_ccnt();
    reset_pmn();
    pmn_config(0, 0x11); // total cycles
    pmn_config(1, 0x68); // total instructions
    pmn_config(2, 0x04); // data cache accesses
    pmn_config(3, 0x03); // data cache misses
    pmn_config(4, 0x10); // mis-predicted branches
    pmn_config(5, 0x12); // predictable branches
    enable_ccnt();
    enable_pmn(0);
    enable_pmn(1);
    enable_pmn(2);
    enable_pmn(3);
    enable_pmn(4);
    enable_pmn(5);

    // motion estimation, dual core
    for (i = 0; i < len; i++) {
        pthread_create(&thread1, NULL, (void *(*)(void *))matrix_multply1, &medata1);
        pthread_create(&thread2, NULL, (void *(*)(void *))matrix_multply2, &medata2);
        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
    }

    time_end  = rdtsc32();
    time_end1 = read_pmn(0);
    time_end2 = read_pmn(1);
    time_end3 = read_pmn(2);
    time_end4 = read_pmn(3);
    time_end5 = read_pmn(4);
    time_end6 = read_pmn(5);
    printf("cycle=%u\n instruction=%u\n cache access=%u\n"
           "cache miss=%u\n branch mis-predicted=%u\n"
           "predictable branches=%u\n",
           (time_end1 - time_start1) / len,
           (time_end2 - time_start2) / len,
           (time_end3 - time_start3) / len,
           (time_end4 - time_start4) / len,
           (time_end5 - time_start5) / len,
           (time_end6 - time_start6) / len);
}
For example, when I change the matrix size or the loop length, the reported cycle count stays the same, yet the timing measured with clock() changes.