Hi,
I am programming a Raspberry Pi Model B (ARM1176) bare metal, in assembly and C. I need to calculate the number of clock cycles used to execute a piece of assembly code.
I am using the following code for the PMU counter:
From this, if I put
add r3,#3
in place of my code, I get r1=8 and r0=0, which seems correct, since the ARM11 has an 8-stage pipeline and so it takes 8 clock cycles to execute the instruction.
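The PMU code itself did not survive in this post. As a hedged sketch of what such code usually looks like on the ARM1176, the C below enables the cycle counter through the Performance Monitor Control Register (PMNC, CP15 c15/c12/0) and reads the Cycle Counter Register (CCR, c15/c12/1). The encodings and bit positions follow my reading of the ARM1176JZF-S TRM and should be checked against it. The names `ON_ARM1176`, `pmu_start`, `ccr_read` and `cycles_between` are hypothetical, and the host stub exists only so the logic compiles off-target.

```c
#include <stdint.h>

/* PMNC bit positions (assumed from the ARM1176JZF-S TRM). */
#define PMNC_E       (1u << 0)   /* enable all counters */
#define PMNC_P       (1u << 1)   /* reset the two event count registers */
#define PMNC_C       (1u << 2)   /* reset the cycle counter */
#define PMNC_D       (1u << 3)   /* if set, CCR counts every 64th cycle */
#define PMNC_CCR_OVF (1u << 10)  /* cycle counter overflow flag */

#ifdef ON_ARM1176  /* hypothetical macro: define when building for the Pi */
static inline void pmnc_write(uint32_t v)
{
    __asm__ volatile("mcr p15, 0, %0, c15, c12, 0" : : "r"(v) : "memory");
}

static inline uint32_t ccr_read(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c15, c12, 1" : "=r"(v) : : "memory");
    return v;
}
#else
/* Host stand-in so the helpers below can be compiled and tested. */
static uint32_t fake_ccr = 12345u;

static inline void pmnc_write(uint32_t v)
{
    if (v & PMNC_C)
        fake_ccr = 0;  /* mimic the C bit zeroing the cycle counter */
}

static inline uint32_t ccr_read(void) { return fake_ccr; }
#endif

/* Enable counting and zero the counters; D is left clear so the
 * cycle counter ticks once per core clock. */
static inline void pmu_start(void)
{
    pmnc_write(PMNC_E | PMNC_P | PMNC_C);
}

/* Elapsed cycles between two CCR reads. Unsigned subtraction gives the
 * right answer even if the 32-bit counter wrapped once in between. */
static inline uint32_t cycles_between(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target, the measured region goes between two `ccr_read()` calls (`t0 = ccr_read();`, code under test, `t1 = ccr_read();`), and `cycles_between(t0, t1)` is the count, which includes the cost of the reads themselves.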
But when I add more instructions, such as
add r4,#1
I get ridiculous results like r0=0 and r1=97, 96 or 94 on different runs (the value of r1 should at least be constant!).
I am using the UART to print the register values on minicom. I have attached my code files.
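Run-to-run variation like 97/96/94 is common when timing very short sequences, because cache, branch-predictor and memory state can differ between runs. One usual way to get a stable figure is to time the same region several times and report the minimum (and the spread) rather than a single sample. Below is a minimal sketch of the summarizing step using a hypothetical `summarize` helper; collecting the samples on the board would use whatever cycle-counter read the existing code already has.

```c
#include <stddef.h>
#include <stdint.h>

/* Fastest and slowest of a set of cycle-count samples. */
struct cycle_stats {
    uint32_t min;
    uint32_t max;
};

/* With n == 0 the result is {UINT32_MAX, 0}; callers should pass n >= 1. */
static struct cycle_stats summarize(const uint32_t *samples, size_t n)
{
    struct cycle_stats s = { UINT32_MAX, 0u };
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < s.min)
            s.min = samples[i];
        if (samples[i] > s.max)
            s.max = samples[i];
    }
    return s;
}
```

Discarding the first sample (which may pay for cold caches) before summarizing is also common. For the three values reported above, `summarize` gives min = 94 and max = 97.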
Thank you for your answers. They are a great help.
In your first answer, can you please explain what the Cache ON/OFF cases are and how you implemented them? Secondly, what is the reason we are getting non-linear results?
Thanks.
I use my Cortex-A9 board in a bare-metal environment.
By "Cache ON case" I mean measuring the performance with both the L1 instruction and data caches enabled.
By "Cache OFF case" I mean measuring the performance with both the L1 instruction and data caches disabled.
In the Cache OFF case, I guess the variation is bigger because of the many execution hazards.
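On both the ARM1176 and the Cortex-A9, the L1 caches are switched on and off through the System Control Register (CP15 c1/c0/0): bit 2 (C) enables the data cache and bit 12 (I) the instruction cache. The sketch below computes the new register value and, on target, writes it back; `ON_TARGET`, `sctlr_with_caches` and `set_l1_caches` are hypothetical names. It omits the cache invalidate (before enabling) and clean (before disabling) maintenance a real implementation needs, and, as far as I know, the data cache only takes effect once the MMU is enabled on these cores, so check the TRM for your part.

```c
#include <stdint.h>

/* SCTLR bits (same positions on ARMv6 and ARMv7-A). */
#define SCTLR_C (1u << 2)   /* L1 data cache enable */
#define SCTLR_I (1u << 12)  /* L1 instruction cache enable */

/* Return sctlr with both L1 cache enables set (on != 0) or cleared. */
static uint32_t sctlr_with_caches(uint32_t sctlr, int on)
{
    const uint32_t mask = SCTLR_C | SCTLR_I;
    return on ? (sctlr | mask) : (sctlr & ~mask);
}

#ifdef ON_TARGET  /* hypothetical macro: define when building for the board */
static inline uint32_t sctlr_read(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
    return v;
}

static inline void sctlr_write(uint32_t v)
{
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(v) : "memory");
    /* An instruction barrier should follow: ISB on ARMv7, or the
     * CP15 c7/c5/4 prefetch flush on ARMv6. */
}

/* Cache ON/OFF switch. Invalidate before enabling and clean before
 * disabling; that maintenance is omitted here. */
static void set_l1_caches(int on)
{
    sctlr_write(sctlr_with_caches(sctlr_read(), on));
}
#endif
```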
Regarding the non-linear results, I can think of two reasons.
First, as the Cortex-A9 is a two-way superscalar core, the cycle count would increase only once every two instructions.
Second, reading a timer usually introduces some error of its own.
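Both explanations can be checked with a little arithmetic. To separate the timer-read error from the code being measured, a common approach is to time an empty region first and subtract that overhead from each measurement; the 8 cycles reported for a single `add` may largely be this overhead. For the superscalar effect, an idealised two-issue core executes n independent instructions in about ceil(n/2) cycles, which is why the count would step up only every second instruction on the Cortex-A9. The ARM1176 in the Raspberry Pi is a single-issue core, so this particular effect should not apply there. `net_cycles` and `ideal_dual_issue_cycles` are hypothetical helpers, and the ceil(n/2) figure is a best-case model, not the Cortex-A9's documented timing.

```c
#include <stdint.h>

/* Cycles attributable to the code under test once the cost of reading
 * the counter (measured with an empty region) is removed. Clamped at 0
 * because a single noisy sample can come in below the overhead. */
static uint32_t net_cycles(uint32_t measured, uint32_t empty_overhead)
{
    return measured > empty_overhead ? measured - empty_overhead : 0u;
}

/* Best case for n independent instructions on a two-issue core:
 * two per cycle, i.e. ceil(n / 2), written to avoid overflow. */
static uint32_t ideal_dual_issue_cycles(uint32_t n)
{
    return n / 2u + (n & 1u);
}
```

With this model, going from 2 to 3 instructions adds a cycle but going from 3 to 4 does not, which matches the stepwise growth described above.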
Best regards,
Yasuhiko Koumoto.