PMU in arm11 results

Hi,

I am programming a Raspberry Pi Model B (ARM1176) bare metal, in assembly and C. I need to count the clock cycles taken to execute a piece of assembly code.

I am using the following code for PMU counter:

mov r0,#1
MCR p15, 0, r0, c15, c12, 0 @ Write Performance Monitor Control Register (enable counters)
/* Reset Cycle Counter */
mov r0,#5
MCR p15, 0, r0, c15, c12, 0 @ Write Performance Monitor Control Register (enable + reset cycle counter)
/* Measure */
MRC p15, 0, r0, c15, c12, 1 @ Read Cycle Counter Register
<MY CODES>
MRC p15, 0, r1, c15, c12, 1 @ Read Cycle Counter Register

From this if I have

add r3,#3

in place of my code, I get r1=8 and r0=0, which seems correct, since the ARM11 has 8 pipeline stages and it takes 8 clock cycles to execute it.

But when I add more instructions I am getting ridiculous results like

add r3,#3

add r4,#1

r0=0, r1=97/96/94 (the result in r1 should also be constant!)

I am using the UART to see the register values in minicom. I have attached my code files.

10261.zip
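
For readers following the setup code in the question, the two values written to the control register (1 and 5) can be read as bit flags. The sketch below assumes the bit layout described in the ARM1176JZF-S Technical Reference Manual (E in bit 0, P in bit 1, C in bit 2); the macro and function names are made up for illustration and should be checked against the manual before use.

```c
#include <stdint.h>

/* Assumed Performance Monitor Control Register bits (ARM1176JZF-S TRM).
   The names below are illustrative, not taken from any header. */
#define PMNC_ENABLE         (1u << 0) /* E: enable all counters */
#define PMNC_RESET_COUNTERS (1u << 1) /* P: reset Count Register 0 and 1 */
#define PMNC_RESET_CCNT     (1u << 2) /* C: reset the cycle counter */

/* Value for the first MCR in the question: just enable the counters. */
static inline uint32_t pmnc_enable_only(void)
{
    return PMNC_ENABLE;
}

/* Value for the second MCR: keep counters enabled and reset the cycle counter. */
static inline uint32_t pmnc_enable_and_reset_ccnt(void)
{
    return PMNC_ENABLE | PMNC_RESET_CCNT;
}
```

Under that assumption, `mov r0,#1` corresponds to `pmnc_enable_only()` and `mov r0,#5` to `pmnc_enable_and_reset_ccnt()`.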


0 Yasuhiko Koumoto over 8 years ago
Hi,

Because I don't have an ARM11 board, I tested the PMU on a Cortex-A9.
I measured the elapsed cycles with the following sequence.
MRC p15, 0, r0, c9, c13, 0 @ read cycle counter (ARMv7-A)
<N x add rn,#0>
MRC p15, 0, r1, c9, c13, 0 @ read cycle counter (ARMv7-A)
"MRC p15, 0, r0, c9, c13, 0" on ARMv7-A corresponds to "MRC p15, 0, r0, c15, c12, 1" on ARM11.
The results were as follows.

[Cache OFF case] (raw reading -> reading minus the N=0 reading)
N=0:  4 cycles ->  0
N=1:  4 cycles ->  0
N=2: 18 cycles -> 14
N=3: 20 cycles -> 16
N=4:  7 cycles ->  3
N=5: 21 cycles -> 17
N=6: 22 cycles -> 18
N=7: 22 cycles -> 18
N=8: 22 cycles -> 18
[Cache ON case] (raw reading -> reading minus the N=0 reading)
N=0:  4 cycles ->  0
N=1:  4 cycles ->  0
N=2:  6 cycles ->  2
N=3:  6 cycles ->  2
N=4:  7 cycles ->  3
N=5:  7 cycles ->  3
N=6:  7 cycles ->  3
N=7:  8 cycles ->  4
N=8:  8 cycles ->  4
In this experiment the results were not linear in N and showed some variation. The variation was especially large in the Cache OFF case.
Because the Cortex-A9 is superscalar, that is another reason the results are not linear.
I think the cycle counter is not appropriate for measuring events that take only a few cycles.
Best regards,
Yasuhiko Koumoto.
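
The right-hand column in the tables above is simply each raw reading minus the N=0 reading, which removes the fixed cost of the two counter reads. A minimal C sketch of that step, with a hypothetical helper name and the raw readings passed in as an array, could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Subtract the N=0 reading (the overhead of the two back-to-back counter
   reads) from every raw reading. raw[0] is assumed to be the N=0 case.
   The function name is illustrative only. */
static void subtract_baseline(const uint32_t *raw, uint32_t *net, size_t count)
{
    if (count == 0) {
        return; /* nothing to do, and raw[0] would be out of bounds */
    }
    const uint32_t baseline = raw[0]; /* copy first so raw and net may alias */
    for (size_t i = 0; i < count; i++) {
        net[i] = raw[i] - baseline;
    }
}
```

Feeding it the Cache ON readings {4, 4, 6, 6, 7, 7, 7, 8, 8} reproduces the net column {0, 0, 2, 2, 3, 3, 3, 4, 4}.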

0 Muhammad Ali over 8 years ago in reply to Yasuhiko Koumoto

Hi,
Can you suggest any alternative way to measure clock cycles/CPI of an assembly program?
Thank you.

+1 Yasuhiko Koumoto over 8 years ago in reply to Muhammad Ali

Hi,

I recommend measuring a large number of iterations of the event.
For example, how about the following?

MRC p15, 0, r0, c15, c12, 1 @ Read Cycle Counter Register
mov r2,#0x10000
1:
subs r2,r2,#1              @ empty loop (baseline)
bne 1b
MRC p15, 0, r1, c15, c12, 1 @ Read Cycle Counter Register
sub r4,r1,r0               @ r4 = baseline cycles (not r3, which the target below modifies)
MRC p15, 0, r0, c15, c12, 1 @ Read Cycle Counter Register
mov r2,#0x10000
1:
add r3,#1                  @ target instruction(s)
subs r2,r2,#1
bne 1b
MRC p15, 0, r1, c15, c12, 1 @ Read Cycle Counter Register
sub r0,r1,r0               @ cycles for loop + target
sub r0,r0,r4               @ subtract the empty-loop baseline
mov r0,r0,lsr #16          @ divide by 0x10000 (cycles per iteration)

Best regards,
Yasuhiko Koumoto.
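
The arithmetic at the end of the listing above can also be written in C, which may be handy if the counter values are passed out of the assembly routine. This is only a sketch: the function name and the idea of passing the four counter readings as arguments are assumptions for illustration, and the iteration count is fixed at 0x10000 as in the loops above.

```c
#include <stdint.h>

#define ITER_SHIFT 16u /* 0x10000 iterations, matching the assembly loops */

/* Cycles attributable to the target instruction(s) per loop iteration.
   Unsigned subtraction gives the right difference even if the 32-bit
   cycle counter wraps once between a start and an end reading. */
static uint32_t cycles_per_iteration(uint32_t empty_start, uint32_t empty_end,
                                     uint32_t target_start, uint32_t target_end)
{
    uint32_t empty_cycles  = empty_end - empty_start;   /* empty loop only */
    uint32_t target_cycles = target_end - target_start; /* loop + target   */
    return (target_cycles - empty_cycles) >> ITER_SHIFT; /* truncating divide */
}
```

The shift discards any fractional part, so a target that averages, say, 1.5 cycles per iteration would be reported as 1. If the target loop happens to measure fewer cycles than the empty loop, the unsigned difference wraps and the result is meaningless, so it is worth checking for that case in real use.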

0 Muhammad Ali over 8 years ago in reply to Yasuhiko Koumoto

Hi,
Thank you for your answers. They are a great help.
In your first answer, can you please explain what the Cache ON/OFF cases are and how you implemented them? Secondly, why are we getting non-linear results?
Thanks.
0 Yasuhiko Koumoto over 8 years ago in reply to Muhammad Ali

Hi,
I use my Cortex-A9 board in a bare-metal environment.
"Cache ON case" means the performance was measured with both the L1 instruction and data caches enabled.
"Cache OFF case" means the performance was measured with both the L1 instruction and data caches disabled.
In the Cache OFF case the variation is larger, I guess because of the many execution hazards.
Regarding the non-linear results, I can think of two reasons.
First, because the Cortex-A9 is a two-way superscalar core, the count increases only about every two instructions.
Second, there is usually some error in reading a timer.
Best regards,
Yasuhiko Koumoto.
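
As a rough cross-check of the superscalar explanation, the net figures from the earlier tables can be turned into an average cost per instruction. The helper below is a hypothetical sketch, not part of the original measurement code.

```c
/* Average cycles per instruction from a net cycle count (raw reading minus
   the N=0 overhead) and the number of instructions measured.
   Returns 0.0 for n_instructions == 0 to avoid dividing by zero. */
static double avg_cycles_per_instruction(unsigned net_cycles,
                                         unsigned n_instructions)
{
    if (n_instructions == 0) {
        return 0.0;
    }
    return (double)net_cycles / (double)n_instructions;
}
```

For the Cache ON case at N=8 (4 net cycles) this gives 0.5 cycles per add, which is consistent with two instructions issuing per cycle. For the Cache OFF case at N=8 (18 net cycles) it gives 2.25. These are single data points from very short sequences, so they illustrate the trend rather than prove it.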