PMU in arm11 results

Hi,

I am programming a Raspberry Pi Model B (ARM1176) bare metal, in assembly and C. I need to count the clock cycles used to execute a piece of assembly code.

I am using the following code for PMU counter:

  mov r0, #1
  MCR p15, 0, r0, c15, c12, 0   @ Write Performance Monitor Control Register (enable counters)
  /* Reset Cycle Counter */
  mov r0, #5
  MCR p15, 0, r0, c15, c12, 0   @ Write Performance Monitor Control Register (enable + reset cycle counter)
  /* Measure */
  MRC p15, 0, r0, c15, c12, 1   @ Read Cycle Counter Register
  <MY CODES>
  MRC p15, 0, r1, c15, c12, 1   @ Read Cycle Counter Register
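For readers following along in C, the two control values written above (1 and 5) can be read as combinations of bits in the ARM1176 Performance Monitor Control Register: bit 0 (E) enables the counters, bit 1 (P) resets the two event count registers, and bit 2 (C) resets the cycle counter. The sketch below only composes the value; the helper name `pmcr_compose` and the macro names are hypothetical, and the actual write would still be done with the MCR instruction shown above.

```c
#include <assert.h>
#include <stdint.h>

/* Low control bits of the ARM1176 Performance Monitor Control Register. */
#define PMCR_ENABLE        (1u << 0)  /* E: enable all counters */
#define PMCR_RESET_EVENTS  (1u << 1)  /* P: reset Count Register 0 and 1 */
#define PMCR_RESET_CYCLES  (1u << 2)  /* C: reset the Cycle Counter Register */
#define PMCR_KNOWN_BITS    (PMCR_ENABLE | PMCR_RESET_EVENTS | PMCR_RESET_CYCLES)

/* Hypothetical helper: build the value that would be written with
 * MCR p15, 0, Rd, c15, c12, 0. Only the three bits above are accepted. */
static inline uint32_t pmcr_compose(uint32_t flags)
{
    assert((flags & ~PMCR_KNOWN_BITS) == 0);
    return flags;
}
```

With these names, `pmcr_compose(PMCR_ENABLE)` gives the value 1 used in the first write, and `pmcr_compose(PMCR_ENABLE | PMCR_RESET_CYCLES)` gives the value 5 used in the second.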

With this, if I put

add r3,#3

in place of my code, I get r1=8 and r0=0, which seems correct, since the ARM11 has 8 pipeline stages and so takes 8 clock cycles to execute it.

But when I add more instructions, I get ridiculous results. For example, with

add r3,#3

add r4,#1

I get r0=0, r1=97/96/94 (r1 should also be constant!)

I am using the UART to view the register values in minicom. I have attached my code files.

10261.zip
  • Hi,


    Because I don't have an ARM11 board, I tested the PMU on a Cortex-A9 instead.
    I measured the elapsed cycles with the following sequence:

    MRC p15, 0, r0, c9, c13, 0   @ read the ARMv7-A cycle counter (PMCCNTR)
    <N x add rn,#0>
    MRC p15, 0, r1, c9, c13, 0   @ read the ARMv7-A cycle counter (PMCCNTR)

    "MRC p15, 0, r0, c9, c13, 0" on ARMv7-A is the equivalent of "MRC p15, 0, r0, c15, c12, 1" on ARM11: both read the cycle counter.

    The results were as follows.


    [Cache OFF case]
    N=0: 4 cycles -> 0 (baseline; the N=0 count is subtracted from every row)
    N=1: 4 cycles -> 0
    N=2:18 cycles -> 14
    N=3:20 cycles -> 16
    N=4: 7 cycles -> 3
    N=5:21 cycles -> 17
    N=6:22 cycles -> 18
    N=7:22 cycles -> 18
    N=8:22 cycles -> 18

    [Cache ON case]
    N=0: 4 cycles -> 0 (baseline; the N=0 count is subtracted from every row)
    N=1: 4 cycles -> 0
    N=2: 6 cycles -> 2
    N=3: 6 cycles -> 2
    N=4: 7 cycles -> 3
    N=5: 7 cycles -> 3
    N=6: 7 cycles -> 3
    N=7: 8 cycles -> 4
    N=8: 8 cycles -> 4
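
The arrow columns in the two tables are the raw readings minus the N=0 reading, which removes the cost of the two counter reads themselves. A minimal C sketch of that subtraction follows; the function name `subtract_baseline` and the array names are hypothetical, and the numbers are copied from the tables above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subtract the empty-measurement overhead (raw[0], the N=0 reading)
 * from every raw reading. raw and net must each hold count entries. */
static void subtract_baseline(const uint32_t *raw, uint32_t *net, size_t count)
{
    assert(count > 0);
    const uint32_t baseline = raw[0];
    for (size_t i = 0; i < count; ++i) {
        /* Clamp at zero in case a reading falls below the baseline. */
        net[i] = raw[i] > baseline ? raw[i] - baseline : 0;
    }
}

/* Raw cycle counts from the tables above, indexed by N = 0..8. */
static const uint32_t cache_off_raw[9] = {4, 4, 18, 20, 7, 21, 22, 22, 22};
static const uint32_t cache_on_raw[9]  = {4, 4, 6, 6, 7, 7, 7, 8, 8};
```

Calling `subtract_baseline(cache_on_raw, net, 9)` reproduces the Cache ON arrow column. For N=8 that leaves 4 cycles for 8 add instructions, about 0.5 cycles per instruction, which would be consistent with the Cortex-A9 issuing two instructions per cycle.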

    In this experiment the results were not linear in N and showed some variation, which was especially large in the Cache OFF case.
    Because the Cortex-A9 is superscalar, that is another reason the results are not linear.
    I think the performance cycle counter is not appropriate for measuring events that last only a few cycles.

    Best regards,
    Yasuhiko Koumoto.
