Performance Counters on ARM1176JZF-S

Note: This was originally posted on 14th August 2013 at http://forums.arm.com

Hi,

I have a question about the performance counters on an ARM1176JZF-S. Does the event 0xB count L1 or L2 data cache misses?
  • Note: This was originally posted on 19th August 2013 at http://forums.arm.com


    Are you running bare-metal? You can't typically access the PMU directly from user space under Linux.

    It is possible to access the PMU directly from user space if a specific bit in the "Secure User and Non-secure Access Validation Control Register" is set in kernel mode (see http://sandsoftwares...ent-and-tuning/).


    You should probably make "c" and/or "a" volatile; even at a low optimization level the compiler might otherwise eliminate your code.

    I just tried that, but it didn't make a difference. Also, the last statement "printf("%d\n",c);" should prevent a and c from being eliminated.
  • Note: This was originally posted on 21st August 2013 at http://forums.arm.com


    If you are running Linux or some other "big OS" then it is likely that your application is getting interrupted while it is running (timer threads, interrupts, background processes, etc). The counters are global to the core so you may well find random "other stuff" runs between your counter reset and the counter sampling if you were pre-empted between the two.

    /proc/sys/kernel/sched_min_granularity_ns is set to 750,000 ns. Assuming that a cache miss takes no more than, say, 500 ns, and there are about 1024 memory accesses, then most iterations of the loop shouldn't get interrupted.
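
    The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic; the function names are made up for illustration, and the 500 ns miss cost is the guess from the post, not a measured value. It bounds preemption by the scheduler only and says nothing about hardware interrupts, which can still arrive mid-loop.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the loop duration, in nanoseconds, if every access misses.
 * Hypothetical helper; the inputs are estimates, not measurements. */
static uint64_t worst_case_loop_ns(uint64_t accesses, uint64_t miss_ns)
{
    /* Guard against overflow of the multiplication. */
    assert(miss_ns == 0 || accesses <= UINT64_MAX / miss_ns);
    return accesses * miss_ns;
}

/* Nonzero if the whole loop fits inside one minimum scheduling slice. */
static int fits_in_slice(uint64_t loop_ns, uint64_t slice_ns)
{
    return loop_ns <= slice_ns;
}
```

    With 1024 accesses at 500 ns each, worst_case_loop_ns returns 512,000 ns, which is below the 750,000 ns granularity, so fits_in_slice reports that the loop fits.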


    Secondly, it is always worth checking the disassembly to confirm that the instruction ordering looks sensible. The "volatile" in "asm volatile" ensures the compiler doesn't optimize the statement out, but it provides no compiler barrier or memory barrier semantics, so the compiler might still reorder instructions around it. With -O0 that is unlikely, but always worth checking. Adding "memory" to the clobber list should stop that happening.

    I have taken a look at the disassembly, and I couldn't find any unexpected reorderings.
  • Note: This was originally posted on 20th August 2013 at http://forums.arm.com

    If you are running Linux or some other "big OS" then it is likely that your application is getting interrupted while it is running (timer threads, interrupts, background processes, etc). The counters are global to the core so you may well find random "other stuff" runs between your counter reset and the counter sampling if you were pre-empted between the two.

    Some of the newer kernels include support for hardware counters in the perf infrastructure, and that can include support for context switching the counters when processes change.

    Secondly, it is always worth checking the disassembly to confirm that the instruction ordering looks sensible. The "volatile" in "asm volatile" ensures the compiler doesn't optimize the statement out, but it provides no compiler barrier or memory barrier semantics, so the compiler might still reorder instructions around it. With -O0 that is unlikely, but always worth checking. Adding "memory" to the clobber list should stop that happening.
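
    A minimal sketch of the "memory" clobber, assuming GCC or Clang extended asm syntax. The empty asm statements stand in for the counter reset and counter read, and the macro and function names are invented for this example. Note that this is a compiler barrier only; it emits no hardware barrier instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Empty asm with a "memory" clobber: the compiler must assume memory may be
 * read or written here, so it cannot move loads or stores across this point. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* Sum a buffer with the memory accesses fenced on both sides.
 * In a real measurement the counter reset would sit at the first barrier
 * and the counter read at the second. */
static uint32_t sum_bytes_fenced(const uint8_t *buf, size_t n)
{
    assert(buf != NULL);
    uint32_t sum = 0;

    COMPILER_BARRIER();  /* loads below cannot be hoisted above this line */
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    COMPILER_BARRIER();  /* loads above cannot be sunk below this line */

    return sum;
}
```

    The same effect can be had by putting "memory" in the clobber list of the asm statements that actually access the counters, rather than using separate empty statements.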
  • Note: This was originally posted on 15th August 2013 at http://forums.arm.com

    The ARM1176JZ(F)-S doesn't have a built-in L2 cache, so any cache-related stats from the processor's PMU will be about the L1 caches.

    It's quite possible there's a non-integrated L2 cache (I'm guessing you're using a Raspberry Pi, but I'm afraid I don't know whether the chip on that board has one). If one is present, it might have its own performance counters.
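
    For reference, here is a sketch of how event 0xB could be selected on counter 0. The bit positions and CP15 encodings are my reading of the ARM1176JZF-S Technical Reference Manual (and match what the Linux ARMv6 perf code uses), so treat them as assumptions to check against the manual. The macro and function names are made up. The register accessors only compile on 32-bit ARM and only work in a privileged mode, or in user mode once access has been enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Performance Monitor Control Register fields (assumed from the TRM). */
#define PMNC_ENABLE       (1u << 0)  /* E: enable all counters */
#define PMNC_RESET_COUNT  (1u << 1)  /* P: reset Count Registers 0 and 1 */
#define PMNC_RESET_CYCLE  (1u << 2)  /* C: reset the cycle counter */
#define PMNC_EVT0_SHIFT   20         /* EvtCount0, bits [27:20] */
#define PMNC_EVT1_SHIFT   12         /* EvtCount1, bits [19:12] */

#define EVT_DCACHE_ACCESS 0x0Au      /* L1 data cache access */
#define EVT_DCACHE_MISS   0x0Bu      /* L1 data cache miss */

/* Build a control value that selects two events, resets and enables. */
static uint32_t pmnc_value(uint32_t evt0, uint32_t evt1)
{
    assert(evt0 <= 0xFFu && evt1 <= 0xFFu);  /* event fields are 8 bits wide */
    return (evt0 << PMNC_EVT0_SHIFT) | (evt1 << PMNC_EVT1_SHIFT) |
           PMNC_RESET_CYCLE | PMNC_RESET_COUNT | PMNC_ENABLE;
}

#if defined(__arm__)
/* Write the Performance Monitor Control Register (c15, c12, 0). */
static inline void pmnc_write(uint32_t v)
{
    __asm__ volatile("mcr p15, 0, %0, c15, c12, 0" : : "r"(v) : "memory");
}

/* Read Count Register 0 (c15, c12, 2). */
static inline uint32_t count0_read(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c15, c12, 2" : "=r"(v) : : "memory");
    return v;
}
#endif
```

    For example, pmnc_value(EVT_DCACHE_MISS, EVT_DCACHE_ACCESS) gives 0x00B0A007: misses on counter 0, accesses on counter 1, all counters reset and enabled.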
  • Note: This was originally posted on 19th August 2013 at http://forums.arm.com

    Couple of thoughts...

    Are you running bare-metal? You can't typically access the PMU directly from user space under Linux.

    You should probably make "c" and/or "a" volatile; even at a low optimization level the compiler might otherwise eliminate your code.

    I'd suggest taking a look at the assembler code the compiler has generated, to make sure it's what you expected.