I am running code in ADStudio using the Fixed Virtual Platforms (FVP) simulator, so no hardware board is connected.
I am trying to profile a sub-routine, so I count cycles for a piece of code:
int64_t prev, curr, delta;
asm volatile("isb;mrs %0, pmccntr_el0" : "=r"(prev));
// function body
asm volatile("isb;mrs %0, pmccntr_el0" : "=r"(curr));
delta = curr - prev;
My compiler settings are --target=aarch64-arm-none-eabi -march=armv8-a -mcpu=cortex-a53.
I wanted to check whether the compiler uses NEON instructions:
#ifdef __aarch64__
printf("--- THIS IS ARCH64 \n");
#endif
#ifdef __ARM_NEON__
printf("--- THIS IS NEON \n");
#endif
But it seems that it is not using NEON.
1) Is my define __ARM_NEON__ wrong?
2) What is the default -gfpu?
3) How do I force NEON with -gfpu?
4) When I set -gfpu=none my cycle count is THE SAME as the default one. I find this rather strange: shouldn't the math-heavy code be much slower? Is there an explanation?
Thanks.
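On the macro question, here is a minimal sketch of a compile-time check. It assumes the ACLE spelling `__ARM_NEON` (no trailing underscores) is the one the toolchain defines for AArch64; the trailing-underscore form is an older spelling that some compilers may not define for this target, so the sketch tests both. The function names are hypothetical and are not taken from the original code.

```c
#include <stdio.h>

/* Returns 1 when the compiler advertises Advanced SIMD (NEON) through
 * either macro spelling, otherwise 0. This only reflects the compile-time
 * target features; it does not prove that NEON instructions were emitted. */
static int neon_enabled_at_compile_time(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Prints which of the target macros the compiler defined for this build. */
static void report_target_features(void)
{
#if defined(__aarch64__)
    printf("--- THIS IS AARCH64\n");
#endif
    printf("--- NEON macro %s\n",
           neon_enabled_at_compile_time() ? "defined" : "not defined");
}
```

Calling `report_target_features()` once at startup shows what the compiler was configured for. To confirm whether vector instructions are actually generated, inspecting the disassembly of the routine is the more direct check.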
Thanks Ronan.
I did enable the PMCCNTR. It worked fine, showed some cycle count numbers that made sense.
Do you know how inaccurate the FVP is when it comes to the cycle counter? Just an informed guess will do. Thanks.
The Fast Model technology that the FVPs are built on doesn't have the concept of cycles. There is some limited timing annotation you can add to the model, but at the expense of the model's performance (some of it requires access to the full Fast Model tooling): https://developer.arm.com/documentation/100965/1190/timing-annotation
How inaccurate is the model? The number is approximately the number of instructions. If you are running small apps that would likely fit in the cache, then it is reasonably close (maybe ~20%) to the real number. If you are dealing with a larger system where cache hits and misses, the L2 cache and so on come into play, then the numbers diverge further. For this reason, I tend to use it as a first-pass relative comparison between implementations, rather than as an absolute number.
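To illustrate the relative-comparison approach, here is a minimal sketch that turns two counter deltas (measured the same way as `delta` in the question) into a ratio. The helper names `relative_cost` and `print_comparison` are hypothetical, and the sketch assumes both deltas were taken on the same model with the same configuration.

```c
#include <stdint.h>
#include <stdio.h>

/* Ratio of the candidate's count to the baseline's count.
 * A value below 1.0 means the candidate used fewer counted cycles.
 * Returns 0.0 when the baseline is 0, because the ratio is undefined. */
static double relative_cost(uint64_t baseline_delta, uint64_t candidate_delta)
{
    if (baseline_delta == 0) {
        return 0.0;
    }
    return (double)candidate_delta / (double)baseline_delta;
}

/* Prints both raw counts and the ratio for one pair of implementations. */
static void print_comparison(const char *name,
                             uint64_t baseline_delta,
                             uint64_t candidate_delta)
{
    printf("%s: baseline=%llu candidate=%llu ratio=%.2f\n",
           name,
           (unsigned long long)baseline_delta,
           (unsigned long long)candidate_delta,
           relative_cost(baseline_delta, candidate_delta));
}
```

Since the count on the FVP is roughly an instruction count, a ratio like this says more about how much work each implementation executes than about how long it would take on real hardware.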