I'm looking for a tool that lets me iterate faster on my ARM NEON optimizations entirely in software, i.e. without using any hardware or dev boards. I came across Arm Development Studio and its Fixed Virtual Platforms (FVPs). I am not very particular about cycle count accuracy compared to real hardware. As long as I can get consistent cycle counts across multiple runs of the simulation, that will be sufficient for me to optimize my code.
It would be good if I could select a Cortex-A series processor (say the A53 for now) and a memory model for the DRAM to go with it.
The FVP models replicate the PMU registers (specifically PMCCNTR_EL0 for your needs), but the numbers generated there cannot be relied on for any sort of accuracy, since FVPs are functional models rather than cycle-accurate ones. If you simply wish to see relative improvement in your code through optimization, you may get some limited benefit from them.
https://developer.arm.com/documentation/ddi0500/e/performance-monitor-unit/aarch64-pmu-register-summary
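As a rough sketch of how the cycle counter could be sampled from bare-metal C, assuming the code runs in AArch64 state at EL1 or higher and is built with GCC or Clang (for the inline assembly syntax). The helper names pmu_cycle_counter_enable, pmu_read_cycles and pmu_elapsed are made up for illustration and are not part of any Arm library. The event filter (PMCCFILTR_EL0) and EL0 access (PMUSERENR_EL0) are left at their current values here, so they may need setting up depending on where your code runs.

```c
#include <stdint.h>

/* Elapsed cycles between two 64-bit counter samples. Unsigned
   subtraction stays correct across a single counter wrap. */
static inline uint64_t pmu_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}

#if defined(__aarch64__)
/* Hypothetical helper: reset and enable the cycle counter.
   Assumes EL1 or higher; at EL0 these accesses trap unless
   PMUSERENR_EL0 has been configured to allow them. */
static inline void pmu_cycle_counter_enable(void)
{
    uint64_t pmcr;
    __asm__ volatile("mrs %0, pmcr_el0" : "=r"(pmcr));
    pmcr |= (1u << 0)   /* E:  enable counters            */
          | (1u << 2)   /* C:  reset the cycle counter    */
          | (1u << 6);  /* LC: 64-bit cycle counter       */
    __asm__ volatile("msr pmcr_el0, %0" : : "r"(pmcr));
    /* Bit 31 of PMCNTENSET_EL0 enables PMCCNTR_EL0. */
    __asm__ volatile("msr pmcntenset_el0, %0" : : "r"((uint64_t)1 << 31));
    __asm__ volatile("isb");
}

/* Hypothetical helper: sample PMCCNTR_EL0. The ISB keeps earlier
   instructions from drifting past the read. */
static inline uint64_t pmu_read_cycles(void)
{
    uint64_t cycles;
    __asm__ volatile("isb\n\tmrs %0, pmccntr_el0" : "=r"(cycles) : : "memory");
    return cycles;
}
#endif
```

A typical use would be to call pmu_cycle_counter_enable() once, take pmu_read_cycles() before and after the NEON kernel, and compare pmu_elapsed() between the baseline and optimized versions. Treat the result as a relative indicator only, for the reasons given above.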
You may find the following documents useful:
https://developer.arm.com/documentation/102474
https://developer.arm.com/documentation/102467
PS - there is also a PMU_AArch64 example provided with Arm Development Studio which you may find useful.
Thanks Ronan. This is really useful.