I am using the timer counter (CNTPCT_EL0, from the ARMv8 example shipped with DS-5 Ultimate) to profile sample code that runs on an ARMv8 processor.
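For context, the measurement is roughly equivalent to the sketch below. This is not the DS-5 example code itself, and the helper names `read_cntpct`, `read_cntfrq`, and `ticks_to_ns` are hypothetical. CNTPCT_EL0 increments at the fixed system counter frequency reported in CNTFRQ_EL0, not in core clock cycles, so the sketch converts ticks to time instead of treating them as cycle counts. It also assumes the code runs at an exception level that is allowed to read the counter.

```c
#include <assert.h>
#include <stdint.h>

/* Read the physical counter (CNTPCT_EL0).
 * The ISB keeps the read from being executed early relative to
 * preceding instructions. On non-AArch64 hosts this stub returns 0
 * so the sketch still compiles (assumption for illustration only). */
static inline uint64_t read_cntpct(void)
{
#if defined(__aarch64__)
    uint64_t ticks;
    __asm__ volatile("isb\n\tmrs %0, cntpct_el0" : "=r"(ticks) : : "memory");
    return ticks;
#else
    return 0;
#endif
}

/* Read the counter frequency in Hz (CNTFRQ_EL0).
 * Returns 0 on non-AArch64 hosts (stub, see above). */
static inline uint64_t read_cntfrq(void)
{
#if defined(__aarch64__)
    uint64_t hz;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(hz));
    return hz;
#else
    return 0;
#endif
}

/* Convert a tick delta to nanoseconds.
 * Whole seconds and the remainder are handled separately so the
 * multiplication does not overflow for realistic counter frequencies. */
static inline uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    uint64_t secs = ticks / freq_hz;
    uint64_t rem  = ticks % freq_hz;
    return secs * 1000000000ULL + (rem * 1000000000ULL) / freq_hz;
}
```

A region is timed by reading the counter before and after it and passing the difference, together with `read_cntfrq()`, to `ticks_to_ns`.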
The measured counts are much lower than I expected, so my code appears to be very well optimized. I was expecting larger values because of the long latencies of vector instructions on ARMv7. I have read that ARMv8 processors use out-of-order execution, where the processor reorders instructions at run time for better performance. Could this be the reason?
How much can I rely on the DS-5 counter? Has anyone compared performance between the ARM Fast Model and real hardware?
Are there any documents that describe the instruction cycle timings of the ARMv8 ARM and NEON instruction sets?