I am using the timer counter (CNTPCT_EL0, from the DS-5 Ultimate ARMv8 example) to profile sample code that runs on an ARMv8 processor.
The counter shows a very good result for my optimized code. I was expecting a larger value (because of the large delay of vector instructions on ARMv7). I have read that ARMv8 processors have out-of-order execution: the processor rearranges instructions at run time for better performance. Could this be the reason?
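For context, here is a minimal sketch of the kind of measurement I mean. It is not the actual DS-5 example code. The names `read_counter`, `measure_ticks`, `sample_work` and the `BARE_METAL_AARCH64` macro are hypothetical and only for illustration. The CNTPCT_EL0 path assumes a bare-metal AArch64 build where the register is accessible at the current exception level; otherwise the sketch falls back to `clock_gettime`. Note that CNTPCT_EL0 ticks at the system counter frequency reported in CNTFRQ_EL0, which is generally not the core clock.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Read a monotonically increasing tick count.
 * BARE_METAL_AARCH64 is a hypothetical macro: define it only when building
 * for a target where CNTPCT_EL0 is readable at the current exception level. */
static inline uint64_t read_counter(void)
{
#if defined(__aarch64__) && defined(BARE_METAL_AARCH64)
    uint64_t ticks;
    /* ISB keeps the counter read from being executed ahead of
     * earlier instructions on an out-of-order core. */
    __asm__ volatile("isb\n\tmrs %0, cntpct_el0" : "=r"(ticks) : : "memory");
    return ticks;
#else
    /* Portable fallback (assumption): nanoseconds from a monotonic clock. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

typedef void (*work_fn)(void);

/* Return the number of counter ticks elapsed while fn() runs. */
static uint64_t measure_ticks(work_fn fn)
{
    assert(fn != NULL);
    uint64_t start = read_counter();
    fn();
    uint64_t end = read_counter();
    return end - start;
}

/* Hypothetical workload standing in for the code being profiled. */
static volatile uint32_t sink;

static void sample_work(void)
{
    for (uint32_t i = 0; i < 1000; ++i) {
        sink += i;
    }
}
```

Calling `measure_ticks(sample_work)` returns the elapsed ticks for one run; repeating the measurement and taking the minimum or median should reduce noise from caches and interrupts.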
How much can I rely on the DS-5 counter? Has anyone compared performance between the ARM Fast Model and real hardware?
Are there any documents that describe the instruction cycle timings of the ARMv8 ARM and NEON instruction sets?
Thanks in advance.