Profiling using DS-5

Hi,

I am using the timer counter (CNTPCT_EL0, as in the ARMv8 example shipped with DS-5 Ultimate) to profile some sample code that runs on an ARMv8 processor.
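For reference, here is a minimal sketch of the kind of measurement I mean. It is not the DS-5 example itself. The helper names (read_cntpct, read_cntfrq, ticks_to_ns, time_call_ns) are made up for illustration, and it assumes the code runs at an exception level where CNTPCT_EL0 is readable. Note that CNTPCT_EL0 ticks at the fixed system counter frequency reported by CNTFRQ_EL0, not at the CPU clock, so the raw difference is elapsed time rather than core cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick delta to nanoseconds, given the counter frequency in Hz.
   Whole seconds and the remainder are handled separately so that the
   multiplication does not overflow for typical counter frequencies. */
static inline uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    uint64_t secs = ticks / freq_hz;
    uint64_t rem  = ticks % freq_hz;
    return secs * 1000000000ULL + (rem * 1000000000ULL) / freq_hz;
}

#if defined(__aarch64__)
/* Read the physical counter. The ISB keeps the read from being
   executed ahead of earlier instructions. Assumption: access to
   CNTPCT_EL0 is permitted at the current exception level. */
static inline uint64_t read_cntpct(void)
{
    uint64_t v;
    __asm__ volatile("isb\n\tmrs %0, cntpct_el0" : "=r"(v) : : "memory");
    return v;
}

/* Read the system counter frequency in Hz. */
static inline uint64_t read_cntfrq(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(v));
    return v;
}

/* Hypothetical helper: elapsed time of one call to fn(), in nanoseconds. */
static inline uint64_t time_call_ns(void (*fn)(void))
{
    uint64_t start = read_cntpct();
    fn();
    uint64_t end = read_cntpct();
    return ticks_to_ns(end - start, read_cntfrq());
}
#endif
```

Because the counter usually runs much slower than the core clock, short code sequences can show very small tick differences, so it probably makes sense to time many iterations and divide.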

The measured counts show a very good speedup for my code, better than I expected. I was expecting a larger value because of the long latency of vector instructions on ARMv7. I have read that ARMv8 processors have out-of-order execution, where the processor rearranges instructions at run time for better performance. Could this be the reason?

How much can I rely on the DS-5 counter? Has anyone compared performance between the ARM Fast Model and real hardware?

Are there any documents that describe the instruction cycle timings of the ARMv8 ARM and NEON instruction sets?

Thanks in advance.