R5 vs A9 Performance

Hello guys,
I've been running the same code (available here: https://gist.github.com/poz1/1714ddd68da5816624d6867ad6cc5d98 ) on an R5 board and an A9 board.
Optimisations are enabled, and my goal was to find the "right clock" for the A9, i.e. the frequency at which it matches the performance of the R5.

I know the two cores are conceptually different, but I was expecting the A9 (at 650 MHz) to be faster than the R5 (at 500 MHz).

Instead, these are the outputs I got:

- R5

Starting computation
Output took 9879627907556208991 clock cycles.
Output took 32976237.10 us.

- A9

Starting computation
Output took 36834640184 clock cycles.
Output took 56668677.21 us.

I am puzzled because the R5 reports far more clock cycles but takes only a little over half the time to complete.
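
As a quick sanity check, dividing the reported cycles by the reported microseconds should give the clock in MHz. Below is a minimal Python sketch of that arithmetic, using only the figures printed above and the nominal clocks already mentioned (500 MHz for the R5, 650 MHz for the A9). The helper name `implied_mhz` is just something made up for this check.

```python
# Sanity check: cycles / microseconds should equal the clock in MHz.
# The cycle counts and times are copied from the outputs above; the
# nominal clocks (500 MHz and 650 MHz) are the ones stated in this post.


def implied_mhz(cycles: int, micros: float) -> float:
    """Clock frequency in MHz implied by a cycle count and an elapsed time."""
    return cycles / micros


R5_CYCLES, R5_MICROS, R5_NOMINAL_MHZ = 9879627907556208991, 32976237.10, 500
A9_CYCLES, A9_MICROS, A9_NOMINAL_MHZ = 36834640184, 56668677.21, 650

r5_mhz = implied_mhz(R5_CYCLES, R5_MICROS)
a9_mhz = implied_mhz(A9_CYCLES, A9_MICROS)

# Cycle count the R5 would report if both the elapsed time and the
# 500 MHz nominal clock were correct.
r5_expected_cycles = R5_MICROS * R5_NOMINAL_MHZ

print(f"R5 implied clock: {r5_mhz:.3e} MHz (nominal {R5_NOMINAL_MHZ} MHz)")
print(f"A9 implied clock: {a9_mhz:.3f} MHz (nominal {A9_NOMINAL_MHZ} MHz)")
print(f"R5 cycles expected at nominal clock: {r5_expected_cycles:.0f}")
```

On the A9 the two figures agree with a 650 MHz clock. On the R5 they are off by many orders of magnitude, so the R5 cycle count (or the way it is printed) may itself be wrong. That is only a guess on my part; I have not verified it.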

Do you have any idea how this could be explained?
Thank you :)

