Hello guys,
I've been running the same code (you can find it here: https://gist.github.com/poz1/1714ddd68da5816624d6867ad6cc5d98 ) on an R5 board and an A9 board. Optimisations are enabled, and my goal was to find the "right clock" for the A9 so that it matches the performance of the R5. I know they are conceptually different, but I was expecting the A9 (at 650 MHz) to be faster than the R5 (at 500 MHz).
Instead the outputs I got are:
- R5
Starting computation
Output took 9879627907556208991 clock cycles.
Output took 32976237.10 us.
- A9
Starting computation
Output took 36834640184 clock cycles.
Output took 56668677.21 us.
I am puzzled because the R5 uses many more clock cycles but takes roughly half the time (???) to complete. Do you have any idea how this could be explained?
Thank you :)
There are a lot of "uncertainties" in this code.
For example, XTime. It looks like you are running the R5 test on an UltraScale+ (US+) and the A9 on a ZYNQ 7000.
Does XTime really give the number of CPU cycles? I rather think it counts timer cycles, and those are likely different on different boards.
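One quick way to test whether a reported count can really be CPU cycles is to divide it by the reported elapsed time and see which frequency falls out. The sketch below does that for the two outputs posted above; the function and constant names are made up for illustration and are not part of the original gist.

```c
#include <assert.h>
#include <stdint.h>

/* Implied counter frequency in Hz: ticks divided by elapsed seconds.
 * elapsed_us is the elapsed time in microseconds, as printed by the test. */
static double implied_tick_rate_hz(uint64_t ticks, double elapsed_us)
{
    assert(elapsed_us > 0.0);
    return (double)ticks / (elapsed_us * 1e-6);
}

/* Values copied from the outputs posted above (names are hypothetical). */
static const uint64_t A9_TICKS = 36834640184ULL;
static const double   A9_US    = 56668677.21;
static const uint64_t R5_TICKS = 9879627907556208991ULL;
static const double   R5_US    = 32976237.10;
```

With these inputs the A9 pair works out to roughly 650 MHz, so the printed time may simply be the count divided by the nominal CPU clock rather than an independent measurement. The R5 pair works out to about 3e17 Hz, which is not a physical clock rate, so at least one of the two R5 numbers cannot be taken at face value.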
Hello 42Bastian Schick, and thank you for your help :)
Yes, it's a ZYNQ7000 (A9) and an UltraScale+ MPSoC (R5). I've been searching after your input and found this: https://www.xilinx.com/support/answers/66568.html where they say that "It works at the APU clock frequency." (which in the case of the MPSoC is a 1.5 GHz A53, and that would explain the much higher clock cycle count). So thank you for your suggestion :)
What still puzzles me is the difference in time between the A9 and the R5: shouldn't they be comparable? (Or at least, shouldn't the A9 be the faster one?)
Thank you again :)
Did you time those runs by hand, with a stopwatch?
For the cycles you should read the PMU counters.
The R5 has no access to the A53 timers, so XTime has a different time base there.
I suggest using a dedicated timer, checking its frequency with a GPIO, and then using it to measure the time.
Anyway, my experience across all ARM cores is that small routines just scale with the clock, with a slight performance advantage for cores with a longer pipeline.
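For the PMU suggestion above, here is a minimal bare-metal sketch, assuming an ARMv7-A/R core (both the Cortex-A9 and the Cortex-R5 qualify), a GCC-style toolchain, and privileged execution. The register encodings are the architectural ones for PMCR, PMCNTENSET and PMCCNTR; the function names are made up for illustration, and the non-ARM branch is only a stub so the arithmetic can be compiled off-target.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__arm__)
/* Enable the PMU cycle counter; div64 != 0 counts once every 64 cycles. */
static inline void pmu_start(int div64)
{
    uint32_t pmcr = 1u                          /* E: enable counters      */
                  | (1u << 2)                   /* C: reset cycle counter  */
                  | (div64 ? (1u << 3) : 0u);   /* D: divide by 64         */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));     /* PMCR       */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31)); /* PMCNTENSET */
}

static inline uint32_t pmu_read(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));        /* PMCCNTR    */
    return v;
}
#else
/* Host stubs (assumption: only used to compile the helpers off-target). */
static inline void pmu_start(int div64) { (void)div64; }
static inline uint32_t pmu_read(void) { return 0; }
#endif

/* CPU cycles between two reads; unsigned subtraction tolerates one wrap. */
static uint64_t pmu_elapsed_cycles(uint32_t start, uint32_t end, int div64)
{
    uint32_t ticks = end - start;
    return (uint64_t)ticks * (div64 ? 64u : 1u);
}

/* Seconds until the 32-bit counter wraps at a given CPU clock. */
static double pmu_wrap_seconds(double cpu_hz, int div64)
{
    assert(cpu_hz > 0.0);
    return 4294967296.0 * (div64 ? 64.0 : 1.0) / cpu_hz;
}
```

At 500 MHz the undivided counter wraps after about 8.6 s, which is shorter than the run times reported in this thread, so the divide-by-64 mode (about 550 s before wrapping) or explicit overflow handling would be needed for a run this long.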
Hello 42Bastian Schick! First of all, I want to thank you for your precious help!
I created a separate timer as suggested, and now the timing is right :)
The question about performance still remains, though: why is the R5 twice as fast as the A9? Could it be because the R5 uses LPDDR4 (on the board I have) while the A9 has DDR3?
Thank you again :)
Are you running both bare-metal?
The kind of DDR RAM should not matter much, since the code, at least, runs from cache.
But the data cache size might make the difference.
Do you have ECC enabled on the CA9? If so, it has only a 16-bit data bus.
Yup, 16 bit, It says
EDIT: Actually, since this board is an FPGA, I have the option to lower the R5's bus to 16 bits; I am just unsure whether it would kill the board or not :D
If bare-metal, try to make it run from the OCM, which is 256K and should be sufficient for code and data.
But the 16-bit bus seems to me to explain why you see such a big difference between the CA9 and the R5.
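To put a number on the bus-width point: peak DDR bandwidth is the transfer rate times the bytes moved per transfer, so halving the data bus halves the peak. The sketch below is only that arithmetic; the function name is made up, and the 1066 MT/s figure used in the usage note is an assumed example rate, not a value taken from either board in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Peak DDR bandwidth in bytes per second:
 * (mega-transfers per second) * 1e6 * (bus width in bytes). */
static uint64_t ddr_peak_bytes_per_s(uint64_t mega_transfers_per_s,
                                     unsigned bus_width_bits)
{
    assert(bus_width_bits % 8u == 0u);
    return mega_transfers_per_s * 1000000ULL * (bus_width_bits / 8u);
}
```

For an assumed 1066 MT/s interface, a 32-bit bus gives 4264000000 bytes/s and a 16-bit bus gives 2132000000 bytes/s. This only affects traffic that misses the caches, which is why the working-set size relative to the data cache matters as well.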
Hi,
Has the Cortex-A9 had its MMU set up and its caches enabled before running this code (from Normal memory)?
There's been no mention of this, and I have encountered this happening before...
Just a thought.
regards
Stuart
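One way to check the MMU and cache question on the Cortex-A9 is to read SCTLR just before the timed region and look at the M, C and I bits. This is a minimal sketch, assuming a GCC-style toolchain and privileged bare-metal execution; the function names are made up for illustration, and the non-ARM branch is only a stub so the bit decoding can be compiled off-target.

```c
#include <assert.h>
#include <stdint.h>

#define SCTLR_M (1u << 0)   /* MMU enable               */
#define SCTLR_C (1u << 2)   /* data cache enable        */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

#if defined(__arm__)
/* Read the ARMv7 System Control Register. */
static inline uint32_t read_sctlr(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
    return v;
}
#else
/* Host stub (assumption: only used to compile the decoder off-target). */
static inline uint32_t read_sctlr(void) { return 0; }
#endif

/* Nonzero only if MMU, data cache and instruction cache are all enabled. */
static int mmu_and_caches_enabled(uint32_t sctlr)
{
    const uint32_t need = SCTLR_M | SCTLR_C | SCTLR_I;
    return (sctlr & need) == need;
}

/* Call before the timed region; stops in assert() if setup is incomplete. */
static void require_mmu_and_caches(void)
{
    assert(mmu_and_caches_enabled(read_sctlr()));
}
```

This only covers the core's own L1 caches and MMU. The L2 cache controller on the Cortex-A9 side is enabled separately and would need its own check.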
Hello, thanks for all the precious ideas :) Unfortunately I had to put the project on hold until mid-February :( I will surely test and let you know :)
Thanks again,
Alessandro