Hi,
For a project on digital signal processing on ARM SoCs, I'm currently gathering information about the ARM NEON engine and would like to check whether my assumptions are correct.
I found an instruction timing table in the "Cortex-A9 NEON Media Processing Engine Technical Reference Manual" with columns like "Cycles", "Result" and "Writeback".
For example, these values are given for the VMLA Advanced SIMD floating-point instruction:
Name | Format | Cycles | Source | Result | Writeback |
VMLA | Dd,Dn,Dm | 1 | 3,2,2 | 9 | 10 |
Do I need to add up the values of the Cycles, Result and Writeback fields to get the duration of the VMLA instruction, so that it takes 20 cycles in total until the result is written back to the register file? Or is the result already in the register file 10 cycles after the instruction starts executing?
In other words: do the 10 Writeback cycles cover only the writeback itself, or do they already include the Cycles and Result durations?
I read that with NEON it's possible to do SIMD single precision x4.
Am I correct in assuming that "single precision" on NEON means 32-bit IEEE-754 floating point?
For a MAC (VMLA) this would mean 32-bit x 32-bit with a 64-bit product that is added to a 64-bit accumulator, correct?
And does the x4 mean that this can be done 4 times in parallel?
How many cycles would it take to have the 4 results in the register file then?
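To make the "x4" question concrete, here is a minimal sketch in plain C (not NEON intrinsics) of what I understand a four-lane single-precision multiply-accumulate to compute. The name `mla_f32x4_model` and the `LANES` constant are made up for illustration. The sketch assumes `float` is a 32-bit IEEE-754 type on the target, and it says nothing about cycle timing or about the internal width of the product, which is exactly what I am asking about above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: four 32-bit floats fit in one 128-bit NEON Q register. */
#define LANES 4

/*
 * Hypothetical scalar model of a four-lane multiply-accumulate:
 * acc[i] = acc[i] + a[i] * b[i], independently for each lane.
 * Real NEON hardware would process the lanes in parallel; this loop
 * only models the arithmetic result, not the timing.
 */
void mla_f32x4_model(float acc[LANES], const float a[LANES], const float b[LANES])
{
    assert(acc != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < LANES; ++i) {
        acc[i] += a[i] * b[i];
    }
}
```

Called with an accumulator of {1, 2, 3, 4}, a = {1, 2, 3, 4} and b = {2, 2, 2, 2}, each lane is updated on its own, which is how I read "4 times in parallel".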
Thank you.
Peter Harris said: the newer Technical Reference Manuals no longer have cycle timings
Hello,
Not sure if hijacking an old post is the best way to ask my question, but since it is in reference to the absence of cycle timings in the Reference Manual for recent ARM processors, here goes:
I am interested in multi-cycle NEON instructions and the number of execution cycles they consume for different data widths. The A9 reference manual says that VMUL, for example, takes 6 cycles to produce its result for 8/16-bit data and 7 cycles for 32-bit data. The manuals I could find for the A73, A57, etc., on the other hand, only mention 4 cycles, with no reference to the number of bits.
Does this mean that every VMUL takes 4 cycles regardless of data width? Why or why not?
Also, any references to design-specific details about the NEON units would be much appreciated. Thanks!
Best,
Gokul