I am curious to know how far Cortex-A78 goes with concurrent execution of (some) NEON instructions.
-------------------------------------------
Example 1:
/* Flush pipeline & disable ISRs */
SCST_PREPARE_PIPELINE
/* ABS - 128-bit operation */
ABS V31.2D,V0.2D /* Pipeline V0 */
ABS V30.2D,V1.2D /* Pipeline V1 */
ABS V29.2D,V2.2D /* Pipeline V0 */
ABS V28.2D,V2.2D /* Pipeline V1 */
I assume that ABS instructions 1 and 3 go to pipeline V0, and ABS instructions 2 and 4 go to pipeline V1.
Then I think instructions 1 and 2 execute concurrently in one clock cycle, and instructions 3 and 4 execute concurrently in the next clock cycle.
So the code is done in 2 clock cycles.
Is that correct?
----------------------------------------------
Example 2:
/* ABS - 64-bit operation */
ABS V16.2S,V3.2S /* Pipeline V0 */
ABS V15.2S,V3.2S /* Pipeline V1 */
ABS V14.2S,V3.2S /* Pipeline V0 */
ABS V14.2S,V3.2S /* Pipeline V1 */
There are two vector execution units in Cortex-A78, and each is 128 bits wide.
Does that mean that NEON code using 64-bit operations can execute 4 NEON instructions in one clock cycle, two per unit?
In other words, is the above code done in 1 cycle?
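To make the two readings explicit, here is a minimal sketch of the cycle counting in Python. It is only an illustration of the arithmetic in my question, not a model of the real core: the function name and parameters are made up, it assumes all instructions are independent, and it counts issue cycles only (result latency is ignored). Which value of `slots_per_pipeline` applies to 64-bit ABS is exactly what I am asking.

```python
from math import ceil


def issue_cycles(num_instructions: int,
                 pipelines: int = 2,
                 slots_per_pipeline: int = 1) -> int:
    """Minimum cycles needed to issue independent instructions.

    pipelines          -- number of vector pipelines (assumed: V0 and V1)
    slots_per_pipeline -- instructions one pipeline accepts per cycle
                          (hypothetical knob; 1 = one instruction per
                          pipeline per cycle regardless of operand width)
    """
    if num_instructions < 0 or pipelines <= 0 or slots_per_pipeline <= 0:
        raise ValueError("counts must be positive")
    return ceil(num_instructions / (pipelines * slots_per_pipeline))


# Example 1: four 128-bit ABS, one instruction per pipeline per cycle.
print(issue_cycles(4))

# Example 2, reading A: a 128-bit pipeline takes two 64-bit ABS per cycle.
print(issue_cycles(4, slots_per_pipeline=2))

# Example 2, reading B: operand width does not change the issue rate.
print(issue_cycles(4, slots_per_pipeline=1))
```

Under reading A the four 64-bit instructions fit in 1 cycle. Under reading B they take 2 cycles, the same as Example 1.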
Thanks for the answer.
P.S. The code is an example of our special code. Please do not ask why we need it or why we don't write it differently.
Please see the Cortex-A78 Software Optimization Guide
- https://developer.arm.com/documentation/102160/latest/
Section 3.15, ASIMD integer instructions.