I am curious to know how far Cortex-A78 goes with concurrent execution of (some) NEON instructions.
-------------------------------------------
Example 1:
/* Flush pipeline & disable ISRs */
SCST_PREPARE_PIPELINE

/* ABS - 128-bit operation */
ABS V31.2D,V0.2D /* Pipeline V0 */
ABS V30.2D,V1.2D /* Pipeline V1 */
ABS V29.2D,V2.2D /* Pipeline V0 */
ABS V28.2D,V2.2D /* Pipeline V1 */
I assume that the 1st and 3rd ABS instructions go to pipeline V0, and the 2nd and 4th go to pipeline V1.
Then I think the 1st and 2nd ABS execute concurrently in one clock cycle, and the 3rd and 4th execute concurrently in the next clock cycle.
So the four ABS instructions are done in 2 clock cycles.
Is that correct?
----------------------------------------------
Example 2:
/* ABS - 64-bit operation */
ABS V16.2S,V3.2S /* Pipeline V0 */
ABS V15.2S,V3.2S /* Pipeline V1 */
ABS V14.2S,V3.2S /* Pipeline V0 */
ABS V14.2S,V3.2S /* Pipeline V1 */
The Cortex-A78 has two vector execution units (V0 and V1), each 128 bits wide.
Does that mean that NEON code using 64-bit operations can execute 4 NEON instructions in one clock cycle, two per unit?
In other words, is the above code done in 1 cycle?
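To make my assumption explicit, here is a minimal sketch of the naive model I have in mind for both examples. It is only my mental model, not something taken from Arm documentation: the function name is made up, all instructions are assumed independent with single-cycle throughput, and latency is ignored.

```c
#include <assert.h>

/* Naive issue model (my assumption, not from Arm documentation):
 * every instruction is independent and `slots_per_cycle` of them
 * can start in each clock cycle. Result latency is ignored. */
unsigned naive_issue_cycles(unsigned n_instr, unsigned slots_per_cycle)
{
    assert(slots_per_cycle > 0);
    /* Integer ceiling of n_instr / slots_per_cycle. */
    return (n_instr + slots_per_cycle - 1) / slots_per_cycle;
}
```

For Example 1 I am assuming naive_issue_cycles(4, 2), which gives 2. For Example 2 the question is whether the right slot count is 4 (two 64-bit operations per 128-bit unit, giving 1 cycle) or still 2 (one instruction per unit regardless of width, giving 2 cycles).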
Thanks for the answer.
P.S. The code is an example of our special code; please do not ask why we need it or why we don't write it differently.