
Cortex-A78 NEON instructions timing

I am curious to know how far Cortex-A78 goes with concurrent execution of (some) NEON instructions.

-------------------------------------------

Example 1:

    /* Flush pipeline & disable ISRs            */
    SCST_PREPARE_PIPELINE

    /* ABS - 128-bit operation */
    ABS     V31.2D,V0.2D    /* Pipeline V0  */
    ABS     V30.2D,V1.2D    /* Pipeline V1  */
    ABS     V29.2D,V2.2D    /* Pipeline V0  */
    ABS     V28.2D,V2.2D    /* Pipeline V1  */

I assume that lines 1 and 3 go to pipeline V0, and lines 2 and 4 go to pipeline V1.

Then I think lines 1 and 2 execute concurrently in one clock cycle, and lines 3 and 4 execute concurrently in the next clock cycle.

So the code is done in 2 clock cycles.

Is that correct?
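To make the arithmetic behind this assumption explicit, here is a minimal C sketch. It only counts issue slots: `issue_cycles` is a hypothetical helper name of mine, and the model assumes independent instructions, one instruction per pipeline per cycle, and no instruction latency or dispatch limits. It is not taken from any Arm documentation.

```c
/* Hypothetical helper: number of issue cycles needed for n_insns
   independent instructions spread evenly over `pipes` pipelines,
   assuming each pipeline accepts one instruction per cycle.
   Ignores result latency, dispatch width and dependencies. */
static unsigned issue_cycles(unsigned n_insns, unsigned pipes)
{
    return (n_insns + pipes - 1) / pipes;   /* ceil(n_insns / pipes) */
}
```

With 4 ABS instructions and 2 pipelines (V0 and V1), `issue_cycles(4, 2)` gives the 2 cycles I expect above.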

----------------------------------------------

Example 2:

    /* Flush pipeline & disable ISRs            */
    SCST_PREPARE_PIPELINE

    /* ABS - 64-bit operation */
    ABS     V16.2S,V3.2S    /* Pipeline V0  */
    ABS     V15.2S,V3.2S    /* Pipeline V1  */
    ABS     V14.2S,V3.2S    /* Pipeline V0  */
    ABS     V14.2S,V3.2S    /* Pipeline V1  */

I assume that lines 1 and 3 go to pipeline V0, and lines 2 and 4 go to pipeline V1.

Cortex-A78 has two vector execution units, each 128 bits wide.

Does that mean that NEON code using 64-bit operations can execute 4 NEON instructions in one clock cycle?

In other words, is the above code done in 1 cycle?

Is that correct?
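The two possible answers can be written down as a small C sketch. Both function names are hypothetical and only encode my two guesses: either a 128-bit pipeline can take two 64-bit instructions per cycle ("packed"), or it takes one instruction per cycle regardless of operand width ("not packed"). Latency and dependencies are ignored, and I do not know which model, if either, matches the real hardware.

```c
/* Guess A (hypothetical): a pipeline of pipe_bits width accepts
   pipe_bits / op_bits instructions per cycle. */
static unsigned cycles_if_packed(unsigned n_insns, unsigned pipes,
                                 unsigned pipe_bits, unsigned op_bits)
{
    unsigned per_cycle = pipes * (pipe_bits / op_bits);
    return (n_insns + per_cycle - 1) / per_cycle;   /* ceil */
}

/* Guess B (hypothetical): one instruction per pipeline per cycle,
   whatever the operand width. */
static unsigned cycles_if_not_packed(unsigned n_insns, unsigned pipes)
{
    return (n_insns + pipes - 1) / pipes;           /* ceil */
}
```

For the four 64-bit ABS instructions above, `cycles_if_packed(4, 2, 128, 64)` gives 1 cycle and `cycles_if_not_packed(4, 2)` gives 2 cycles, so my question is which of the two applies.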

Thanks for the answer.

P.S. The code is an example of our special code; please do not ask why we need it or why we don't write it differently.