
Questions regarding NEON

Hi,

For a project on digital signal processing on ARM SoCs, I'm currently gathering information about the ARM NEON engine and would like to check whether my assumptions are correct.

I found an instruction timing table in the "Cortex-A9 NEON Media Processing Engine Technical Reference Manual" with columns like "Cycles", "Result" and "Writeback".

For example for a VMLA Advanced SIMD floating-point instruction there are these values given:

| Name | Format   | Cycles | Source | Result | Writeback |

| VMLA | Dd,Dn,Dm | 1      | 3,2,2  | 9      | 10        |

Is it necessary to add the values of the Cycles, Result and Writeback fields to calculate the duration of the VMLA instruction, so that it takes 20 cycles in total to have the result written back to the register file? Or can the result already be found in the register file 10 cycles after the instruction starts executing?

In other words: are the 10 cycles for the Writeback only used and needed for the Writeback or are the Result- and Execution-Cycles-durations included?

I read that with NEON it's possible to do SIMD single precision x4.

Am I correct in assuming that with NEON, when talking about single precision, we are talking about 32-bit (IEEE-754)?

For a MAC (VMLA) this would mean 32-bit x 32-bit with a 64-bit product that is added to a 64-bit accumulator, correct?

And does the x4 mean that this can be done 4 times in parallel?

How many cycles would it take to have the 4 results in the register file then?

Thank you.

  • The cycle timing tables show cycle number, not number of cycles.

    | Name | Format   | Cycles | Source | Result | Writeback |

    | VMLA | Dd,Dn,Dm | 1      | 3,2,2  | 9      | 10        |

    The "Source" column shows when the inputs are needed. In this example the multiplier inputs (Dn, Dm) are needed in cycle 2, the accumulator input is needed one cycle later in cycle 3. The accumulator result is available in cycle 9 (and can be register forwarded at this point to later instructions), and it finally hits the register file in cycle 10.
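As a rough illustration of the two readings of the table, here is a small Python sketch. The `TimingRow` class and both helper functions are hypothetical names made up for this example; the numbers are the ones from the VMLA row above, and the meaning given to the Cycles column is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingRow:
    """One row of the timing table (hypothetical container for this sketch)."""
    name: str
    cycles: int       # assumed here to be the number of issue cycles consumed
    source: tuple     # cycle number in which each operand (Dd, Dn, Dm) is needed
    result: int       # cycle number at whose end the result can be forwarded
    writeback: int    # cycle number in which the result reaches the register file


VMLA = TimingRow("VMLA", cycles=1, source=(3, 2, 2), result=9, writeback=10)


def wrong_total(row):
    """The misreading from the question: summing the columns as durations."""
    return row.cycles + row.result + row.writeback


def register_file_cycle(row):
    """Reading the columns as cycle numbers: the Writeback column is the answer."""
    return row.writeback


print(wrong_total(VMLA), register_file_cycle(VMLA))
```

Under the cycle-number reading the result is in the register file in cycle 10, not after 20 cycles.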

    From the point of view of scheduling code you need to look at the difference between the Source of instruction N+1 and Result of instruction N. For example, for this code sequence:

    VMLA.f32 d2, d0, d1
    VMLA.f32 d3, d0, d2
    

    ... the second instruction reads the output of the first (d2) as a multiplier input, which it needs at the start of its cycle 2, but that value is only available at the end of cycle 9 of the first instruction. This therefore results in a 9 - 2 = 7 cycle stall.

    In practice the cycle timing tables are very hard to use, and the hardware has many fast paths, so the tables are of limited value (the newer Technical Reference Manuals no longer include cycle timings). It's also worth noting that NEON is an instruction set, and different CPUs implement that instruction set differently. Cortex-A9 has very different NEON performance from a Cortex-A7, which in turn differs from a Cortex-A57, so the timing data you have applies only to the Cortex-A9.
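The "Result of instruction N minus Source of instruction N+1" rule can be sketched in a few lines of Python. This is only a model of the table arithmetic: `stall_cycles` is a hypothetical helper, it ignores the hardware fast paths mentioned above, and the accumulator case is an extrapolation of the same rule rather than something measured.

```python
# Cycle numbers copied from the VMLA row quoted earlier (Cortex-A9 only).
VMLA_SOURCE = {"Dd": 3, "Dn": 2, "Dm": 2}   # cycle in which each operand is needed
VMLA_RESULT = 9                             # cycle at whose end the result can be forwarded


def stall_cycles(producer_result, consumer_source):
    """Stall estimate: Result of instruction N minus Source of instruction N+1."""
    return max(0, producer_result - consumer_source)


# VMLA d2, d0, d1 followed by VMLA d3, d0, d2: d2 is consumed as Dm.
multiplier_dep = stall_cycles(VMLA_RESULT, VMLA_SOURCE["Dm"])

# VMLA d2, d0, d1 followed by VMLA d2, d3, d4: d2 is consumed as the accumulator Dd.
accumulator_dep = stall_cycles(VMLA_RESULT, VMLA_SOURCE["Dd"])

print(multiplier_dep, accumulator_dep)
```

By the same table arithmetic, a dependency through the accumulator would cost one cycle less than a dependency through a multiplier input, because the accumulator is needed one cycle later.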

    Am I correct in assuming that with NEON, when talking about single precision, we are talking about 32-bit (IEEE-754)?

    Yes, single precision = 32-bit. Not all NEON behaviour is fully IEEE: most vector engines have some optimizations around denormals, infinities, NaNs, and not raising exceptions. A lot of the corner-case behaviour is configurable in the control registers, but not all of it. See the ARM Architecture Reference Manual ISA specification for details. The scalar VFP operations are full IEEE (if the control registers are configured for that).
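To make the "32-bit" and "denormals" points concrete, here is a minimal Python sketch. It is a host-side model, not NEON code: `to_f32` and `flush_to_zero` are hypothetical helpers, and whether a given core actually flushes denormals depends on the architecture version and control register settings.

```python
import struct


def to_f32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


F32_MIN_NORMAL = 2.0 ** -126   # smallest positive normal single-precision value


def flush_to_zero(x):
    """Replace a denormal single-precision value with zero, keeping its sign."""
    if x != 0.0 and abs(x) < F32_MIN_NORMAL:
        return -0.0 if x < 0 else 0.0
    return x


size_bits = 8 * struct.calcsize("<f")   # storage size of one single-precision value
tiny = to_f32(1e-40)                    # only representable as a denormal in single precision
print(size_bits, tiny != 0.0, flush_to_zero(tiny))
```

A fully IEEE-754 unit would keep `tiny` as a denormal, while a unit running in a flush-to-zero mode would treat it as zero, which is one of the corner cases referred to above.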

    I read that with NEON it's possible to do SIMD single precision x4

    For a MAC (VMLA) this would mean 32-bit x 32-bit with a 64-bit product that is added to a 64-bit accumulator, correct?

    And does the x4 mean that this can be done 4 times in parallel?

    How many cycles would it take to have the 4 results in the register file then?

    The NEON ISA allows 32-bit, 64-bit, or 128-bit data registers to be accessed. A register can be accessed as a wide variety of data types, for example fp32 vec4, fp16 vec8, or u8 vec16. It's very flexible. So "x4" refers to a 128-bit register holding four fp32 lanes that a single instruction operates on.
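The lane counts and the "x4" multiply-accumulate can be modelled in plain Python. This is a sketch under stated assumptions, not NEON intrinsics: `lanes`, `to_f32` and `vmla_f32_model` are hypothetical names, and the model assumes a non-fused multiply-accumulate in which both the product and the sum are rounded to 32 bits, so each lane's accumulator stays a 32-bit float rather than widening to 64 bits.

```python
import struct


def lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in one register."""
    return register_bits // element_bits


def to_f32(x):
    """Round a Python float to the nearest 32-bit single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def vmla_f32_model(acc, a, b):
    """Lane-wise acc[i] + a[i] * b[i], rounding the product and the sum to 32 bits each."""
    return [to_f32(d + to_f32(n * m)) for d, n, m in zip(acc, a, b)]


# fp32, fp16 and u8 views of one 128-bit register
print(lanes(128, 32), lanes(128, 16), lanes(128, 8))

# Four independent multiply-accumulates, one per fp32 lane
print(vmla_f32_model([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], [2.0, 4.0, 6.0, 8.0]))
```

The four lanes never interact, which is what "4 times in parallel" means here; how many cycles the hardware needs for them is a separate, implementation-specific matter.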

    Not all implementations of NEON implement the full data width, so some hardware cores will take multiple cycles to issue an instruction which would be single cycle on other implementations.
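A simple way to picture this is to divide the register width by the datapath width. The sketch below is a deliberately crude model with hypothetical widths; `issue_cycles` is an illustrative helper and the numbers are not taken from any particular core's documentation.

```python
def issue_cycles(register_bits, datapath_bits):
    """Passes needed to push one vector operation through a narrower datapath."""
    return -(-register_bits // datapath_bits)   # ceiling division


# 128-bit op on a 128-bit datapath, 128-bit op on a 64-bit datapath, 64-bit op on a 64-bit datapath
print(issue_cycles(128, 128), issue_cycles(128, 64), issue_cycles(64, 64))
```

In this model a 128-bit operation on a 64-bit datapath occupies the issue stage twice, which is the kind of difference between implementations described above.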

    HTH,
    Pete
