
Understanding the VFP and NEON link

Note: This was originally posted on 14th March 2011 at http://forums.arm.com

I'd like to understand exactly how NEON can be used instead of VFP.

I understand that
FADDS
is now replaced by
VADD.f32

But I have some questions about that.
- Do the two syntaxes have the same encoding (the same hexadecimal representation)? I guess yes, but I'd like to be sure.
- FADDS can be conditional. Can anybody give me an example?
- If FADDS can be conditional, then VADD.f32 should be too. What is the correct syntax for a conditional VADD.f32?
- FADDS is now VADD.f32 and is executed in the NEON pipeline. Does that mean VADD.f32 (and FADDS) executes in 1 cycle instead of 9?

If I'm right,
FADDD is replaced by VADD.f64 but is not executed in the NEON pipeline, so the VFP cycle table must be used.
FADDD does not seem to be a conditional instruction. Do we agree on that?

Thanks
Etienne
  • Note: This was originally posted on 14th March 2011 at http://forums.arm.com

    OK, thanks.

    I do not understand what you mean by
    "Some have separate (ish) blocks for each, others will have more tightly integrated blocks."

    Since my previous post I noticed this in the documentation:

    Each VFP instruction takes 7 cycles to execute in the NFP pipeline because of this restriction.

    So in the end it seems that the optimisation is not so interesting.

    FADDS takes 9-10 cycles (on VFP execution)
    VADD.f32 takes 7 cycles (on NEON execution)

    It seems that the fastest way to use 32-bit floating-point instructions is to use NEON with 64-bit registers:
    VADD.f32 d0, d1, d2 takes only 1 cycle.

    Let's suppose that the upper 32 bits are discarded.

    I will run some tests tonight!
  • Note: This was originally posted on 14th March 2011 at http://forums.arm.com

    OK, I've run some tests...

    In this code


    fmsr            s14, r0
    ...
    fmuls   s0, s14, s14
    ...
    fmrs   r0, s0
    mov   pc, lr



    fmuls takes 12 cycles. This is the same time as vmul.f32 (because, as you said, it is the same instruction).
    I don't know whether it uses the NEON pipeline or not. I suppose not, because it should take only 7 cycles if that were the case.

    and in this code



    fmsr            s14, r0
    ...
    vmul.f32  d0, d7, d7
    ...
    fmrs   r0, s0
    mov   pc, lr


    The vmul.f32 takes only 1 cycle!

    What could be faster?
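
    A small C model of why the second version still returns the right value: s14 is the low half of d7 and s0 is the low half of d0, so the D-register multiply computes the wanted product in lane 0 and a second, unused product in lane 1. The vfp_bank type and the vmul_f32_d function are hypothetical names made up for this sketch, and the model says nothing about timing.

```c
#include <assert.h>

/* Model of the 32 S registers and the 16 D registers that alias them:
 * s(2k) is the low half of d(k), s(2k+1) is the high half. */
typedef union {
    float s[32];
    float d[16][2];
} vfp_bank;

/* Model of VMUL.F32 Dd, Dn, Dm: both 32-bit lanes are multiplied. */
void vmul_f32_d(vfp_bank *b, unsigned dd, unsigned dn, unsigned dm)
{
    assert(dd < 16 && dn < 16 && dm < 16);
    /* Compute both lanes first so that dd may alias dn or dm. */
    float lo = b->d[dn][0] * b->d[dm][0];
    float hi = b->d[dn][1] * b->d[dm][1];
    b->d[dd][0] = lo;
    b->d[dd][1] = hi;
}
```

    Writing a value to s[14] and calling vmul_f32_d(&bank, 0, 7, 7) leaves its square in s[0], which is what fmrs r0, s0 reads back. Lane 1 (s[1]) receives the square of whatever happened to be in s[15].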
  • Note: This was originally posted on 17th March 2011 at http://forums.arm.com

    I've run some tests on floating-point operations.

    And...
    I've written my first blog post in English (almost English).
    http://pulsar.webshaker.net/2011/03/17/optimize-float-operation-on-cortex-a8/

    It's clear that using NEON instead of VFP can really increase floating-point performance.

    My problem is that I have not managed to enable the FastMode.
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Chdiihcd.html

    Is it a compilation parameter, or is it more complicated to enable this mode?
  • Note: This was originally posted on 14th March 2011 at http://forums.arm.com

    A few years back ARM changed the syntax; I never heard why.  This didn't change the VFP instructions, just how you write them down.  So

      FADDS  s0, s1, s2

    Becomes:

      VADD.F32 s0, s1, s2

    Same instruction, same bit pattern, just a different way of writing it down.
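
    As a rough illustration of "same bit pattern", the sketch below builds the 32-bit ARM-state word for the scalar single-precision add. The helper name encode_vadd_f32_s and the COND_* constants are made up for this example, and the field layout is my reading of the architecture manual, so check it against your assembler's output. The condition field in bits 31:28 is what lets the VFP form be conditional: FADDSEQ s0, s1, s2 in the old syntax and VADDEQ.F32 s0, s1, s2 in the new one should assemble to the same word.

```c
#include <assert.h>
#include <stdint.h>

/* A few ARM condition codes (bits 31:28 of an ARM-state instruction). */
enum { COND_EQ = 0x0, COND_NE = 0x1, COND_AL = 0xE };

/* Encode VADD{cond}.F32 Sd, Sn, Sm (old syntax: FADDS{cond} Sd, Sn, Sm).
 * An S register number is split into a 4-bit field plus one low bit. */
uint32_t encode_vadd_f32_s(unsigned cond, unsigned sd, unsigned sn, unsigned sm)
{
    assert(cond <= 0xE && sd < 32 && sn < 32 && sm < 32);
    return ((uint32_t)cond << 28)
         | (0xEu << 24)                  /* VFP data-processing group   */
         | ((uint32_t)(sd & 1) << 22)    /* D: low bit of Sd            */
         | (0x3u << 20)                  /* opcode bits for add/sub     */
         | ((uint32_t)(sn >> 1) << 16)   /* Vn: upper four bits of Sn   */
         | ((uint32_t)(sd >> 1) << 12)   /* Vd: upper four bits of Sd   */
         | (0xAu << 8)                   /* 101 and sz = 0 (single)     */
         | ((uint32_t)(sn & 1) << 7)     /* N: low bit of Sn            */
         | ((uint32_t)(sm & 1) << 5)     /* M: low bit of Sm            */
         | (uint32_t)(sm >> 1);          /* Vm: upper four bits of Sm   */
}
```

    With COND_AL and registers s0, s1, s2 this gives 0xEE300A81; with COND_EQ only the top nibble changes, giving 0x0E300A81.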

    One thing to note about NEON (or Advanced SIMD, to use its other name) is that it only supports single precision.  So anything using double precision (F64) will always be VFP, not NEON.

    One thing to remember is that VFP and NEON are instruction set definitions, not pieces of hardware.  It's up to each implementation to decide how to implement them in hardware.  Some have separate (ish) blocks for each, others will have more tightly integrated blocks.
  • Note: This was originally posted on 14th March 2011 at http://forums.arm.com

    The ARM Architecture defines what the instructions are, and what they do (ditto for VFP, NEON).  What it doesn't define is how the hardware engineers should implement any of it.  So the designers could go for a 3 stage pipeline, an 8 stage pipe, or a 5000000000000 stage pipe.  As long as it functions in the way the architecture docs define, it doesn't technically matter.  Obviously, some approaches will be better than others.  Parts of the instruction set are optional, such as NEON and VFP; the designers have the choice to include them or not.  But again, how they do so is up to them.  They could create one hardware block that implements both.  Or separate blocks for each.  As long as it functions correctly, it's up to the designers.

    VADD.F32 d0, d1, d2 is a NEON instruction...  it will do the following:   d0[31:0] = d1[31:0] + d2[31:0],   d0[63:32] = d1[63:32] + d2[63:32]

    This is where the vectored nature of NEON comes in.  The .F32 says this is 32-bit (single precision) arithmetic.  The "d" registers show that you are using 64-bit (double) registers, which hold 2x single precision values.  So you get two parallel additions.
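
    The two parallel additions can be modelled in plain C, as a sketch only. The dreg_f32 type and the function names are hypothetical and exist just for this example; real code would use the instruction itself or the compiler's NEON intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a 64-bit "d" register viewed as two single-precision lanes. */
typedef struct { float lane[2]; } dreg_f32;

/* Model of VADD.F32 Dd, Dn, Dm: each lane is added independently. */
dreg_f32 vadd_f32_d(dreg_f32 n, dreg_f32 m)
{
    dreg_f32 d;
    for (size_t i = 0; i < 2; ++i)
        d.lane[i] = n.lane[i] + m.lane[i];
    return d;
}

/* Read one lane; lane 0 is bits [31:0], lane 1 is bits [63:32]. */
float dreg_get_lane(dreg_f32 r, unsigned i)
{
    assert(i < 2);
    return r.lane[i];
}
```

    Adding the registers {1, 2} and {10, 20} with this model yields {11, 22}: one result per lane from a single instruction.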

    I'm afraid I've not played around enough with optimization to know which will be quicker.