
NEON vs VFP usage

Note: This was originally posted on 29th August 2011 at http://forums.arm.com

Hi,

Could I use NEON and VFP at the same time in my application?
What would be the downsides of that?

I also read in the documentation that the compilation flags are as follows:
GCC
-mfpu=neon -mfloat-abi=softfp
-mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp

Do these options control just the usage of NEON intrinsics? Does specifying the option for one (e.g. NEON)
prevent me from using the other (e.g. VFP) directly in the code?
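
For illustration, here is roughly the kind of code I have in mind (a minimal sketch of my own, not taken from the documentation; the function name is made up): NEON intrinsics from <arm_neon.h> sitting next to plain scalar float arithmetic, which I expect the compiler to lower to VFP instructions. My question is whether picking -mfpu=neon versus -mfpu=vfpv3 changes what is allowed in such a function.

#include <arm_neon.h>

/* Sum four floats with NEON intrinsics, then do plain scalar
   float math that the compiler typically maps to VFP instructions. */
float mixed_example(const float in[4], float bias)
{
    float32x4_t v = vld1q_f32(in);                          /* NEON: load 4 floats  */
    float32x2_t p = vadd_f32(vget_low_f32(v), vget_high_f32(v));
    float sum = vget_lane_f32(vpadd_f32(p, p), 0);          /* NEON: horizontal add */

    return sum * 0.5f + bias;                               /* scalar (VFP) arithmetic */
}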

Also specifying "softfp" seems to incur some overhead in my application(at least from the preliminary benchmarks).
I tried to use the "hard" option but then I have linkage error as the runtime libraries are not built with support for that.
Could I get somewhere the runtime libraries built with "hard" or do I have to do it myself?
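
To make the overhead concrete (my own sketch, not from any documentation; the names are made up): as I understand softfp, a float argument or return value crosses a non-inlined call boundary in core registers, so values have to be shuffled between core and VFP/NEON registers around every call. Keeping small FP helpers inline should avoid that boundary entirely:

/* Hypothetical helper: when declared static inline (and actually inlined),
   there is no softfp call boundary and therefore no extra moves between
   core registers and FP registers. */
static inline float squared_norm(float x, float y)
{
    return x * x + y * y;
}

float accumulate(const float *xs, const float *ys, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += squared_norm(xs[i], ys[i]);   /* inlined: values stay in FP registers */
    return acc;
}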

Thanks
  • Note: This was originally posted on 1st September 2011 at http://forums.arm.com


    ...
    VFP does not have an instruction queue like NEON, so every VFP instruction needs to wait for the NEON queue to be empty.
    So in most cases it is not a good idea to use both together.
    ...


    So I'm coming back to this, as I'm having a hard time with it.
    I have read wikis, forum posts and some blog posts, and everybody seems to agree that using NEON is better than using VFP,
    or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea;
    I'm not 100% sure yet whether this applies to the entire application or just to specific places (functions) in the code.

    So I'm using NEON as the FPU for my application, since I also want to use the intrinsics. As a result I'm in a bit of
    trouble, and my confusion about how best to use these features (NEON vs VFP) on the Cortex-A9 just deepens further
    instead of clearing up. I have some code that does benchmarking for my app and uses some custom-made timer classes
    in which the calculations are based on double-precision floating point. Using NEON as the FPU gives completely
    inappropriate results (trying to print those values mostly prints inf and NaN; the same code works
    without a hitch when built for x86). So I changed my calculations to use single-precision floating point, as it is
    documented that NEON does not handle double precision. My benchmarks still don't give the proper results
    (and, what's worse, they now no longer work on x86 either). So I'm almost completely lost: on the one hand I want to use
    NEON for its SIMD capabilities, but using it as the FPU does not give the proper results; on the other hand, mixing it
    with VFP does not seem to be a very good idea.
    Any advice in this area will be greatly appreciated!
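
    Just to show the kind of calculation I mean (a stripped-down sketch, not my actual timer class; the names are made up): keeping everything in single precision means using float variables and f-suffixed constants, since an unsuffixed literal like 1000.0 silently promotes the whole expression to double.

    #include <stdint.h>

    /* Convert a tick delta to milliseconds entirely in single precision. */
    static float ticks_to_ms(uint64_t start, uint64_t end, float ticks_per_sec)
    {
        float dt = (float)(end - start);        /* may lose precision for very large deltas */
        return dt * 1000.0f / ticks_per_sec;    /* f-suffixed so nothing promotes to double */
    }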

    I found on this wiki a summary of what should be done for floating-point optimization, and I tried to follow it.
    "
    • Only use single precision floating point
    • Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
    • Minimize Conditional Branches
    • Enable RunFast mode
    For softfp:

    • Inline floating point code (unless it's very large)
    • Pass FP arguments via pointers instead of by value and do integer work in between function calls.
    "

    As a result, part of my compiler command line is now:
    -O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp

    I cannot use hard for the float ABI as I cannot link with the libraries I have available.

    Should I follow the above guidelines? Most of them make sense to me (except the "RunFast mode" one,
    which I don't know exactly what it's supposed to do), but I'm not sure of anything right now.
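
    From what I could gather so far (so this is my understanding, not an authoritative definition), "RunFast mode" essentially means setting the flush-to-zero (FZ) and default-NaN (DN) bits in the FPSCR, so that denormals and NaN propagation never fall back to the slower IEEE-exact handling; something along these lines:

    /* Sketch: set FZ (bit 24) and DN (bit 25) in FPSCR.
       This trades strict IEEE 754 behaviour for speed. */
    static inline void enable_runfast(void)
    {
        unsigned int fpscr;
        __asm__ volatile ("vmrs %0, fpscr" : "=r"(fpscr));
        fpscr |= (1u << 24) | (1u << 25);    /* FZ: flush-to-zero, DN: default NaN */
        __asm__ volatile ("vmsr fpscr, %0" : : "r"(fpscr));
    }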

    Thanks