
NEON vs VFP usage

Note: This was originally posted on 29th August 2011 at http://forums.arm.com

Hi,

Could I use NEON and VFP at the same time in my application?
What would be the downsides of that?

I also read in the documentation that the compilation flags are as follows:
GCC
-mfpu=neon -mfloat-abi=softfp
-mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp

Do these options control just the usage of NEON intrinsics? Does specifying the option for one (e.g. NEON)
prevent me from using the other (e.g. VFP) directly in the code?
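For concreteness, this is the sort of thing I mean by using both directly in the code (a trivial made-up example, not from my real sources):

#include <arm_neon.h>

/* NEON used explicitly through intrinsics... */
void add4(const float *a, const float *b, float *out)
{
    float32x4_t va = vld1q_f32(a);       /* load 4 floats */
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));   /* add and store */
}

/* ...next to ordinary scalar float code, which (as I understand it) the
 * compiler turns into VFP or NEON instructions depending on the options above. */
float add1(float a, float b)
{
    return a + b;
}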

Also specifying "softfp" seems to incur some overhead in my application(at least from the preliminary benchmarks).
I tried to use the "hard" option but then I have linkage error as the runtime libraries are not built with support for that.
Could I get somewhere the runtime libraries built with "hard" or do I have to do it myself?

Thanks
  • Note: This was originally posted on 30th August 2011 at http://forums.arm.com


    Could I use NEON and VFP at the same time in my application?
    What would be the downsides of that?



    You can; there is no technical problem.
    But don't expect to be able to optimize your code that way. Mixing NEON and VFP instructions will give you poor performance (on Cortex-A8 at least; on Cortex-A9 I don't know).

    VFP does not have an instruction queue like NEON, so every VFP instruction needs to wait for the NEON queue to be empty.
    So in most cases it is not a good idea to use both together.
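    To make it concrete, the kind of mixing I mean looks like this (just a rough sketch, names made up, not measured):

    #include <arm_neon.h>

    /* The double-precision accumulation is VFP-only, while the vector work
     * goes through NEON, so on Cortex-A8 the two keep stalling each other. */
    void mixed(const float *in, float *out, double *acc, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t v = vld1q_f32(in + i);         /* NEON load      */
            vst1q_f32(out + i, vmulq_n_f32(v, 2.0f));  /* NEON mul/store */
            *acc += (double)in[i];                     /* VFP double add */
        }
    }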

    Etienne



  • Note: This was originally posted on 30th August 2011 at http://forums.arm.com

    The --apcs flag changes the procedure call standard (or sub-standard) being used, in this case from the default "hard fp" linkage to "soft fp" linkage. What this means in practice is how parameters and return values are passed. With hard fp, floating-point types (float and double) are passed in VFP/NEON registers. With softfp, floating-point types are passed in general-purpose registers.
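    As an illustration (a made-up function; register assignments as I understand them from the AAPCS):

    float scale(float x, float factor)
    {
        return x * factor;
    }

    /* softfp (-mfloat-abi=softfp): x arrives in r0 and factor in r1; the
     * callee moves them into VFP/NEON registers before multiplying, and
     * the result is returned in r0.
     * hard fp (-mfloat-abi=hard): x arrives in s0 and factor in s1, and
     * the result is returned in s0, with no moves between core and FP
     * registers at the call boundary. */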
  • Note: This was originally posted on 30th August 2011 at http://forums.arm.com


    ...
    VFP does not have an instruction queue like NEON, so every VFP instruction needs to wait for the NEON queue to be empty.
    So in most cases it is not a good idea to use both together.



    Thanks for the insights. I validated for myself that I can use both of them together by writing a small test app.
    It does seem that many people agree it is not such a good idea performance-wise.
  • Note: This was originally posted on 30th August 2011 at http://forums.arm.com


    The --apcs flag changes the procedure call standard (or sub-standard) being used, in this case from the default "hard fp" linkage to "soft fp" linkage. What this means in practice is how parameters and return values are passed. With hard fp, floating-point types (float and double) are passed in VFP/NEON registers. With softfp, floating-point types are passed in general-purpose registers.


    OK, thanks.
    That is what I understood as well.

    I tried to use hard fp for my application, but I got linking errors, which were more or less expected, as the
    calling conventions do not match between my app's own object files and the C++ libraries. I was wondering whether
    there are runtime C++ libraries available that were built using hard fp, or whether there is any kind of
    workaround for this situation (apart from building my own versions of the runtime libraries). As I understood from the
    documentation and forum posts I have read, using softfp (float-type arguments passed in GPRs) incurs a certain performance penalty.
  • Note: This was originally posted on 1st September 2011 at http://forums.arm.com


    ...
    VFP does not have an instruction queue like NEON, so every VFP instruction needs to wait for the NEON queue to be empty.
    So in most cases it is not a good idea to use both together.
    ...


    So I'm coming back to this, as I'm having a hard time with it.
    I have read wikis, forum posts and blog posts, and everybody seems to agree that using NEON is better than using VFP,
    or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea;
    I'm not 100% sure yet whether this applies to the entire application or just to specific places (functions) in the code.

    So I'm using NEON as the FPU for my application, as I also want to use the intrinsics. As a result I'm in a bit of
    trouble, and my confusion about how best to use these features (NEON vs VFP) on the Cortex-A9 just deepens further
    instead of clearing up. I have some code that does benchmarking for my app and uses some custom-made timer classes
    in which the calculations are based on double-precision floating point. Using NEON as the FPU gives completely
    wrong results (trying to print those values produces mostly inf and NaN; the same code works
    without a hitch when built for x86). So I changed my calculations to use single-precision floating point, since it is
    documented that NEON does not handle double precision. My benchmarks still don't give the proper results
    (and what's worse is that now they no longer work on x86 either). So I'm almost completely lost: on the one hand I want to use
    NEON for its SIMD capabilities, but using it as the FPU does not give proper results; on the other hand, mixing it
    with VFP does not seem to be a very good idea.
    Any advice in this area will be greatly appreciated!
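    One thing I am considering for the timer classes themselves is to keep all the arithmetic in 64-bit integer nanoseconds and only convert to single precision at the very end, when printing; roughly this kind of sketch (made-up names, not my actual code):

    #include <stdint.h>
    #include <time.h>

    /* All timing math stays in integers; the only FP step is the final
     * conversion for reporting, so the choice of FPU barely matters. */
    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    static float elapsed_ms(uint64_t start_ns, uint64_t end_ns)
    {
        return (float)(end_ns - start_ns) / 1.0e6f;
    }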

    I found on this wiki a summary of what should be done for floating point optimization and I tried to follow it.
    "
    • Only use single precision floating point
    • Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
    • Minimize Conditional Branches
    • Enable RunFast mode
    For softfp:

    • Inline floating-point code (unless it's very large)
    • Pass FP arguments via pointers instead of by value and do integer work in between function calls.
    "

    As a result part of my compiler command line is now:
    -O3 -mcpu=cortex-a9 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp

    I cannot use hard for the float ABI as I cannot link with the libraries I have available.

    Should I follow the above guidelines? Most of them make sense to me (except "RunFast mode",
    which I don't know exactly what it is supposed to do), but I'm not sure of anything right now.
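    (My best guess so far about RunFast mode is that it means putting the VFP into flush-to-zero / default-NaN mode through the FPSCR so it stays on the fast hardware path; something like the snippet below is what I would try, though I have not verified it on my board:)

    /* Sketch: set FZ (bit 24) and DN (bit 25) in the FPSCR, leaving FP
     * exception trapping disabled, which is my understanding of RunFast. */
    static void enable_runfast(void)
    {
        unsigned int fpscr;
        __asm__ volatile("vmrs %0, fpscr" : "=r"(fpscr));
        fpscr |= (1u << 24) | (1u << 25);
        __asm__ volatile("vmsr fpscr, %0" : : "r"(fpscr));
    }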

    Thanks
  • Note: This was originally posted on 1st September 2011 at http://forums.arm.com

    First see if you can actually get your code working in single precision. If you need double precision then NEON won't be an option, period.

    The advice given applies to Cortex-A8 more than Cortex-A9. VFP on Cortex-A9 is roughly as fast as the NEON equivalents in terms of issue rate and latency. If your code doesn't lend itself to vectorization then you're probably better off sticking with VFP, unless you want to have good performance on Cortex-A8. Mixing the two is probably a bad idea on both processors.

    I have found myself mixing VFP and NEON in one instance, where I wanted a large integer reciprocal. On Cortex-A8, VFP can do it (not a complete 64-bit result, but close enough) faster than I could with vrecpe plus Newton-Raphson steps. But the VFP divide instruction blocks further NEON instructions from executing. On the other hand, you can still execute integer instructions, so if you can schedule those during the divide you can recover a lot of the time.
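    For reference, the NEON route being compared against is the usual estimate-and-refine reciprocal, roughly like this for packed 32-bit floats (the large-integer case above needs more work on top of it):

    #include <arm_neon.h>

    /* Approximate 1/x for four floats: vrecpe gives a rough estimate and
     * each vrecps step is one Newton-Raphson refinement. */
    float32x4_t reciprocal_f32(float32x4_t x)
    {
        float32x4_t est = vrecpeq_f32(x);
        est = vmulq_f32(est, vrecpsq_f32(x, est));   /* refinement step 1 */
        est = vmulq_f32(est, vrecpsq_f32(x, est));   /* refinement step 2 */
        return est;
    }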

    One interesting thing is that on Cortex-A8 I've found it to be faster to convert from floating point to fixed point and vice versa in software than it is using VFP, presuming you can ignore inf/NaN/denormals. This is especially true if you have a fixed mantissa, for example if the number has been normalized. If the data is already in VFP/NEON registers I haven't found any penalty in switching to NEON to work on it instead of VFP.
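    A rough sketch of what the software conversion can look like (Q16.16, positive inputs only, inf/NaN/denormals ignored and overflow simply saturated; not the exact code I use):

    #include <stdint.h>
    #include <string.h>

    static uint32_t float_to_q16_16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);                      /* raw IEEE-754 bits  */

        uint32_t mant = (bits & 0x007FFFFFu) | 0x00800000u;  /* implicit leading 1 */
        int exp = (int)((bits >> 23) & 0xFFu) - 127;         /* unbiased exponent  */

        /* value = mant * 2^(exp - 23); Q16.16 result = value * 2^16,
         * i.e. shift mant by (exp - 7). */
        int shift = exp - 7;
        if (shift <= -24)
            return 0;                                        /* too small (incl. 0.0f) */
        if (shift > 8)
            return 0xFFFFFFFFu;                              /* overflow: saturate     */
        return (shift >= 0) ? (mant << shift) : (mant >> -shift);
    }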