
NEON vs VFP usage

Note: This was originally posted on 29th August 2011 at http://forums.arm.com

Hi,

Could I use NEON and VFP at the same time in my application?
What would be the downsides of that?

I also read in the documentation that the compilation flags are as follows:
GCC
-mfpu=neon -mfloat-abi=softfp
-mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp

Do these options control just the usage of NEON intrinsics? Does specifying the option for one (e.g. NEON) prevent me from using the other (e.g. VFP) directly in the code?
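
To make the question concrete, here is a small sketch of what I mean (my own hypothetical example, not taken from the documentation), mixing NEON intrinsics and plain scalar float code in one file:

    /* Hypothetical file mix.c, built with e.g.
     *   gcc -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -c mix.c
     */
    #include <arm_neon.h>

    /* NEON intrinsics: add four floats at once */
    float32x4_t add4(float32x4_t a, float32x4_t b)
    {
        return vaddq_f32(a, b);
    }

    /* Plain scalar float code, which I would expect the compiler
     * to map to VFP instructions */
    float add1(float a, float b)
    {
        return a + b;
    }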

Also specifying "softfp" seems to incur some overhead in my application(at least from the preliminary benchmarks).
I tried to use the "hard" option but then I have linkage error as the runtime libraries are not built with support for that.
Could I get somewhere the runtime libraries built with "hard" or do I have to do it myself?
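
For what it's worth, here is how I currently picture where the "softfp" overhead comes from (my own understanding of the calling conventions, so please correct me if this is wrong):

    /* The same C source builds under both ABIs; only the calling
     * convention for 'x' changes (as I understand the AAPCS):
     *
     *   -mfloat-abi=softfp : 'x' arrives in a core register (r0) and has
     *                        to be moved into a VFP register before the
     *                        multiply
     *   -mfloat-abi=hard   : 'x' arrives directly in VFP register s0
     *
     * The extra core<->VFP moves at every call boundary are where I
     * suspect my overhead is coming from.
     */
    float scale(float x)
    {
        return x * 2.5f;
    }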

Thanks
  • Note: This was originally posted on 1st September 2011 at http://forums.arm.com

    First see if you can actually get your code working in single precision. If you need double precision then NEON won't be an option, period.

    The advice given applies to Cortex-A8 more than Cortex-A9. VFP on Cortex-A9 is roughly as fast as the NEON equivalents in terms of issue rate and latency. If your code doesn't lend itself to vectorization then you're probably better off sticking with VFP, unless you want to have good performance on Cortex-A8. Mixing the two is probably a bad idea on both processors.

    I have found myself mixing VFP and NEON in one instance, where I want a large integer reciprocal on Cortex-A8. VFP can do it (not a full 64-bit result, but close enough) faster than I could with vrecpe plus Newton-Raphson steps. But the VFP divide instruction blocks further NEON instructions from executing. On the other hand, you can still execute integer instructions, so if you can schedule those during the divide you can recover a lot of the time.
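
    For reference, the NEON alternative I am comparing against is the usual estimate-plus-refinement pattern, roughly like this (a simplified single-precision sketch, not my actual integer-reciprocal code):

        #include <arm_neon.h>

        /* Approximate 1.0/d for two floats: vrecpe gives a rough estimate
         * and vrecps computes (2 - d*x), so x * vrecps(d, x) is one
         * Newton-Raphson refinement step. */
        float32x2_t reciprocal2(float32x2_t d)
        {
            float32x2_t x = vrecpe_f32(d);          /* initial estimate */
            x = vmul_f32(x, vrecps_f32(d, x));      /* first refinement */
            x = vmul_f32(x, vrecps_f32(d, x));      /* second refinement */
            return x;
        }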

    One interesting thing is that on Cortex-A8 I've found it faster to convert from floating point to fixed point and vice versa in software than using VFP, provided you can ignore inf/NaN/denormals. This is especially true if you have a fixed mantissa, e.g. if the number has been normalized. If the data is already in VFP/NEON registers I haven't found any penalty in switching to NEON to work on it instead of VFP.
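
    To give an idea of what I mean by doing the conversion in software, here is a simplified float-to-Q16.16 sketch (it assumes a normalized, in-range value and ignores inf/NaN/denormals, and it is not the exact code I use):

        #include <stdint.h>
        #include <string.h>

        /* Convert an IEEE-754 single to Q16.16 fixed point using integer
         * operations only.  Assumes |f| fits in Q16.16 and ignores
         * inf/NaN/denormals. */
        static int32_t float_to_q16_16(float f)
        {
            uint32_t bits;
            memcpy(&bits, &f, sizeof bits);                        /* reinterpret the bits */

            uint32_t sign = bits >> 31;
            int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127;  /* unbiased exponent */
            uint32_t mant = (bits & 0x007FFFFFu) | 0x00800000u;    /* implicit leading 1 */

            /* The value is mant * 2^(exp - 23); scaling by 2^16 for Q16.16
             * means shifting by (exp - 23 + 16) = (exp - 7). */
            int shift = exp - 7;
            if (shift <= -24)
                return 0;                /* underflows to zero (also catches 0.0f) */

            int32_t mag = (shift >= 0) ? (int32_t)(mant << shift)
                                       : (int32_t)(mant >> -shift);
            return sign ? -mag : mag;
        }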