Need help in GCC intrinsics for NEON

Note: This was originally posted on 4th April 2012 at http://forums.arm.com

Hi All,


   Can somebody tell me which GCC and ARM compiler intrinsics generate the NEON assembly instructions below?

vld3.16 {d0,d2,d4},[r0]!   
vld3.16 {d1,d3,d5},[r0]! 

Thanks,
Kiran
  • Note: This was originally posted on 9th April 2012 at http://forums.arm.com

    GCC is the same as RVCT:

    uint16x8x3_t vld3q_u16 (const uint16_t *)

    Check http://gcc.gnu.org/o...Intrinsics.html for the full listing (loads and stores are towards the bottom).
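
    As a rough sketch of how that intrinsic might be used: the helper name split3_u16 and the scalar fallback for builds without NEON are assumptions added for illustration, not part of any library. On ARMv7, compilers typically expand vld3q_u16 into a pair of vld3.16 instructions that use every other d-register, much like the two in the question, but the exact registers are up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   /* assumption: fall back to plain C when NEON is unavailable */
#endif

/* Hypothetical helper: split 8 interleaved 3-channel samples (24 values)
 * into three separate planes of 8 values each. */
void split3_u16(const uint16_t *src, uint16_t *c0, uint16_t *c1, uint16_t *c2)
{
    assert(src != NULL && c0 != NULL && c1 != NULL && c2 != NULL);
#if HAVE_NEON
    /* One de-interleaving load fills three 128-bit (q-sized) vectors. */
    uint16x8x3_t v = vld3q_u16(src);
    vst1q_u16(c0, v.val[0]);
    vst1q_u16(c1, v.val[1]);
    vst1q_u16(c2, v.val[2]);
#else
    /* Scalar equivalent of the de-interleave, for non-NEON builds. */
    for (size_t i = 0; i < 8; ++i) {
        c0[i] = src[3 * i + 0];
        c1[i] = src[3 * i + 1];
        c2[i] = src[3 * i + 2];
    }
#endif
}
```

    Each val[i] member is already a full uint16x8_t, so it can be passed straight to the q-register forms of other intrinsics.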
    One bit of advice - use objdump to check the disassembly GCC emits for NEON intrinsics. Personally I've never been entirely happy with it - it generates an excessive amount of stack traffic to shuffle things between registers - and the intrinsics are so low level you may as well handle register allocation yourself, write the assembler and get the output code you actually wanted in the first place.

    To be fair it is improving a lot in the newer GCC releases, but my personal view is that if you have to spell out instructions using intrinsics one instruction at a time you are basically writing assembler anyway ;)

    Iso
  • Note: This was originally posted on 9th April 2012 at http://forums.arm.com

    Hi, thanks for the reply.

    My actual question should have been different.

    vld3.16 {d0,d2,d4},[r0]!   
    vld3.16 {d1,d3,d5},[r0]! 
    vadd.16 q3,q0,q1

    After the data has been loaded into the d0 and d1 registers, I want to use them together as one Q register (q0).
    I can do that by writing assembly, but I want to know whether the same thing can be done with intrinsics, and how.

    I have also run into the problem you mentioned with the GCC tools.
    However, when there is not much ARM code between the NEON code or NEON intrinsic statements, GCC does better and
    generates tighter NEON assembly, without data transfers between registers and the stack.
    What I observed is that register allocation for the NEON variables used in intrinsics is not handled as well as it is for ARM code.

    Please let me know whether what I am observing is correct.


    BRs,
    Kiran Kumar





  • Note: This was originally posted on 10th April 2012 at http://forums.arm.com

    Half the point of intrinsics is that they hide register allocation, which is a good thing for the compiler to handle and is closely tied to instruction scheduling, the other reason to use them rather than asm.
    I can't see a way of doing what you want directly with intrinsics. You generally have to either cast pointers to the intrinsic structure types and type-pun (which is probably unsafe on newer compilers with strict aliasing), or memcpy fields between them to get things in the order you want. Unfortunately, compilers don't really like the aliasing of d-registers to q-registers, so this tends to be one area where code generation suffers a bit, in my experience.
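
    For the specific sequence in this thread, one possible route is sketched below; it is an illustration under stated assumptions, not a guarantee about what GCC emits. The helper names add_first_two_channels and combine_halves, and the scalar fallback for builds without NEON, are hypothetical. The first helper relies on each val[i] of a uint16x8x3_t already being a q-sized vector; the second uses vcombine_u16 to join two d-sized halves without casts or memcpy. Which d and q registers end up being used is still decided by the compiler, so the disassembly is worth checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   /* assumption: fall back to plain C when NEON is unavailable */
#endif

/* Hypothetical helper: de-interleave 8 three-channel samples (24 values)
 * and store channel0 + channel1 (modulo 2^16) into dst[0..7]. */
void add_first_two_channels(const uint16_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
#if HAVE_NEON
    uint16x8x3_t v = vld3q_u16(src);                 /* three q-sized vectors   */
    uint16x8_t sum = vaddq_u16(v.val[0], v.val[1]);  /* cf. vadd.16 q3, q0, q1  */
    vst1q_u16(dst, sum);
#else
    for (size_t i = 0; i < 8; ++i)
        dst[i] = (uint16_t)(src[3 * i] + src[3 * i + 1]);
#endif
}

/* Hypothetical helper: join two 64-bit halves (4 values each) into one
 * 128-bit vector and store it, without pointer casts or type punning. */
void combine_halves(const uint16_t *lo, const uint16_t *hi, uint16_t *dst)
{
    assert(lo != NULL && hi != NULL && dst != NULL);
#if HAVE_NEON
    uint16x4_t d_lo = vld1_u16(lo);           /* d-sized vector, low half   */
    uint16x4_t d_hi = vld1_u16(hi);           /* d-sized vector, high half  */
    uint16x8_t q = vcombine_u16(d_lo, d_hi);  /* q-sized vector from both   */
    vst1q_u16(dst, q);
#else
    for (size_t i = 0; i < 4; ++i) {
        dst[i] = lo[i];
        dst[i + 4] = hi[i];
    }
#endif
}
```

    Going the other way, vget_low_u16 and vget_high_u16 extract the two d-sized halves of a q-sized vector.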