Need help in GCC intrinsics for NEON

Note: This was originally posted on 4th April 2012 at http://forums.arm.com

Hi All,


   Can somebody tell me what the equivalent GCC and ARM intrinsics are for generating the NEON assembly statements below?

vld3.16 {d0,d2,d4},[r0]!   
vld3.16 {d1,d3,d5},[r0]! 

Thanks,
Kiran
  • Note: This was originally posted on 9th April 2012 at http://forums.arm.com

    Hi, thanks for the reply.

    My actual question should have been different.

    vld3.16 {d0,d2,d4},[r0]!   
    vld3.16 {d1,d3,d5},[r0]! 
    vadd.16 q3,q0,q1

    After filling the data into the d0 and d1 registers, I want to use them as one Q register (q0).
    I can do that by writing assembly, but I want to know whether I can do the same thing using intrinsics, and how.
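
    A minimal sketch of one way this might look with intrinsics, assuming the two D-register loads are done with `vld3_u16` and the halves are joined with `vcombine_u16`. The function name `add_channels_from_halves` and the buffer layout (8 interleaved triples of `uint16_t`) are assumptions made for illustration, and the compiler picks the actual registers, so d0/d1/q0 are not guaranteed. The scalar branch is only there so the sketch can be checked on a non-ARM host.

    ```c
    #include <stdint.h>

    #if defined(__ARM_NEON) || defined(__ARM_NEON__)
    #include <arm_neon.h>
    #endif

    /* Hypothetical helper (not from the thread): src holds 8 interleaved
       (a, b, c) triples of uint16_t; dst receives a[i] + b[i] for i = 0..7,
       wrapping modulo 2^16. */
    void add_channels_from_halves(const uint16_t *src, uint16_t *dst)
    {
    #if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Two 64-bit de-interleaving loads, 4 triples (12 elements) each. */
        uint16x4x3_t lo = vld3_u16(src);
        uint16x4x3_t hi = vld3_u16(src + 12);

        /* Join the two D-sized halves into Q-sized values. */
        uint16x8_t a = vcombine_u16(lo.val[0], hi.val[0]);
        uint16x8_t b = vcombine_u16(lo.val[1], hi.val[1]);

        /* 128-bit add, then store the 8 results. */
        vst1q_u16(dst, vaddq_u16(a, b));
    #else
        /* Scalar equivalent for hosts without NEON. */
        for (int i = 0; i < 8; ++i)
            dst[i] = (uint16_t)(src[3 * i] + src[3 * i + 1]);
    #endif
    }
    ```

    Whether the compiler turns `vcombine_u16` into no instruction at all or into extra register moves depends on the compiler version and options, so the disassembly is worth checking.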

    I also experienced the same problem you mentioned with the GCC tools.
    But when there is not much ARM code between the NEON code or NEON intrinsic statements, GCC does better at
    generating "tighter" NEON assembly, without data transfers between registers and the stack.
    What I observed is that register allocation for the NEON variables used in intrinsics is not done as well as it is for the ARM code.

    Please let me know whether what I am observing is correct.


    BRs,
    Kiran Kumar





    GCC is the same as RVCT:

    uint16x8x3_t vld3q_u16 (const uint16_t *)
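
    As a rough sketch of how that intrinsic could be used for a load followed by a 128-bit add: the function name `add_first_two_channels` and the buffer layout (8 interleaved triples of `uint16_t`) are assumptions for illustration, not something from the original posts. The scalar branch exists only so the sketch can be compiled and checked on a host without NEON.

    ```c
    #include <stdint.h>

    #if defined(__ARM_NEON) || defined(__ARM_NEON__)
    #include <arm_neon.h>
    #endif

    /* Hypothetical helper (not from the thread): src holds 8 interleaved
       (a, b, c) triples of uint16_t; dst receives a[i] + b[i] for i = 0..7,
       wrapping modulo 2^16. */
    void add_first_two_channels(const uint16_t *src, uint16_t *dst)
    {
    #if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* De-interleaving load of 24 elements; val[0], val[1] and val[2]
           are each a full 128-bit vector of 8 lanes. */
        uint16x8x3_t v = vld3q_u16(src);

        /* The Q-sized values can be used directly in a 128-bit add. */
        uint16x8_t sum = vaddq_u16(v.val[0], v.val[1]);

        vst1q_u16(dst, sum);
    #else
        /* Scalar equivalent for hosts without NEON. */
        for (int i = 0; i < 8; ++i)
            dst[i] = (uint16_t)(src[3 * i] + src[3 * i + 1]);
    #endif
    }
    ```

    Register allocation is left to the compiler here, so the exact registers in the generated code may differ from the hand-written sequence.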

    Check http://gcc.gnu.org/o...Intrinsics.html for the full listing (loads and stores are towards the bottom).
    One bit of advice - use objdump to check the disassembly GCC emits for NEON intrinsics. Personally I've never been entirely happy with it - it generates an excessive amount of stack traffic to shuffle things between registers - and the intrinsics are so low level you may as well handle register allocation yourself, write the assembler and get the output code you actually wanted in the first place.

    To be fair it is improving a lot in the newer GCC releases, but my personal view is that if you have to spell out instructions using intrinsics one instruction at a time you are basically writing assembler anyway ;)

    Iso