NEON intrinsics arguments

Note: This was originally posted on 30th August 2011 at http://forums.arm.com

I'm using the return value from an intrinsic as an argument for another one like below:

int16x8_t final_vec = vorrq_s16(vandq_s16(final_mask_vec, bad_vec), vandq_s16(vmvnq_s16(final_mask_vec), good_vec));


Is that a good idea? Is there anything against using this kind of construct?
I guess the temporary vector returned from each inner intrinsic call still has to be "assigned" to a
NEON register (at least that's what I expect the compiler to do). What happens if, in a loop I am trying
to optimize with intrinsics, I need more Q registers than NEON actually has available?
  • Note: This was originally posted on 30th August 2011 at http://forums.arm.com

    Hi,
    This approach is fine.

    As you guessed, the compiler will still have to create temporary/intermediate values, but it will reuse registers without any trouble, so in your example it's likely to never use more than one additional temporary register besides the final one. Something like:

    int16x8_t final_vec = vmvnq_s16(final_mask_vec);
    int16x8_t tmp0 = vandq_s16(final_mask_vec, bad_vec);
    final_vec = vandq_s16(final_vec, good_vec);
    final_vec = vorrq_s16(tmp0, final_vec); // 'tmp0' register available after this

    As with general ARM code, any values that can't fit into the available registers will be spilled to the stack.

    > int16x8_t final_vec = vorrq_s16(vandq_s16(final_mask_vec, bad_vec), vandq_s16(vmvnq_s16(final_mask_vec), good_vec));

    I should add that this example can be replaced with the vbslq_s16 intrinsic (indeed the compiler might manage this for you):

    int16x8_t final_vec = vbslq_s16( vreinterpretq_u16_s16( final_mask_vec ), bad_vec, good_vec );
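    To make the equivalence concrete, here is a minimal sketch that applies both forms to whole arrays. The helper names select_and_or and select_bsl are hypothetical, not taken from the original code. The NEON paths are only compiled when the compiler defines __ARM_NEON, and a plain scalar loop handles the tail (and everything on non-NEON targets). It assumes nothing about what your compiler emits; inspecting the generated assembly is the only reliable way to see whether a real loop spills registers.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: nested AND/OR form from the question.
 * out[i] = (mask[i] & bad[i]) | (~mask[i] & good[i]) */
static void select_and_or(const int16_t *mask, const int16_t *bad,
                          const int16_t *good, int16_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 8 <= n; i += 8) {
        int16x8_t m = vld1q_s16(mask + i);
        int16x8_t b = vld1q_s16(bad + i);
        int16x8_t g = vld1q_s16(good + i);
        /* Intrinsic results are passed straight into other intrinsics. */
        vst1q_s16(out + i,
                  vorrq_s16(vandq_s16(m, b), vandq_s16(vmvnq_s16(m), g)));
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        out[i] = (int16_t)((mask[i] & bad[i]) | (~mask[i] & good[i]));
}

/* Hypothetical helper: the same selection written as a bitwise select. */
static void select_bsl(const int16_t *mask, const int16_t *bad,
                       const int16_t *good, int16_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 8 <= n; i += 8) {
        int16x8_t m = vld1q_s16(mask + i);
        int16x8_t b = vld1q_s16(bad + i);
        int16x8_t g = vld1q_s16(good + i);
        /* Mask bits set to 1 pick from b, bits set to 0 pick from g. */
        vst1q_s16(out + i, vbslq_s16(vreinterpretq_u16_s16(m), b, g));
    }
#endif
    /* Scalar bit select: start from good, flip the bits where mask is 1
     * and good differs from bad. */
    for (; i < n; ++i)
        out[i] = (int16_t)(good[i] ^ ((good[i] ^ bad[i]) & mask[i]));
}
```

    Both helpers should produce identical output for any mask, including masks that are not all-ones or all-zeros per lane, because the selection is done bit by bit.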

    Simon.