NEON intrinsics arguments

Note: This was originally posted on 30th August 2011 at http://forums.arm.com

I'm using the return value of one intrinsic as an argument to another, like this:

int16x8_t final_vec = vorrq_s16(vandq_s16(final_mask_vec, bad_vec), vandq_s16(vmvnq_s16(final_mask_vec), good_vec));

Is that a good idea? Is there any reason to avoid this kind of construct?
I assume the temporary vector returned by each inner intrinsic call still has to be assigned to a
NEON register (at least, that is what I expect the compiler to do). What happens if, in a loop I am
trying to optimize with intrinsics, I need more Q registers than NEON actually has available?


Simon Pilgrim over 12 years ago

Note: This was originally posted on 30th August 2011 at http://forums.arm.com

Hi,
This approach is fine.

As you guessed, the compiler still has to create temporary/intermediate values, but it will reuse registers without any trouble, so your example is likely never to need more than one additional temporary register besides the one holding the final result. Something like:

int16x8_t final_vec = vmvnq_s16(final_mask_vec);
int16x8_t tmp0 = vandq_s16(final_mask_vec, bad_vec);
final_vec = vandq_s16(final_vec, good_vec);
final_vec = vorrq_s16(tmp0, final_vec); // 'tmp0' register available after this
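
To make the equivalence concrete, here is a minimal sketch that runs the nested form and the step-by-step form on the same input and checks that the results match. The function names (select_nested, select_stepwise, check_forms_agree) and the test values are made up for illustration, and the scalar fallback is only an assumption added so the file also builds on a machine without NEON. It says nothing about which registers a particular compiler picks; that is only visible in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Nested form from the question: inner results feed the outer call directly. */
void select_nested(const int16_t *mask, const int16_t *bad,
                   const int16_t *good, int16_t *out)
{
#if HAVE_NEON
    int16x8_t m = vld1q_s16(mask);
    int16x8_t b = vld1q_s16(bad);
    int16x8_t g = vld1q_s16(good);
    vst1q_s16(out, vorrq_s16(vandq_s16(m, b), vandq_s16(vmvnq_s16(m), g)));
#else
    /* Scalar stand-in (assumption: only used when NEON is unavailable). */
    for (int i = 0; i < 8; ++i)
        out[i] = (int16_t)((mask[i] & bad[i]) | (~mask[i] & good[i]));
#endif
}

/* Step-by-step form with named temporaries, as written out in the reply. */
void select_stepwise(const int16_t *mask, const int16_t *bad,
                     const int16_t *good, int16_t *out)
{
#if HAVE_NEON
    int16x8_t m = vld1q_s16(mask);
    int16x8_t b = vld1q_s16(bad);
    int16x8_t g = vld1q_s16(good);
    int16x8_t result = vmvnq_s16(m);   /* ~mask */
    int16x8_t tmp0 = vandq_s16(m, b);  /* mask & bad */
    result = vandq_s16(result, g);     /* ~mask & good */
    result = vorrq_s16(tmp0, result);  /* tmp0 is dead after this line */
    vst1q_s16(out, result);
#else
    for (int i = 0; i < 8; ++i) {
        int16_t result = (int16_t)~mask[i];
        int16_t tmp0 = (int16_t)(mask[i] & bad[i]);
        result = (int16_t)(result & good[i]);
        out[i] = (int16_t)(tmp0 | result);
    }
#endif
}

/* Runs both forms on the same lanes and asserts that they agree. */
void check_forms_agree(void)
{
    const int16_t mask[8] = {-1, 0, -1, 0, -1, 0, -1, 0};
    const int16_t bad[8]  = {1, 2, 3, 4, 5, 6, 7, 8};
    const int16_t good[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int16_t a[8], b[8];

    select_nested(mask, bad, good, a);
    select_stepwise(mask, bad, good, b);
    assert(memcmp(a, b, sizeof a) == 0);
    assert(a[0] == 1 && a[1] == 20); /* mask set picks bad, clear picks good */
}
```

Calling check_forms_agree() from main is enough to exercise it. To see how many registers are actually used, compile with optimization and inspect the assembly output of your own toolchain.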

As with ARM code in general, any values that cannot be fitted into the available registers will be spilled to the stack.

> int16x8_t final_vec = vorrq_s16(vandq_s16(final_mask_vec, bad_vec), vandq_s16(vmvnq_s16(final_mask_vec), good_vec));

I should add that this example can be replaced with the vbslq_s16 intrinsic (indeed the compiler might manage this for you):

int16x8_t final_vec = vbslq_s16( vreinterpretq_u16_s16( final_mask_vec ), bad_vec, good_vec );
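
As a rough check of that replacement, the sketch below compares the AND/OR/NOT form with vbslq_s16 on the same data, including lanes where the mask is only partly set, since the select works bit by bit and not lane by lane. The function names and test values are hypothetical, and the scalar fallback (which uses the XOR identity good ^ ((good ^ bad) & mask) as a stand-in for a bitwise select) is an assumption added so the file also builds without NEON.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* (mask & bad) | (~mask & good), as in the original question. */
void select_and_or(const int16_t *mask, const int16_t *bad,
                   const int16_t *good, int16_t *out)
{
#if HAVE_NEON
    int16x8_t m = vld1q_s16(mask);
    int16x8_t b = vld1q_s16(bad);
    int16x8_t g = vld1q_s16(good);
    vst1q_s16(out, vorrq_s16(vandq_s16(m, b), vandq_s16(vmvnq_s16(m), g)));
#else
    /* Scalar stand-in (assumption: only used when NEON is unavailable). */
    for (int i = 0; i < 8; ++i)
        out[i] = (int16_t)((mask[i] & bad[i]) | (~mask[i] & good[i]));
#endif
}

/* Same selection through the bitwise-select intrinsic. */
void select_bsl(const int16_t *mask, const int16_t *bad,
                const int16_t *good, int16_t *out)
{
#if HAVE_NEON
    int16x8_t m = vld1q_s16(mask);
    int16x8_t b = vld1q_s16(bad);
    int16x8_t g = vld1q_s16(good);
    /* vbslq_s16 takes an unsigned mask, hence the reinterpret. */
    vst1q_s16(out, vbslq_s16(vreinterpretq_u16_s16(m), b, g));
#else
    /* Scalar model of a bitwise select, written with the XOR identity. */
    for (int i = 0; i < 8; ++i)
        out[i] = (int16_t)(good[i] ^ ((good[i] ^ bad[i]) & mask[i]));
#endif
}

/* Runs both forms, including partly set masks, and asserts that they agree. */
void check_bsl_matches(void)
{
    const int16_t mask[8] = {-1, 0, -1, 0, 0x00FF, 0x0F0F, -1, 0};
    const int16_t bad[8]  = {1, 2, 3, 4, 0x1234, 0x5678, -7, -8};
    const int16_t good[8] = {10, 20, 30, 40, 0x4321, 0x1111, -70, -80};
    int16_t a[8], b[8];

    select_and_or(mask, bad, good, a);
    select_bsl(mask, bad, good, b);
    assert(memcmp(a, b, sizeof a) == 0);
    /* Lane 4: low byte comes from bad (0x34), high byte from good (0x43). */
    assert(a[4] == 0x4334);
}
```

The partly set mask in lane 4 is there to show that bits, not whole lanes, are selected. With masks that are all ones or all zeros per lane, the result is a plain per-lane choice between bad_vec and good_vec.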

Simon.
