NEON: multiplying an 8-bit vector by a 16-bit scalar


I'm experimenting with NEON on an i.MX7D SoC.

I'm trying to do the following calculation.

I've got 8 RGB pixels stored in a vector: uint8x8x3_t rgb;

I then want to calculate:






My R,G,B values are 8-bit.

Now, the scalar I multiply by to get rside is 19*19 = 361, which does not fit in 8 bits, so it has to be 16-bit.

So I need a way to multiply an 8-bit vector by a 16-bit scalar.

I'm seeing:

uint16x8_t  vmulq_u16(uint16x8_t a, uint16x8_t b);

uint16x4_t vmul_lane_u16 (uint16x4_t, uint16x4_t, const int) 

whereas I'd been hoping for:
uint16x8_t something(uint8x8_t r, const int)

I guess that kind of abstraction doesn't exist.

So now I'm thinking:
uint16x8_t v_rside;
uint16x4_t v_rside_u;
uint16x4_t v_rside_l;

uint8x8_t side = vdup_n_u8(19);

v_rside = vmull_u8(rgb.val[0], side);
/* vmul_lane_u16 wants a second vector plus a lane index; vmul_n_u16 takes the scalar directly */
v_rside_l = vmul_n_u16(vget_low_u16(v_rside), 19);
v_rside_u = vmul_n_u16(vget_high_u16(v_rside), 19);
v_rside = vcombine_u16(v_rside_l, v_rside_u);

Is that the most efficient way to do it?
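
For comparison, here is a minimal sketch of a shorter route, assuming rside is simply r * 361 kept in 16 bits: widen the 8-bit lanes once with vmovl_u8, then multiply all eight lanes by the scalar with vmulq_n_u16. The helper name scale_by_361 and the array-based interface are hypothetical, chosen only so the snippet stands alone, and the scalar fallback is there so it also builds without NEON. Note that r * 361 only fits in 16 bits for r <= 181, so larger inputs wrap modulo 65536 in both paths.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = (uint16_t)(in[i] * 361) for 8 pixels.
 * Assumes rside = r * 19 * 19; results wrap modulo 65536 when r > 181. */
void scale_by_361(const uint8_t in[8], uint16_t out[8])
{
#if defined(__ARM_NEON)
    uint8x8_t  r   = vld1_u8(in);            /* load 8 x u8            */
    uint16x8_t r16 = vmovl_u8(r);            /* widen to 8 x u16       */
    uint16x8_t rs  = vmulq_n_u16(r16, 361);  /* multiply by the scalar */
    vst1q_u16(out, rs);                      /* store 8 x u16          */
#else
    /* Portable fallback with the same wrapping behaviour. */
    for (int i = 0; i < 8; ++i)
        out[i] = (uint16_t)((unsigned)in[i] * 361u);
#endif
}
```

If the full product is needed for r > 181, the multiply would have to widen to 32 bits instead, for example with vmull_n_u16 on each half.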

Then there's rside_ext, which I think I can get by adding 19*19 to rside.
Except I don't see a vadd that takes a scalar.

I see something like:

vmlal_lane_u16 (uint32x4_t __a, uint16x4_t __b, uint16x4_t __c, const int __d)

But that expands it to uint32x4 which I don't need.

So I guess I would need to do:

v_rside_ext = vaddq_u16(v_rside, side);

except I'd first need to make a uint16x8_t side = vdupq_n_u16(19) (the q form, since vdup_n_u16 only gives a uint16x4_t).
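
A minimal sketch of that add, assuming the constant is broadcast once with vdupq_n_u16 and then added with vaddq_u16. The helper name add_const_u16 and the parameter k are hypothetical; k stands in for whatever offset the real formula needs. The scalar fallback is only there so the snippet builds without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = (uint16_t)(in[i] + k) for 8 lanes. */
void add_const_u16(const uint16_t in[8], uint16_t k, uint16_t out[8])
{
#if defined(__ARM_NEON)
    uint16x8_t v    = vld1q_u16(in);      /* load 8 x u16             */
    uint16x8_t kvec = vdupq_n_u16(k);     /* broadcast k to all lanes */
    vst1q_u16(out, vaddq_u16(v, kvec));   /* lane-wise add, wraps     */
#else
    /* Portable fallback with the same wrapping behaviour. */
    for (int i = 0; i < 8; ++i)
        out[i] = (uint16_t)(in[i] + k);
#endif
}
```

In a loop over many pixels the vdupq_n_u16 would normally be hoisted out, so the per-iteration cost is just the add.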

This is getting ugly now.

I figured I'd better ask whether I'm thinking about all of this correctly, or whether there's a much cleaner way to do it.


