Hi,
I'm experimenting with NEON on an i.MX7D SoC.
I'm trying to do the following calculation.
I've got 8 RGB pixels stored in a vector: uint8x8x3_t rgb
I then want to calculate:
rside=R*19*19
rsideext=(R+1)*19*19
gside=G*19
gsideext=(G+1)*19
bsideext=B+1
My R,G,B values are 8-bit.
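For reference, the per-pixel calculation above can be written as a scalar sketch. The struct and function names (sides_t, sides_scalar) are hypothetical and only for illustration. The fields are 32-bit on purpose: with a full 8-bit R, R*19*19 can reach 255*361 = 92055, which does not fit in 16 bits, so a 16-bit result is only safe if R stays at or below 180 (an assumption about the data, not something stated above).

```c
#include <stdint.h>

/* Hypothetical container for the five values computed per pixel. */
typedef struct {
    uint32_t rside;     /* R * 19 * 19       */
    uint32_t rsideext;  /* (R + 1) * 19 * 19 */
    uint32_t gside;     /* G * 19            */
    uint32_t gsideext;  /* (G + 1) * 19      */
    uint32_t bsideext;  /* B + 1             */
} sides_t;

/* Scalar reference: 32-bit arithmetic so nothing overflows for 8-bit input. */
static sides_t sides_scalar(uint8_t r, uint8_t g, uint8_t b)
{
    sides_t s;
    s.rside    = (uint32_t)r * 19u * 19u;
    s.rsideext = ((uint32_t)r + 1u) * 19u * 19u;
    s.gside    = (uint32_t)g * 19u;
    s.gsideext = ((uint32_t)g + 1u) * 19u;
    s.bsideext = (uint32_t)b + 1u;
    return s;
}
```

A reference like this is handy for checking any vectorized version against known-good values.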
Now, my scalar value of 19*19 to get rside is 16-bit.
So I need to find a way to do vmul 8-bit with a 16-bit scalar.
I'm seeing:
uint16x8_t vmulq_u16(uint16x8_t a, uint16x8_t b);
or
uint16x4_t vmul_lane_u16 (uint16x4_t, uint16x4_t, const int)
whereas I'd been hoping for:
uint16x8_t something(uint8x8_t r, const int)
I guess that kind of abstraction doesn't exist.
So now I'm thinking:
uint16x8_t v_rside;
uint16x4_t v_rside_u;
uint16x4_t v_rside_l;
uint8x8_t side = vdup_n_u8 (19);
v_rside = vmull_u8(rgb.val[0], side);
v_rside_l = vmul_lane_u16(vget_low_u16(v_rside), 19);
v_rside_u = vmul_lane_u16(vget_high_u16(v_rside), 19);
v_rside = vcombine_u16(v_rside_l, v_rside_u);
Is that the most efficient way to do it?
Then I've got the rside_ext which I think I can get by adding 19*19.
Except I don't see a vadd scalar.
I see something like:
vmlal_lane_u16 (uint32x4_t __a, uint16x4_t __b, uint16x4_t __c, const int __d)
But that expands it to uint32x4 which I don't need.
So I guess I would need to do:
v_rside_ext = vaddq_u16(v_rside, side);
except I'd need to make a uint16x8_t side = vdupq_n_u16(19*19).
This is getting ugly now.
I figure I better ask if I'm thinking about all of this correctly or if there's a much cleaner way to do this.
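One possibly cleaner route, as a sketch only: widen R to 16 bits with vmovl_u8, multiply by the scalar 361 with vmulq_n_u16 (the _n_ variants take a plain scalar, whereas the _lane_ variants take a second vector plus a lane index), and add a broadcast constant made with vdupq_n_u16. The function name sides_8px, the interleaved input layout, and the output arrays are hypothetical. It also assumes every R value is at most 180 so that (R+1)*361 fits in 16 bits; for full-range 8-bit R the results would need 32-bit lanes. A scalar fallback is included so the sketch builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: computes the five values for 8 interleaved RGB pixels.
   Assumption: R <= 180, so (R + 1) * 361 <= 65341 fits in uint16_t. */
static void sides_8px(const uint8_t *rgb_interleaved,
                      uint16_t rside[8], uint16_t rsideext[8],
                      uint16_t gside[8], uint16_t gsideext[8],
                      uint8_t bsideext[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Load 24 bytes and de-interleave into R, G, B vectors. */
    uint8x8x3_t rgb = vld3_u8(rgb_interleaved);

    /* R: widen to 16 bits, then multiply all 8 lanes by the scalar 361. */
    uint16x8_t v_rside    = vmulq_n_u16(vmovl_u8(rgb.val[0]), 19 * 19);
    /* (R + 1) * 361 == R * 361 + 361 */
    uint16x8_t v_rsideext = vaddq_u16(v_rside, vdupq_n_u16(19 * 19));

    /* G: 19 fits in 8 bits, so a widening 8x8 -> 16 multiply is enough. */
    uint16x8_t v_gside    = vmull_u8(rgb.val[1], vdup_n_u8(19));
    /* (G + 1) * 19 == G * 19 + 19 */
    uint16x8_t v_gsideext = vaddq_u16(v_gside, vdupq_n_u16(19));

    /* B + 1 in 8 bits (wraps to 0 if B == 255). */
    uint8x8_t v_bsideext  = vadd_u8(rgb.val[2], vdup_n_u8(1));

    vst1q_u16(rside, v_rside);
    vst1q_u16(rsideext, v_rsideext);
    vst1q_u16(gside, v_gside);
    vst1q_u16(gsideext, v_gsideext);
    vst1_u8(bsideext, v_bsideext);
#else
    /* Scalar fallback with the same 16-bit / 8-bit result widths. */
    for (int i = 0; i < 8; i++) {
        uint8_t r = rgb_interleaved[3 * i];
        uint8_t g = rgb_interleaved[3 * i + 1];
        uint8_t b = rgb_interleaved[3 * i + 2];
        rside[i]    = (uint16_t)(r * 361);
        rsideext[i] = (uint16_t)((r + 1) * 361);
        gside[i]    = (uint16_t)(g * 19);
        gsideext[i] = (uint16_t)((g + 1) * 19);
        bsideext[i] = (uint8_t)(b + 1);
    }
#endif
}
```

Compared with splitting into low and high halves and recombining, this keeps each result in a single q register and needs no vget_low/vget_high/vcombine steps. Whether it is actually faster on a given core would need measuring.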
Thanks!