Hi All,
I need to perform a multiply-long (widening multiply) operation on the uint16x8_t data type on ARMv8.
The ARMv7 implementation would be as follows:
uint16x8_t u16x8_data1 = vld1q_u16(pBuffer1);
uint16x8_t u16x8_data2 = vld1q_u16(pBuffer2);
uint32x4_t u32x4_mul_result_low = vmull_u16(vget_low_u16(u16x8_data1),vget_low_u16(u16x8_data2));
uint32x4_t u32x4_mul_result_high = vmull_u16(vget_high_u16(u16x8_data1),vget_high_u16(u16x8_data2));
In ARMv8 we have the intrinsic vmull_high_u16(), which operates directly on the upper 4 elements of u16x8_data1 and u16x8_data2, i.e.
uint32x4_t u32x4_mul_result_high = vmull_high_u16(u16x8_data1, u16x8_data2);
so the vget_high_u16() calls can be avoided for the upper half.
But there is no corresponding vmull_low_u16() intrinsic for the lower 4 elements.
So my question is: how can I perform the multiply-long on the lower 4 elements without using the vget_low_u16() intrinsic?
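
To make the question concrete, here is a minimal self-contained sketch of what I have at the moment. The function name widening_mul_u16x8 and the HAVE_NEON macro are just names made up for this post. The low half still goes through vget_low_u16(), which is the part I am asking about; the scalar fallback is only there so the snippet also builds on a machine without NEON.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1 /* hypothetical macro, used only in this sketch */
#endif

/* Hypothetical helper: out[i] = (uint32_t)a[i] * b[i] for i = 0..7.
 * a and b point to 8 uint16_t values each, out to 8 uint32_t values. */
void widening_mul_u16x8(const uint16_t *a, const uint16_t *b, uint32_t *out)
{
    assert(a && b && out);
#if defined(HAVE_NEON)
    uint16x8_t va = vld1q_u16(a);
    uint16x8_t vb = vld1q_u16(b);

    /* Low 4 lanes: same form as on ARMv7, still written with vget_low_u16(). */
    uint32x4_t lo = vmull_u16(vget_low_u16(va), vget_low_u16(vb));

#if defined(__aarch64__)
    /* High 4 lanes on ARMv8 (AArch64): no vget_high_u16() needed. */
    uint32x4_t hi = vmull_high_u16(va, vb);
#else
    /* High 4 lanes on ARMv7: split the vector explicitly. */
    uint32x4_t hi = vmull_u16(vget_high_u16(va), vget_high_u16(vb));
#endif

    vst1q_u32(out, lo);
    vst1q_u32(out + 4, hi);
#else
    /* Scalar reference with the same result, for targets without NEON. */
    for (int i = 0; i < 8; ++i)
        out[i] = (uint32_t)a[i] * (uint32_t)b[i];
#endif
}
```

Calling it with two 8-element uint16_t buffers fills an 8-element uint32_t buffer with the full 32-bit products, low lanes first.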