Hi All,
I need to perform a multiply-long operation on uint16x8_t data on ARM v8.
The ARM v7 implementation would be as follows:
uint16x8_t u16x8_data1 = vld1q_u16(pBuffer1);
uint16x8_t u16x8_data2 = vld1q_u16(pBuffer2);
uint32x4_t u32x4_mul_result_low = vmull_u16(vget_low_u16(u16x8_data1), vget_low_u16(u16x8_data2));
uint32x4_t u32x4_mul_result_high = vmull_u16(vget_high_u16(u16x8_data1), vget_high_u16(u16x8_data2));
In ARM v8 (AArch64) there is the intrinsic vmull_high_u16(), which operates directly on the upper four elements (lanes 4 to 7) of u16x8_data1 and u16x8_data2, so the vget_high_u16() call can be avoided:
uint32x4_t u32x4_mul_result_high = vmull_high_u16(u16x8_data1, u16x8_data2);
However, there is no corresponding vmull_low_u16() intrinsic for the first four (low) elements.
So my question is: how can I perform the multiply long on the lower four elements without using vget_low_u16()?
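A minimal sketch of the usual AArch64 pattern, assuming GCC or Clang targeting AArch64. On AArch64 the low 64 bits of a 128-bit V register are directly addressable as its D view, so vget_low_u16() typically generates no instruction at all: vmull_u16(vget_low_u16(a), vget_low_u16(b)) usually compiles to a single UMULL, and vmull_high_u16() to a single UMULL2. The function name mull_u16x8 and the low/high output arrays are hypothetical; pBuffer1 and pBuffer2 are assumed to point to at least eight uint16_t values each. The #else branch is a plain scalar fallback (not NEON) so the sketch also builds on other targets.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widening multiply of eight uint16_t lanes.
 * low[i]  = pBuffer1[i]     * pBuffer2[i]      for i = 0..3
 * high[i] = pBuffer1[i + 4] * pBuffer2[i + 4]  for i = 0..3 */
void mull_u16x8(const uint16_t *pBuffer1, const uint16_t *pBuffer2,
                uint32_t low[4], uint32_t high[4])
{
#if defined(__aarch64__)
    uint16x8_t u16x8_data1 = vld1q_u16(pBuffer1);
    uint16x8_t u16x8_data2 = vld1q_u16(pBuffer2);

    /* Lower four lanes: vget_low_u16 normally just selects the D view of
     * the same register, so this typically becomes a single UMULL. */
    uint32x4_t u32x4_mul_result_low =
        vmull_u16(vget_low_u16(u16x8_data1), vget_low_u16(u16x8_data2));

    /* Upper four lanes: typically a single UMULL2. */
    uint32x4_t u32x4_mul_result_high = vmull_high_u16(u16x8_data1, u16x8_data2);

    vst1q_u32(low, u32x4_mul_result_low);
    vst1q_u32(high, u32x4_mul_result_high);
#else
    /* Scalar fallback (not NEON): same lane mapping, widened to 32 bits
     * before multiplying so 65535 * 65535 does not overflow. */
    for (int i = 0; i < 4; i++) {
        low[i]  = (uint32_t)pBuffer1[i] * pBuffer2[i];
        high[i] = (uint32_t)pBuffer1[i + 4] * pBuffer2[i + 4];
    }
#endif
}
```

Whether a given toolchain really emits nothing for vget_low_u16() is easiest to confirm by inspecting the generated assembly (for example, compiling with -O2 -S and looking for UMULL/UMULL2).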