
ARM v8 Neon instruction for multiply long

Hi All,

I need to perform a multiply-long operation on uint16x8_t data on ARM v8.

 

The ARM v7 implementation would be as follows:  

uint16x8_t u16x8_data1 = vld1q_u16(pBuffer1);
uint16x8_t u16x8_data2 = vld1q_u16(pBuffer2);
uint32x4_t u32x4_mul_result_low = vmull_u16(vget_low_u16(u16x8_data1), vget_low_u16(u16x8_data2));
uint32x4_t u32x4_mul_result_high = vmull_u16(vget_high_u16(u16x8_data1), vget_high_u16(u16x8_data2));

 

In ARM v8 we have the intrinsic vmull_high_u16(), which operates directly on the upper 4 elements of u16x8_data1 and u16x8_data2.

But there is no corresponding intrinsic for the lower 4 elements.

 

That is, uint32x4_t u32x4_mul_result_high = vmull_high_u16(u16x8_data1, u16x8_data2); so the vget_high_u16() calls can be avoided here.

But there is no corresponding vmull_low_u16() intrinsic.
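To make the lane behaviour concrete, the two widening multiplies can be modelled in plain C. This is only a scalar sketch of the semantics described above; the helper names mull_low_ref and mull_high_ref are hypothetical and are not part of arm_neon.h.

```c
#include <stdint.h>

/* Scalar model of vmull_u16(vget_low_u16(a), vget_low_u16(b)):
   lanes 0..3 are widened to 32 bits and multiplied. */
static void mull_low_ref(const uint16_t a[8], const uint16_t b[8], uint32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (uint32_t)a[i] * (uint32_t)b[i];
}

/* Scalar model of vmull_high_u16(a, b):
   the same widening multiply, applied to lanes 4..7. */
static void mull_high_ref(const uint16_t a[8], const uint16_t b[8], uint32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (uint32_t)a[i + 4] * (uint32_t)b[i + 4];
}
```

The cast to uint32_t before multiplying matters: it keeps the full 32-bit product, which is what the "long" in multiply long refers to.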

 

So my question is: how can I perform the multiply long on the lower elements without using the vget intrinsic?
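One possible approach, offered as a sketch rather than a definitive answer: on AArch64 the 64-bit view of a vector register is simply the low half of the 128-bit register, and the UMULL instruction reads the low halves of its sources while UMULL2 reads the high halves. Because of that, vget_low_u16() normally costs nothing, and GCC and Clang typically compile vmull_u16(vget_low_u16(a), vget_low_u16(b)) to a single UMULL. The wrapper name mull_u16x8 and the scalar fallback for non-Arm hosts are assumptions added for illustration.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: multiplies eight uint16_t pairs from a and b
   and stores the eight full 32-bit products in out. */
void mull_u16x8(const uint16_t *a, const uint16_t *b, uint32_t *out)
{
#if defined(__aarch64__)
    uint16x8_t va = vld1q_u16(a);
    uint16x8_t vb = vld1q_u16(b);

    /* Lanes 0..3: vget_low_u16 only selects the low 64 bits of the
       register, so no separate move is expected in the generated code. */
    uint32x4_t lo = vmull_u16(vget_low_u16(va), vget_low_u16(vb));

    /* Lanes 4..7: the AArch64-only intrinsic takes the full vectors. */
    uint32x4_t hi = vmull_high_u16(va, vb);

    vst1q_u32(out, lo);
    vst1q_u32(out + 4, hi);
#else
    /* Scalar fallback so the sketch also builds on non-Arm hosts. */
    for (int i = 0; i < 8; ++i)
        out[i] = (uint32_t)a[i] * (uint32_t)b[i];
#endif
}
```

It is still worth checking the disassembly from your own compiler and optimisation level to confirm that only UMULL and UMULL2 are emitted for the two multiplies.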

 

 
