One option would be to do what a compiler normally does for ARM: use VRECPE in the fixed-point domain to calculate 1/v1, etc., then multiply rather than divide, and then renormalize.
vrecpe.f32 d1, d5
vrecps.f32 d2, d1, d5
vmul.f32 d1, d1, d2
vrecps.f32 d2, d1, d5
vmul.f32 d5, d1, d2
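For anyone following along, here is a rough scalar C sketch of what that sequence computes. VRECPS returns 2 - a*b, so each vrecps/vmul pair is one Newton-Raphson step x = x * (2 - x*d). The function names are made up for illustration, and recpe_model is only an assumed stand-in that truncates 1/d to about 8 significant bits; it is not the real hardware estimate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for VRECPS: the Newton-Raphson correction factor 2 - a*b. */
float recps(float a, float b) { return 2.0f - a * b; }

/* Hypothetical stand-in for VRECPE: a deliberately coarse 1/d.
   Assumption: keeping only the top 7 stored mantissa bits roughly mimics
   a low-precision estimate. This is not the actual hardware lookup. */
float recpe_model(float d) {
    assert(d > 0.0f);
    float r = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF0000u;            /* sign, exponent, top 7 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Estimate plus two refinement steps, mirroring the five instructions. */
float recip_refined(float d) {
    float x = recpe_model(d);       /* vrecpe.f32 d1, d5            */
    x = x * recps(x, d);            /* vrecps.f32 + vmul.f32, step 1 */
    x = x * recps(x, d);            /* vrecps.f32 + vmul.f32, step 2 */
    return x;
}
```

Each step roughly squares the relative error, which is why two steps are enough to take an 8-bit estimate close to single precision in this model.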
vcvt.f32.u32 q0, q0
vrecpe.f32 q0, q0
vmul.f32 q0, q0, q1 @ q1 = 65536
vcvt.u32.f32 q0, q0
Glad that's working out for you. Out of curiosity, does this work?
vcvt.f32.u32 q0, q0
vrecpe.f32 q0, q0
vcvt.u32.f32 q0, q0, #16
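In case it helps to see the arithmetic, this is a scalar C sketch of one lane of the 65536/x sequences above, plus the "multiply rather than divide, then renormalize" step. The names recip_q16 and div_via_recip are hypothetical, and the exact 1.0f / f here only stands in for vrecpe.f32, which gives a much rougher estimate on real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Approximately 65536 / x, i.e. 1/x as unsigned 16.16 fixed point. */
uint32_t recip_q16(uint32_t x) {
    assert(x != 0);
    float f = (float)x;                 /* vcvt.f32.u32 q0, q0 */
    float r = 1.0f / f;                 /* stand-in for vrecpe.f32 (assumption: exact here) */
    /* Scale by 2^16 and truncate toward zero: the vmul by 65536 followed by
       vcvt.u32.f32, or the single vcvt.u32.f32 with #16 fraction bits. */
    return (uint32_t)(r * 65536.0f);
}

/* n / d done as a multiply by the reciprocal, then a shift to renormalize. */
uint32_t div_via_recip(uint32_t n, uint32_t d) {
    return (uint32_t)(((uint64_t)n * recip_q16(d)) >> 16);
}
```

Because the reciprocal is truncated, the quotient can come out one too low: in this model div_via_recip(9, 3) gives 2, not 3. Expect to need some rounding or a correction step if you want exact results.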
// 8x16-bit signed inputs are in q0
// Elements in q1 are 0xFFFF for negative values, 0x0000 for positive (or zero) values
vclt.s16 q1, q0, #0
// Make negative values positive
vabs.s16 q0, q0
// ... Division performed here, results in q0 ...
// Negate values that were negative. This is done by observing that neg(x) = not(x) + 1.
// For values that were negative the field in q1 was 0xFFFF, therefore we get ((x ^ 0xFFFF) - 0xFFFF) which is not(x) + 1.
// For values that were positive the field in q1 was 0x0000, therefore we get (x ^ 0x0000) - 0x0000 which is just x.
// If you can, put some other operation between these two instructions to avoid a stall.
veor.s16 q0, q0, q1
vsub.s16 q0, q0, q1
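The same sign handling for a single 16-bit lane, written as a scalar C sketch. The function name and the divisor parameter d are hypothetical, and the plain integer divide is only a placeholder for whatever unsigned division you drop in the middle.

```c
#include <assert.h>
#include <stdint.h>

/* One lane: signed division built on top of an unsigned divide. */
int16_t signed_div_lane(int16_t x, uint16_t d) {
    /* Assumption: INT16_MIN is excluded, since its magnitude does not fit in s16. */
    assert(d != 0 && x != INT16_MIN);
    uint16_t mask = (x < 0) ? 0xFFFFu : 0x0000u;   /* vclt.s16 q1, q0, #0 */
    uint16_t mag  = (uint16_t)(x < 0 ? -x : x);    /* vabs.s16 q0, q0     */
    uint16_t q    = (uint16_t)(mag / d);           /* placeholder for the division step */
    q = (uint16_t)(q ^ mask);                      /* veor: not(q) if x was negative */
    q = (uint16_t)(q - mask);                      /* vsub: minus 0xFFFF is +1 mod 2^16 */
    /* Conversion back to signed assumes the usual two's complement behaviour. */
    return (int16_t)q;
}
```

For example, x = -100 and d = 7 gives a magnitude quotient of 14, which becomes 0xFFF1 after the XOR and 0xFFF2 (that is, -14) after the subtract.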