One option would be to do what a compiler normally does for ARM: use the fixed-point-domain VRECPE to calculate 1/v1, etc., then multiply rather than divide, and renormalize the result.
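vrecpe.f32 d1, d5        @ d1 = rough estimate of 1/d5
vrecps.f32 d2, d1, d5    @ d2 = 2 - d1*d5 (Newton-Raphson correction factor)
vmul.f32   d1, d1, d2    @ d1 = refined estimate
vrecps.f32 d2, d1, d5    @ second correction factor
vmul.f32   d5, d1, d2    @ d5 = 1/d5 after two refinement steps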