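One option would be to do what a compiler normally does for ARM: use VRECPE in the fixed-point domain to calculate 1/v1, etc., then multiply rather than divide, and then renormalize. The floating-point form of the same idea takes the VRECPE estimate and refines it with two VRECPS Newton-Raphson steps: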
    vrecpe.f32 d1, d5
    vrecps.f32 d2, d1, d5
    vmul.f32   d1, d1, d2
    vrecps.f32 d2, d1, d5
    vmul.f32   d5, d1, d2
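For anyone who wants to check the arithmetic off-target, here is a minimal scalar sketch in portable C of what that sequence computes. It is not NEON code. The function names are hypothetical, and recip_estimate() is only a stand-in for VRECPE: it uses a simple linear fit, which is coarser than the hardware estimate, so after two steps it is accurate to roughly four or five decimal digits rather than full single precision. It also assumes the divisor is positive and finite.

```c
#include <math.h>

/* Hypothetical stand-in for VRECPE: a coarse estimate of 1/d.
 * Assumes d > 0 and finite. Not the hardware lookup table. */
static float recip_estimate(float d)
{
    int e;
    float m = frexpf(d, &e);                        /* d = m * 2^e, 0.5 <= m < 1 */
    float x = 48.0f / 17.0f - (32.0f / 17.0f) * m;  /* linear fit to 1/m */
    return ldexpf(x, -e);                           /* undo the exponent scaling */
}

/* Models what VRECPS returns: the Newton-Raphson correction factor 2 - a*b. */
static float recip_step(float a, float b)
{
    return 2.0f - a * b;
}

/* Same shape as the five-instruction sequence: one estimate, two refinements. */
static float recip_newton(float d)
{
    float x = recip_estimate(d);   /* vrecpe        */
    x = x * recip_step(x, d);      /* vrecps + vmul */
    x = x * recip_step(x, d);      /* vrecps + vmul */
    return x;
}
```

A division n / v1 then becomes n * recip_newton(v1). Each refinement roughly squares the relative error of the estimate, which is why two steps are normally enough when starting from the hardware VRECPE result.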