Hello,
The following code computes the magnitude, sqrt(i^2 + q^2), of each element of a complex float vector. It runs on a Cortex-A53.
I compiled the code with: -O2 -mcpu=cortex-a53
void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
{
    float *pDst = (float*)pOut;
    float32x4_t Res;
    float32x4x2_t Vec;
    ComplexFloat *pSrc = pIn;

    // Loop on all input elements, 4 complex values per iteration
    for (uint32_t n = 0; n < N >> 2; n++)
    {
        // De-interleave: val[0] holds the I parts, val[1] the Q parts
        Vec = vld2q_f32((float*)pSrc);
        Res = vdupq_n_f32(0);
        Res = vmlaq_f32(Res, Vec.val[0], Vec.val[0]);
        Res = vmlaq_f32(Res, Vec.val[1], Vec.val[1]);
        Res = vsqrtq_f32(Res);
        vst1q_f32((float*)pDst, Res);
        pDst += 4;
        pSrc += 4;
    }
}
I am trying to improve the performance by unrolling the loop. Each iteration now processes 4 vectors (float32 x 4) instead of one.
void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
{
    float *pDst = (float*)pOut;
    float32x4_t Res0, Res1, Res2, Res3;
    float32x4x2_t Vec0, Vec1, Vec2, Vec3;
    ComplexFloat *pSrc = pIn;

    // Loop on all input elements, 16 complex values per iteration
    for (uint32_t n = 0; n < N >> 4; n++)
    {
        Vec0 = vld2q_f32((float*)pSrc);
        //DisplayVectorFloat32x4(Vec0.val[0]);
        Res0 = vmulq_f32(Vec0.val[0], Vec0.val[0]);
        Res0 = vmlaq_f32(Res0, Vec0.val[1], Vec0.val[1]);
        Res0 = vsqrtq_f32(Res0);
        pSrc += 4;

        Vec1 = vld2q_f32((float*)pSrc);
        //DisplayVectorFloat32x4(Vec1.val[0]);
        vst1q_f32((float*)pDst, Res0);
        pDst += 4;
        Res1 = vmulq_f32(Vec1.val[0], Vec1.val[0]);
        Res1 = vmlaq_f32(Res1, Vec1.val[1], Vec1.val[1]);
        Res1 = vsqrtq_f32(Res1);
        pSrc += 4;

        Vec2 = vld2q_f32((float*)pSrc);
        //DisplayVectorFloat32x4(Vec2.val[0]);
        vst1q_f32((float*)pDst, Res1);
        pDst += 4;
        Res2 = vmulq_f32(Vec2.val[0], Vec2.val[0]);
        Res2 = vmlaq_f32(Res2, Vec2.val[1], Vec2.val[1]);
        Res2 = vsqrtq_f32(Res2);
        pSrc += 4;

        Vec3 = vld2q_f32((float*)pSrc);
        //DisplayVectorFloat32x4(Vec3.val[0]);
        vst1q_f32((float*)pDst, Res2);
        pDst += 4;
        Res3 = vmulq_f32(Vec3.val[0], Vec3.val[0]);
        Res3 = vmlaq_f32(Res3, Vec3.val[1], Vec3.val[1]);
        Res3 = vsqrtq_f32(Res3);
        pSrc += 4;

        vst1q_f32((float*)pDst, Res3);
        pDst += 4;
    }
}
The unrolled code seems to run about 15% faster.
Can I improve it further by interleaving the load, compute, and store steps?
Thank you,
Zvika
I also tried unrolling by 8 (instead of 4 as in the second version). It did not run any faster.