Hello,
The following code computes the magnitude, sqrt(i^2 + q^2), of each element of a complex float vector. It runs on a Cortex-A53.
I compiled the code with: -O2 -mcpu=cortex-a53
void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
{
    float *pDst = (float*)pOut;
    float32x4_t Res;
    float32x4x2_t Vec;
    ComplexFloat *pSrc = pIn;

    //Loop on all input elements (4 complex values per iteration)
    for (int n = 0; n < N >> 2; n++)
    {
        //De-interleaving load: val[0] = I parts, val[1] = Q parts
        Vec = vld2q_f32((float*)pSrc);
        Res = vdupq_n_f32(0);
        Res = vmlaq_f32(Res, Vec.val[0], Vec.val[0]);
        Res = vmlaq_f32(Res, Vec.val[1], Vec.val[1]);
        Res = vsqrtq_f32(Res);
        vst1q_f32((float*)pDst, Res);
        pDst += 4;
        pSrc += 4;
    }
}
I am trying to improve the performance by unrolling the loop. Each iteration now processes four float32x4 vectors (16 complex elements):
void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
{
    float *pDst = (float*)pOut;
    float32x4_t Res0, Res1, Res2, Res3;
    float32x4x2_t Vec0, Vec1, Vec2, Vec3;
    ComplexFloat *pSrc = pIn;

    //Loop on all input elements
    for (int n = 0; n < N >> 4; n++)
    {
        Vec0 = vld2q_f32((float*)pSrc);
        //DisplayVectorFloat32x4(Vec0.val[0]);
        Res0 = vmulq_f32 (Vec0.val[0], Vec0.val[0]);
        Res0 = vmlaq_f32(Res0, Vec0.val[1], Vec0.val[1]);
        Res0 = vsqrtq_f32(Res0);
        pSrc += 4;

        Vec1 = vld2q_f32((float*)pSrc);
        //DisplayVectorFloat32x4(Vec1.val[0]);
        vst1q_f32((float*)pDst, Res0);
        pDst += 4;
        Res1 = vmulq_f32(Vec1.val[0], Vec1.val[0]);
        Res1 = vmlaq_f32(Res1, Vec1.val[1], Vec1.val[1]);
        Res1 = vsqrtq_f32(Res1);
        pSrc += 4;

        Vec2 = vld2q_f32((float*)pSrc);
        //DisplayVectorFloat32x4(Vec2.val[0]);
        vst1q_f32((float*)pDst, Res1);
        pDst += 4;
        Res2 = vmulq_f32(Vec2.val[0], Vec2.val[0]);
        Res2 = vmlaq_f32(Res2, Vec2.val[1], Vec2.val[1]);
        Res2 = vsqrtq_f32(Res2);
        pSrc += 4;

        Vec3 = vld2q_f32((float*)pSrc);
        //DisplayVectorFloat32x4(Vec3.val[0]);
        vst1q_f32((float*)pDst, Res2);
        pDst += 4;
        Res3 = vmulq_f32(Vec3.val[0], Vec3.val[0]);
        Res3 = vmlaq_f32(Res3, Vec3.val[1], Vec3.val[1]);
        Res3 = vsqrtq_f32(Res3);
        pSrc += 4;

        vst1q_f32((float*)pDst, Res3);
        pDst += 4;
    }
}
The unrolled code seems to run 15% faster.
Can I improve it further by reordering the load, calculate and store steps?
Thank you,
Zvika