Hello,
The following code computes the magnitude, sqrt(i^2 + q^2), of each element of a complex float vector. It runs on a Cortex-A53.
I compiled the code with: -O2 -mcpu=cortex-a53
    void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
    {
        float *pDst = (float*)pOut;
        float32x4_t Res;
        float32x4x2_t Vec;
        ComplexFloat *pSrc = pIn;

        // Loop on all input elements, 4 complex values per iteration
        for (uint32_t n = 0; n < N >> 2; n++)
        {
            // De-interleave: val[0] holds the real parts, val[1] the imaginary parts
            Vec = vld2q_f32((float*)pSrc);
            Res = vdupq_n_f32(0);
            Res = vmlaq_f32(Res, Vec.val[0], Vec.val[0]);
            Res = vmlaq_f32(Res, Vec.val[1], Vec.val[1]);
            Res = vsqrtq_f32(Res);
            vst1q_f32((float*)pDst, Res);
            pDst += 4;
            pSrc += 4;
        }
    }
I am trying to improve the performance by unrolling the loop. Each iteration now processes 4 vectors (float32 x 4), i.e. 16 complex elements.
    void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
    {
        float *pDst = (float*)pOut;
        float32x4_t Res0, Res1, Res2, Res3;
        float32x4x2_t Vec0, Vec1, Vec2, Vec3;
        ComplexFloat *pSrc = pIn;

        // Loop on all input elements, 16 complex values per iteration
        for (uint32_t n = 0; n < N >> 4; n++)
        {
            Vec0 = vld2q_f32((float*)pSrc);
            //DisplayVectorFloat32x4(Vec0.val[0]);
            Res0 = vmulq_f32(Vec0.val[0], Vec0.val[0]);
            Res0 = vmlaq_f32(Res0, Vec0.val[1], Vec0.val[1]);
            Res0 = vsqrtq_f32(Res0);
            pSrc += 4;

            Vec1 = vld2q_f32((float*)pSrc);
            //DisplayVectorFloat32x4(Vec1.val[0]);
            vst1q_f32((float*)pDst, Res0);
            pDst += 4;
            Res1 = vmulq_f32(Vec1.val[0], Vec1.val[0]);
            Res1 = vmlaq_f32(Res1, Vec1.val[1], Vec1.val[1]);
            Res1 = vsqrtq_f32(Res1);
            pSrc += 4;

            Vec2 = vld2q_f32((float*)pSrc);
            //DisplayVectorFloat32x4(Vec2.val[0]);
            vst1q_f32((float*)pDst, Res1);
            pDst += 4;
            Res2 = vmulq_f32(Vec2.val[0], Vec2.val[0]);
            Res2 = vmlaq_f32(Res2, Vec2.val[1], Vec2.val[1]);
            Res2 = vsqrtq_f32(Res2);
            pSrc += 4;

            Vec3 = vld2q_f32((float*)pSrc);
            //DisplayVectorFloat32x4(Vec3.val[0]);
            vst1q_f32((float*)pDst, Res2);
            pDst += 4;
            Res3 = vmulq_f32(Vec3.val[0], Vec3.val[0]);
            Res3 = vmlaq_f32(Res3, Vec3.val[1], Vec3.val[1]);
            Res3 = vsqrtq_f32(Res3);
            pSrc += 4;

            vst1q_f32((float*)pDst, Res3);
            pDst += 4;
        }
    }
The unrolled code appears to run about 15% faster.
Can I improve it further by changing the order of the load, compute, and store operations?
Thank you,
Zvika
I tried compiling the original code with -O3 -Otime (https://developer.arm.com/documentation/dui0472/g/compiler-coding-practices/loop-unrolling-in-c-code).
The second, unrolled version runs ~15% faster.
I also tried unrolling by 8 (instead of 4 as in the second version). It did not run any faster.