Unrolling a loop

Hello,

The following code calculates the magnitude, abs = sqrt(i^2 + q^2), of each element of a complex float vector. It runs on a Cortex-A53.

I compiled the code with: -O2 -mcpu=cortex-a53

void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
{
	float *pDst = (float*)pOut;
	float32x4_t Res;
	float32x4x2_t Vec;
	ComplexFloat *pSrc = pIn;

	// Process 4 complex elements per iteration (N is assumed to be a multiple of 4)
	for (uint32_t n = 0; n < (N >> 2); n++)
	{
		Vec = vld2q_f32((float*)pSrc);
		Res = vdupq_n_f32(0);

		Res = vmlaq_f32(Res, Vec.val[0], Vec.val[0]);
		Res = vmlaq_f32(Res, Vec.val[1], Vec.val[1]);
		Res = vsqrtq_f32(Res);

		vst1q_f32((float*)pDst, Res);
		pDst += 4;
		pSrc += 4;
	}
}

I am trying to improve the performance by unrolling the loop. Each iteration now works on 4 vectors (float32 x 4), i.e. 16 complex elements:

void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
{
	float *pDst = (float*)pOut;
	float32x4_t Res0, Res1, Res2, Res3;
	float32x4x2_t Vec0, Vec1, Vec2, Vec3;
	ComplexFloat *pSrc = pIn;

	// Process 16 complex elements per iteration (N is assumed to be a multiple of 16)
	for (uint32_t n = 0; n < (N >> 4); n++)
	{
		Vec0 = vld2q_f32((float*)pSrc);
		//DisplayVectorFloat32x4(Vec0.val[0]);
		Res0 = vmulq_f32 (Vec0.val[0], Vec0.val[0]);
		Res0 = vmlaq_f32(Res0, Vec0.val[1], Vec0.val[1]);
		Res0 = vsqrtq_f32(Res0);
		pSrc += 4;

		Vec1 = vld2q_f32((float*)pSrc);
		//DisplayVectorFloat32x4(Vec1.val[0]);
		vst1q_f32((float*)pDst, Res0);
		pDst += 4;

		Res1 = vmulq_f32(Vec1.val[0], Vec1.val[0]);
		Res1 = vmlaq_f32(Res1, Vec1.val[1], Vec1.val[1]);
		Res1 = vsqrtq_f32(Res1);
		pSrc += 4;

		Vec2 = vld2q_f32((float*)pSrc);
		//DisplayVectorFloat32x4(Vec2.val[0]);
		vst1q_f32((float*)pDst, Res1);
		pDst += 4;

		Res2 = vmulq_f32(Vec2.val[0], Vec2.val[0]);
		Res2 = vmlaq_f32(Res2, Vec2.val[1], Vec2.val[1]);
		Res2 = vsqrtq_f32(Res2);
		pSrc += 4;

		Vec3 = vld2q_f32((float*)pSrc);
		//DisplayVectorFloat32x4(Vec3.val[0]);
		vst1q_f32((float*)pDst, Res2);
		pDst += 4;

		Res3 = vmulq_f32(Vec3.val[0], Vec3.val[0]);
		Res3 = vmlaq_f32(Res3, Vec3.val[1], Vec3.val[1]);
		Res3 = vsqrtq_f32(Res3);
		pSrc += 4;

		vst1q_f32((float*)pDst, Res3);
		pDst += 4;
	}
}

The unrolled code seems to run 15% faster.

Can I improve it further by reordering the load, calculate, and store steps?

Thank you,

Zvika