Unrolling a loop

Hello,

The following code calculates the magnitude, sqrt(i^2 + q^2), of each element of a complex float vector. It runs on a Cortex-A53.

I compiled the code with: -O2 -mcpu=cortex-a53

void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
{
	float *pDst = (float*)pOut;
	float32x4_t Res;
	float32x4x2_t Vec;
	ComplexFloat *pSrc = pIn;

	// Loop over all input elements, 4 complex samples per iteration
	for (uint32_t n = 0; n < (N >> 2); n++)
	{
		Vec = vld2q_f32((float*)pSrc);
		Res = vdupq_n_f32(0);

		Res = vmlaq_f32(Res, Vec.val[0], Vec.val[0]);
		Res = vmlaq_f32(Res, Vec.val[1], Vec.val[1]);
		Res = vsqrtq_f32(Res);

		vst1q_f32((float*)pDst, Res);
		pDst += 4;
		pSrc += 4;
	}
}

I am trying to improve the performance by unrolling the loop. Each iteration now processes four vectors (float32 x 4), i.e. 16 complex samples.

void Abs(ComplexFloat *pIn, float *pOut, uint32_t N)
{
	float *pDst = (float*)pOut;
	float32x4_t Res0, Res1, Res2, Res3;
	float32x4x2_t Vec0, Vec1, Vec2, Vec3;
	ComplexFloat *pSrc = pIn;

	// Loop over all input elements, 16 complex samples per iteration
	for (uint32_t n = 0; n < (N >> 4); n++)
	{
		Vec0 = vld2q_f32((float*)pSrc);
		//DisplayVectorFloat32x4(Vec0.val[0]);
		Res0 = vmulq_f32 (Vec0.val[0], Vec0.val[0]);
		Res0 = vmlaq_f32(Res0, Vec0.val[1], Vec0.val[1]);
		Res0 = vsqrtq_f32(Res0);
		pSrc += 4;

		Vec1 = vld2q_f32((float*)pSrc);
		//DisplayVectorFloat32x4(Vec1.val[0]);
		vst1q_f32((float*)pDst, Res0);
		pDst += 4;

		Res1 = vmulq_f32(Vec1.val[0], Vec1.val[0]);
		Res1 = vmlaq_f32(Res1, Vec1.val[1], Vec1.val[1]);
		Res1 = vsqrtq_f32(Res1);
		pSrc += 4;

		Vec2 = vld2q_f32((float*)pSrc);
		//DisplayVectorFloat32x4(Vec2.val[0]);
		vst1q_f32((float*)pDst, Res1);
		pDst += 4;

		Res2 = vmulq_f32(Vec2.val[0], Vec2.val[0]);
		Res2 = vmlaq_f32(Res2, Vec2.val[1], Vec2.val[1]);
		Res2 = vsqrtq_f32(Res2);
		pSrc += 4;

		Vec3 = vld2q_f32((float*)pSrc);
		//DisplayVectorFloat32x4(Vec3.val[0]);
		vst1q_f32((float*)pDst, Res2);
		pDst += 4;

		Res3 = vmulq_f32(Vec3.val[0], Vec3.val[0]);
		Res3 = vmlaq_f32(Res3, Vec3.val[1], Vec3.val[1]);
		Res3 = vsqrtq_f32(Res3);
		pSrc += 4;

		vst1q_f32((float*)pDst, Res3);
		pDst += 4;
	}
}

The unrolled code seems to run about 15% faster.

Can I improve it further by interleaving the load, calculation, and store steps?

Thank you,

Zvika 
