I have the following code in assembler:
This stores s20, s21, s22 into pOutVertex which is an array of 3 floats. Is there an intrinsic to do this? I can only find vst1q_f32, but that would overwrite the 4th value in pOutVertex.
Thanks for the answer, Simon.
It's rather unfortunate that it takes 4 instructions, but I suppose there's no other way.
A big part of writing effective DSP/NEON type code is getting the data flow right, so you don't need to move or copy data around. The actual store here is only a single cycle, provided it's possible to work the data motion into the algorithm so it isn't an extra step bolted on at the end of an existing assembler routine.
Well, I can't see how, although I am no expert in this. You're welcome to try if you want.
This is the code that I have so far:
[code]
inline void Matrix::TransformPoint(const float* pInVertex, float weight, float* pOutVertex) const
{
#ifdef USE_NEONX
float32x4_t matrixRow1 = vld1q_f32(m);
float32x4_t matrixRow2 = vld1q_f32(&m[4]);
float32x4_t matrixRow3 = vld1q_f32(&m[8]);
float32x4_t matrixRow4 = vld1q_f32(&m[12]);
float32x4_t out1 = vmulq_n_f32(matrixRow1, pInVertex[0]); // out1 = matrixRow1 * pInVertex1[0];
out1 = vmlaq_n_f32(out1, matrixRow2, pInVertex[1]); // out1 += matrixRow2 * pInVertex1[1];
out1 = vmlaq_n_f32(out1, matrixRow3, pInVertex[2]); // out1 += matrixRow3 * pInVertex1[2];
float values[4];
vst1q_f32(values, out1);
// then add 3 of values[4] to pOutVertex1
// note: I gave up and used c++ here.
pOutVertex[0] += values[0];
pOutVertex[1] += values[1];
pOutVertex[2] += values[2];
#else
pOutVertex[0] += weight*(pInVertex[0]*m[0] + pInVertex[1]*m[4] + pInVertex[2]*m[8] + m[12]);
pOutVertex[1] += weight*(pInVertex[0]*m[1] + pInVertex[1]*m[5] + pInVertex[2]*m[9] + m[13]);
pOutVertex[2] += weight*(pInVertex[0]*m[2] + pInVertex[1]*m[6] + pInVertex[2]*m[10] + m[14]);
#endif
}
[/code]
Things which are not a power of two or a full register are a pain in NEON, but if you are willing to sacrifice a little storage space the obvious data layout change would be to allocate vec4() inputs and outputs. The final increment then becomes a vec4 load of pOutVertex, a vec4 addition, and a vec4 store to write the incremented value of pOutVertex.