I have the following code in assembler:
This stores s20, s21, s22 into pOutVertex which is an array of 3 floats. Is there an intrinsic to do this? I can only find vst1q_f32, but that would overwrite the 4th value in pOutVertex.
Lefty,
You are likely looking for "vst3_lane_f32()", however, this stores a single element from each of three registers, i.e. the values you wish to store would have to be in a single lane of a float32x2x3_t.
As a bad example for illustration only:
#include <arm_neon.h> void store_three_floats(float a, float b, float c, float *dst) { float32x2x3_t vec; // Declare trio of vectors vec.val[0] = vset_lane_f32(a, vec.val[0], 0); // Set lowest lane in vector 0 vec.val[1] = vset_lane_f32(b, vec.val[1], 0); // Set lowest lane in vector 1 vec.val[2] = vset_lane_f32(c, vec.val[2], 0); // Set lowest lane in vector 2 vst3_lane_f32(dst, vec, 0); // Store lowest element from each of the trio }
In the general case, the three values would already be in independent vectors (e.g. R, G, B), and thus only the vst3 would be required without the lane insertions.
hth
Simon.
Thanks for the answer, Simon.
It's rather unfortunate that it takes 4 instructions, but I suppose there's no other way.
A big part of writing effective DSP/NEON type code is getting the data flow right, so you don't need to move or copy data around. The actual store here is only a single cycle, provided it's possible to work the data motion into the algorithm so it isn't an extra step bolted on at the end of an existing assembler routine.
Well, I can't see how, although I am no expert in this. You're welcome to try if you want.
This is the code that I have so far:
[code]
inline void Matrix::TransformPoint(const float* pInVertex, float weight, float* pOutVertex) const
{
#ifdef USE_NEONX
float32x4_t matrixRow1 = vld1q_f32(m);
float32x4_t matrixRow2 = vld1q_f32(&m[4]);
float32x4_t matrixRow3 = vld1q_f32(&m[8]);
float32x4_t matrixRow4 = vld1q_f32(&m[12]);
float32x4_t out1 = vmulq_n_f32(matrixRow1, pInVertex[0]); // out1 = matrixRow1 * pInVertex1[0];
out1 = vmlaq_n_f32(out1, matrixRow2, pInVertex[1]); // out1 += matrixRow2 * pInVertex1[1];
out1 = vmlaq_n_f32(out1, matrixRow3, pInVertex[2]); // out1 += matrixRow3 * pInVertex1[2];
float values[4];
vst1q_f32(values, out1);
// then add 3 of values[4] to pOutVertex1
// note: I gave up and used c++ here.
pOutVertex[0] += values[0];
pOutVertex[1] += values[1];
pOutVertex[2] += values[2];
#else
pOutVertex[0] += weight*(pInVertex[0]*m[0] + pInVertex[1]*m[4] + pInVertex[2]*m[8] + m[12]);
pOutVertex[1] += weight*(pInVertex[0]*m[1] + pInVertex[1]*m[5] + pInVertex[2]*m[9] + m[13]);
pOutVertex[2] += weight*(pInVertex[0]*m[2] + pInVertex[1]*m[6] + pInVertex[2]*m[10] + m[14]);
#endif
}
[/code]
Things which are not a power of two or a full register are a pain in NEON, but if you are willing to sacrifice a little storage space the obvious data layout change would be to allocate vec4() inputs and outputs. The final increment then becomes a vec4 load of pOutVertex, a vec4 addition, and a vec4 store to write the incremented value of pOutVertex.