This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Is there an intrinsic to store 3 float values?

I have the following code in assembler:

	vst1.32	{d10}, [%[pOutVertex2]]
	fsts	s22, [%[pOutVertex2], #8]

This stores s20, s21, s22 into pOutVertex which is an array of 3 floats. Is there an intrinsic to do this? I can only find vst1q_f32, but that would overwrite the 4th value in pOutVertex.

Parents

+1 Simon Craske over 8 years ago

Lefty,

You are likely looking for "vst3_lane_f32()", however, this stores a single element from each of three registers, i.e. the values you wish to store would have to be in a single lane of a float32x2x3_t.

As a bad example for illustration only:

#include <arm_neon.h>

void store_three_floats(float a, float b, float c, float *dst)
{
  float32x2x3_t vec;                            // Declare trio of vectors

  vec.val[0] = vset_lane_f32(a, vec.val[0], 0); // Set lowest lane in vector 0
  vec.val[1] = vset_lane_f32(b, vec.val[1], 0); // Set lowest lane in vector 1
  vec.val[2] = vset_lane_f32(c, vec.val[2], 0); // Set lowest lane in vector 2

  vst3_lane_f32(dst, vec, 0); // Store lowest element from each of the trio
}

In the general case, the three values would already be in independent vectors (e.g. R, G, B), and thus only the vst3 would be required without the lane insertions.

hth

Simon.

Reply

+1 Simon Craske over 8 years ago

Lefty,

As a bad example for illustration only:

#include <arm_neon.h>

void store_three_floats(float a, float b, float c, float *dst)
{
  float32x2x3_t vec;                            // Declare trio of vectors

  vec.val[0] = vset_lane_f32(a, vec.val[0], 0); // Set lowest lane in vector 0
  vec.val[1] = vset_lane_f32(b, vec.val[1], 0); // Set lowest lane in vector 1
  vec.val[2] = vset_lane_f32(c, vec.val[2], 0); // Set lowest lane in vector 2

  vst3_lane_f32(dst, vec, 0); // Store lowest element from each of the trio
}

In the general case, the three values would already be in independent vectors (e.g. R, G, B), and thus only the vst3 would be required without the lane insertions.

hth

Simon.

Children

0 lefty over 8 years ago in reply to Simon Craske

Thanks for the answer, Simon.
It's rather unfortunate that it takes 4 instructions, but I suppose there's no other way.
Cancel
Up 0 Down

Cancel
0 Peter Harris over 8 years ago in reply to lefty

A big part of writing effective DSP/NEON type code is getting the data flow right, so you don't need to move or copy data around. The actual store here is only a single cycle, provided it's possible to work the data motion into the algorithm so it isn't an extra step bolted on at the end of an existing assembler routine.
Cancel
Up 0 Down

Cancel
0 lefty over 8 years ago in reply to Peter Harris

Well, I can't see how, although I am no expert in this. You're welcome to try if you want.
This is the code that I have so far:
[code]
inline void Matrix::TransformPoint(const float* pInVertex, float weight, float* pOutVertex) const
{
#ifdef USE_NEONX
   float32x4_t matrixRow1 = vld1q_f32(m);
   float32x4_t matrixRow2 = vld1q_f32(&m[4]);
    float32x4_t matrixRow3 = vld1q_f32(&m[8]);
    float32x4_t matrixRow4 = vld1q_f32(&m[12]);
    float32x4_t out1 = vmulq_n_f32(matrixRow1, pInVertex[0]); //    out1 = matrixRow1 * pInVertex1[0];
    out1 = vmlaq_n_f32(out1, matrixRow2, pInVertex[1]); //    out1 += matrixRow2 * pInVertex1[1];
    out1 = vmlaq_n_f32(out1, matrixRow3, pInVertex[2]); //    out1 += matrixRow3 * pInVertex1[2];
    out1 = vaddq_f32(out1, matrixRow4); //    out1 += matrixRow4;
    out1 = vmulq_n_f32(out1, weight); //    out1 *= weight;
    float values[4];
    vst1q_f32(values, out1);
    // then add 3 of values[4] to pOutVertex1
     // note: I gave up and used c++ here.
    pOutVertex[0] += values[0];
    pOutVertex[1] += values[1];
    pOutVertex[2] += values[2];
#else
    pOutVertex[0] += weight*(pInVertex[0]*m[0] + pInVertex[1]*m[4] + pInVertex[2]*m[8] + m[12]);
    pOutVertex[1] += weight*(pInVertex[0]*m[1] + pInVertex[1]*m[5] + pInVertex[2]*m[9] + m[13]);
    pOutVertex[2] += weight*(pInVertex[0]*m[2] + pInVertex[1]*m[6] + pInVertex[2]*m[10] + m[14]);
#endif
}
[/code]
Cancel
Up 0 Down

Cancel
0 Peter Harris over 8 years ago in reply to lefty

Things which are not a power of two or a full register are a pain in NEON, but if you are willing to sacrifice a little storage space the obvious data layout change would be to allocate vec4() inputs and outputs. The final increment then becomes a vec4 load of pOutVertex, a vec4 addition, and a vec4 store to write the incremented value of pOutVertex.
Cancel
Up 0 Down

Cancel

out1 = vaddq_f32(out1, matrixRow4);				// out1 += matrixRow4;
out1 = vmulq_n_f32(out1, weight);				// out1 *= weight;