This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Is there an intrinsic to store 3 float values?

I have the following code in assembler:

    vst1.32            {d10}, [%[pOutVertex2]]          
   fsts               s22, [%[pOutVertex2], #8]             

This stores s20, s21, s22 into pOutVertex which is an array of 3 floats. Is there an intrinsic to do this? I can only find vst1q_f32, but that would overwrite the 4th value in pOutVertex.

  • Have you looked at vst3? There's two versions of it.

  • Hi,

    how about the below?

       0:   ec80aa03        vstmia  r0, {s20-s22}
    

    Best regards,

    Yasuhiko Koumoto.

  • Yes, but I am looking for the intrinsic version.

  • Well ... if the intrinsic vst1q_f32() is a wrapper for the vst1 32-bit NEON instruction, it would be reasonable to expect that the intrinsic for the vst3 instruction would probably be vst3q_f32() ... at which point Google searching should find the list pretty quickly.

    ARMCC:

    ARM Information Center

    GCC:

    ARM NEON Intrinsics - Using the GNU Compiler Collection (GCC)

    HTH,
    Pete

  • Lefty,

    You are likely looking for "vst3_lane_f32()", however, this stores a single element from each of three registers, i.e. the values you wish to store would have to be in a single lane of a float32x2x3_t.

    As a bad example for illustration only:

    #include <arm_neon.h>
    
    void store_three_floats(float a, float b, float c, float *dst)
    {
      float32x2x3_t vec;                            // Declare trio of vectors
    
      vec.val[0] = vset_lane_f32(a, vec.val[0], 0); // Set lowest lane in vector 0
      vec.val[1] = vset_lane_f32(b, vec.val[1], 0); // Set lowest lane in vector 1
      vec.val[2] = vset_lane_f32(c, vec.val[2], 0); // Set lowest lane in vector 2
    
      vst3_lane_f32(dst, vec, 0); // Store lowest element from each of the trio
    }
    

    In the general case, the three values would already be in independent vectors (e.g. R, G, B), and thus only the vst3 would be required without the lane insertions.

    hth

    Simon.

  • Thanks for the answer, Simon.

    It's rather unfortunate that it takes 4 instructions, but I suppose there's no other way.

  • A big part of writing effective DSP/NEON type code is getting the data flow right, so you don't need to move or copy data around. The actual store here is only a single cycle, provided it's possible to work the data motion into the algorithm so it isn't an extra step bolted on at the end of an existing assembler routine.

  • Well, I can't see how, although I am no expert in this. You're welcome to try if you want.

    This is the code that I have so far:

    [code]

    inline void Matrix::TransformPoint(const float* pInVertex, float weight, float* pOutVertex) const

    {

    #ifdef USE_NEONX

       float32x4_t matrixRow1 = vld1q_f32(m);

       float32x4_t matrixRow2 = vld1q_f32(&m[4]);

        float32x4_t matrixRow3 = vld1q_f32(&m[8]);

        float32x4_t matrixRow4 = vld1q_f32(&m[12]);

        float32x4_t out1 = vmulq_n_f32(matrixRow1, pInVertex[0]); //    out1 = matrixRow1 * pInVertex1[0];

        out1 = vmlaq_n_f32(out1, matrixRow2, pInVertex[1]); //    out1 += matrixRow2 * pInVertex1[1];

        out1 = vmlaq_n_f32(out1, matrixRow3, pInVertex[2]); //    out1 += matrixRow3 * pInVertex1[2];

        out1 = vaddq_f32(out1, matrixRow4);  //    out1 += matrixRow4;
        out1 = vmulq_n_f32(out1, weight);  //    out1 *= weight;

        float values[4];

        vst1q_f32(values, out1);

        // then add 3 of values[4] to pOutVertex1

         // note: I gave up and used c++ here.

        pOutVertex[0] += values[0];

        pOutVertex[1] += values[1];

        pOutVertex[2] += values[2];

    #else

        pOutVertex[0] += weight*(pInVertex[0]*m[0] + pInVertex[1]*m[4] + pInVertex[2]*m[8] + m[12]);

        pOutVertex[1] += weight*(pInVertex[0]*m[1] + pInVertex[1]*m[5] + pInVertex[2]*m[9] + m[13]);

        pOutVertex[2] += weight*(pInVertex[0]*m[2] + pInVertex[1]*m[6] + pInVertex[2]*m[10] + m[14]);

    #endif

    }

    [/code]

  • Things which are not a power of two or a full register are a pain in NEON, but if you are willing to sacrifice a little storage space the obvious data layout change would be to allocate vec4() inputs and outputs. The final increment then becomes a vec4 load of pOutVertex, a vec4 addition, and a vec4 store to write the incremented value of pOutVertex.