Is there an intrinsic to store 3 float values?

I have the following code in assembler:

    vst1.32            {d10}, [%[pOutVertex2]]
    fsts               s22, [%[pOutVertex2], #8]

This stores s20, s21 and s22 into pOutVertex2, which is an array of 3 floats. Is there an intrinsic to do this? I can only find vst1q_f32, but that stores all four lanes, so it would also overwrite the float that follows the third value.
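For reference, the two instructions above can be approximated with a pair of store intrinsics: a 64-bit store of the low half of the vector, followed by a single-lane store of lane 2. The following is only a sketch. `StoreXYZ` and `StoreXYZFrom4` are hypothetical helper names that do not appear in the original code, the scalar branch exists only so the snippet also builds on targets without NEON, and the compiler is free to emit different instructions from the hand-written assembler.

```cpp
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Hypothetical helper: writes lanes 0..2 of v to dst[0..2].
// dst[3] is never touched.
inline void StoreXYZ(float* dst, float32x4_t v)
{
    vst1_f32(dst, vget_low_f32(v));   // lanes 0 and 1 (the "vst1.32 {d10}" half)
    vst1q_lane_f32(dst + 2, v, 2);    // lane 2 only (the "fsts s22" half)
}
#endif

// Hypothetical portable wrapper: copies the first 3 of 4 floats.
inline void StoreXYZFrom4(float* dst, const float* src4)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    StoreXYZ(dst, vld1q_f32(src4));
#else
    // Scalar fallback, assumed only for non-NEON builds.
    dst[0] = src4[0];
    dst[1] = src4[1];
    dst[2] = src4[2];
#endif
}
```

Called as `StoreXYZFrom4(pOutVertex2, values)`, this writes three floats and leaves whatever follows them in memory unchanged.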

  • A big part of writing effective DSP/NEON-style code is getting the data flow right, so that you don't need to move or copy data around. The store itself is only a single cycle, provided the data motion can be worked into the algorithm rather than being an extra step bolted onto the end of an existing assembler routine.

  • Well, I can't see how, although I am no expert in this. You're welcome to try if you want.

    This is the code that I have so far:

    [code]
    inline void Matrix::TransformPoint(const float* pInVertex, float weight, float* pOutVertex) const
    {
    #ifdef USE_NEONX
        // Load the four rows of the matrix (m holds 16 floats).
        float32x4_t matrixRow1 = vld1q_f32(m);
        float32x4_t matrixRow2 = vld1q_f32(&m[4]);
        float32x4_t matrixRow3 = vld1q_f32(&m[8]);
        float32x4_t matrixRow4 = vld1q_f32(&m[12]);

        float32x4_t out1 = vmulq_n_f32(matrixRow1, pInVertex[0]);  // out1  = matrixRow1 * pInVertex[0];
        out1 = vmlaq_n_f32(out1, matrixRow2, pInVertex[1]);        // out1 += matrixRow2 * pInVertex[1];
        out1 = vmlaq_n_f32(out1, matrixRow3, pInVertex[2]);        // out1 += matrixRow3 * pInVertex[2];
        out1 = vaddq_f32(out1, matrixRow4);                        // out1 += matrixRow4;
        out1 = vmulq_n_f32(out1, weight);                          // out1 *= weight;

        float values[4];
        vst1q_f32(values, out1);

        // Then add the first 3 of the 4 values to pOutVertex.
        // Note: I gave up and used C++ here.
        pOutVertex[0] += values[0];
        pOutVertex[1] += values[1];
        pOutVertex[2] += values[2];
    #else
        pOutVertex[0] += weight*(pInVertex[0]*m[0] + pInVertex[1]*m[4] + pInVertex[2]*m[8] + m[12]);
        pOutVertex[1] += weight*(pInVertex[0]*m[1] + pInVertex[1]*m[5] + pInVertex[2]*m[9] + m[13]);
        pOutVertex[2] += weight*(pInVertex[0]*m[2] + pInVertex[1]*m[6] + pInVertex[2]*m[10] + m[14]);
    #endif
    }
    [/code]

  • Sizes that are not a power of two or a full register are a pain in NEON, but if you are willing to sacrifice a little storage space, the obvious data layout change is to allocate vec4 inputs and outputs. The final increment then becomes a vec4 load of pOutVertex, a vec4 addition, and a vec4 store to write back the incremented value.
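As a rough illustration of the vec4 layout suggested above, the sketch below pads the output to four floats so the accumulate becomes one load, one add and one store. `TransformPointAccumulate4` is a hypothetical free-function variant of `TransformPoint`, not code from the thread. It assumes `m` points to the same 16-float matrix, that `pOutVertex4` has room for four floats, and that callers ignore the fourth (padding) lane, which also gets accumulated into. The input is left as three scalars here because `vmulq_n_f32` takes a scalar operand. The scalar branch is only there so the snippet builds without NEON.

```cpp
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical variant: pOutVertex4 must point to 4 writable floats.
// Lane 3 is padding; it is updated too, so callers should not rely on it.
inline void TransformPointAccumulate4(const float* m, const float* pInVertex,
                                      float weight, float* pOutVertex4)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t r = vmulq_n_f32(vld1q_f32(m), pInVertex[0]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 4), pInVertex[1]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 8), pInVertex[2]);
    r = vaddq_f32(r, vld1q_f32(m + 12));
    r = vmulq_n_f32(r, weight);

    float32x4_t acc = vld1q_f32(pOutVertex4);  // vec4 load
    acc = vaddq_f32(acc, r);                   // vec4 addition
    vst1q_f32(pOutVertex4, acc);               // vec4 store
#else
    // Scalar fallback with the same arithmetic, including the padding lane.
    for (int i = 0; i < 4; ++i) {
        pOutVertex4[i] += weight * (pInVertex[0] * m[i] + pInVertex[1] * m[4 + i] +
                                    pInVertex[2] * m[8 + i] + m[12 + i]);
    }
#endif
}
```

The trade-off is the one described above: each vertex costs one extra float of storage, in exchange for dropping the temporary array and the three scalar additions.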