
Implementation in NEON of non-uniform address jumps

  • Note: This was originally posted on 28th June 2012 at http://forums.arm.com

    Assuming that "jump=8" is a constant, there is no benefit in performing the non-contiguous random loads. What you appear to be trying to compute is the sum of:

    src[ 0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,33,34,37]
    * c[ 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 6, 7, 7]
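
    In plain C, that target computation is just the following (a minimal reference sketch; sumfunc_ref, idx and coef are illustrative names, with the tables copied from the mapping above):

        #include <stdint.h>

        /* Reference implementation: sum of src[idx[i]] * c[coef[i]]. */
        static const uint8_t idx[32] = {
             0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,
            19,20,21,22,23,24,25,26,27,28,29,30,31,33,34,37 };
        static const uint8_t coef[32] = {
             0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3,
             4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 6, 7, 7 };

        int sumfunc_ref(const int *c, const int *src)
        {
            int sum = 0;
            for (int i = 0; i < 32; i++)
                sum += src[idx[i]] * c[coef[i]];
            return sum;
        }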


    The only memory locations you don't use are src[1,2,5,32,35,36] out of the array src[0..37], so loading them and ignoring them is likely faster than avoiding the loads.

    Using VLD3, you can automatically pair most of the locations that use the same coefficients, e.g. sum[0,3,6] all use coefficient[0], and a few VEXT/VTRN operations can correct the rest. After that you can perform multiplies and multiply-accumulates on pairs via each scalar coefficient, and then sum everything up at the end. The result is something like:

    int sumfunc (int *c, int *src):              

        // r0 = c, r1 = src
        VLD1.32  {d0,d1,d2,d3},[r0]  // c[{0,1},{2,3},{4,5},{6,7}]
        VLD3.32  {d4,d5,d6},[r1]!    // src[{ 0, 3},{ -, 4},{ -, -}]
        // d8-d15 left untouched to avoid the ABI-required preserve/restore
        VLD3.32  {d17,d18,d19},[r1]! // src[{ 6, 9},{ 7,10},{ 8,11}]
        VLD3.32  {d20,d21,d22},[r1]! // src[{12,15},{13,16},{14,17}]
        VLD3.32  {d23,d24,d25},[r1]! // src[{18,21},{19,22},{20,23}]
        VLD3.32  {d26,d27,d28},[r1]! // src[{24,27},{25,28},{26,29}]
        VLD3.32  {d29,d30,d31},[r1]! // src[{30,33},{31,34},{ -, -}]
        VLD1.32  {d7},[r1]           // src[{ -,37}]
        VEXT.8   d5,d5,d21,#4        // d5  = src[{ 4,13}]
        VEXT.8   d21,d21,d27,#4      // d21 = src[{16,25}]
        VTRN.32  d7,d27              // d27 = src[{37,28}]
        VMUL.I32 d4,d4,d0[0]         // src[{ 0, 3}] * c[0]
        VMUL.I32 d5,d5,d0[1]         // src[{ 4,13}] * c[1]
        VMUL.I32 d16,d30,d3[1]       // src[{31,34}] * c[7]
        VMUL.I32 d17,d17,d0[0]       // src[{ 6, 9}] * c[0]
        VMUL.I32 d18,d18,d0[1]       // src[{ 7,10}] * c[1]
        VMUL.I32 d19,d19,d1[0]       // src[{ 8,11}] * c[2]
        VMUL.I32 d20,d20,d1[1]       // src[{12,15}] * c[3]
        VMUL.I32 d21,d21,d2[0]       // src[{16,25}] * c[4]
        VMLA.I32 d4,d22,d1[0]        // += src[{14,17}] * c[2]
        VMLA.I32 d5,d23,d1[1]        // += src[{18,21}] * c[3]
        VMLA.I32 d16,d24,d2[0]       // += src[{19,22}] * c[4]
        VMLA.I32 d17,d25,d2[1]       // += src[{20,23}] * c[5]
        VMLA.I32 d18,d26,d3[0]       // += src[{24,27}] * c[6]
        VMLA.I32 d19,d27,d3[1]       // += src[{37,28}] * c[7]
        VMLA.I32 d20,d28,d2[1]       // += src[{26,29}] * c[5]
        VMLA.I32 d21,d29,d3[0]       // += src[{30,33}] * c[6]
        VADD.I32 q2,q2,q8            // Sum all values
        VADD.I32 q3,q9,q10
        VADD.I32 q0,q2,q3
        VADD.I32 d0,d0,d1
        VPADD.I32 d0,d0,d0           // Final sum in d0[0] (s0)
        VMOV.32  r0,d0[0]            // Move result to return value
        BX       lr                  // Return
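
    If you would rather stay in C, a rough equivalent using NEON intrinsics is sketched below (untested; sumfunc_neon and the variable names are illustrative assumptions, and the compiler's register allocation may differ from the hand-written version above):

        #include <arm_neon.h>

        int sumfunc_neon(const int32_t *c, const int32_t *src)
        {
            int32x2_t c01 = vld1_s32(c + 0);      /* c[{0,1}] */
            int32x2_t c23 = vld1_s32(c + 2);      /* c[{2,3}] */
            int32x2_t c45 = vld1_s32(c + 4);      /* c[{4,5}] */
            int32x2_t c67 = vld1_s32(c + 6);      /* c[{6,7}] */

            /* De-interleaving loads, as with VLD3 above. */
            int32x2x3_t a = vld3_s32(src +  0);   /* src[{0,3}],{1,4},{2,5}       */
            int32x2x3_t b = vld3_s32(src +  6);   /* src[{6,9}],{7,10},{8,11}     */
            int32x2x3_t d = vld3_s32(src + 12);   /* src[{12,15}],{13,16},{14,17} */
            int32x2x3_t e = vld3_s32(src + 18);   /* src[{18,21}],{19,22},{20,23} */
            int32x2x3_t f = vld3_s32(src + 24);   /* src[{24,27}],{25,28},{26,29} */
            int32x2x3_t g = vld3_s32(src + 30);   /* src[{30,33}],{31,34},{32,35} */
            int32x2_t   h = vld1_s32(src + 36);   /* src[{36,37}] */

            /* VEXT/VTRN fix-ups for the pairs VLD3 doesn't produce directly. */
            int32x2_t s4_13  = vext_s32(a.val[1], d.val[1], 1);   /* src[{ 4,13}] */
            int32x2_t s16_25 = vext_s32(d.val[1], f.val[1], 1);   /* src[{16,25}] */
            int32x2_t s37_28 = vtrn_s32(h, f.val[1]).val[1];      /* src[{37,28}] */

            /* Multiplies and multiply-accumulates by scalar coefficients. */
            int32x2_t m0 = vmul_lane_s32(a.val[0], c01, 0);  /* src[{ 0, 3}] * c[0] */
            int32x2_t m1 = vmul_lane_s32(s4_13,    c01, 1);  /* src[{ 4,13}] * c[1] */
            int32x2_t m2 = vmul_lane_s32(g.val[1], c67, 1);  /* src[{31,34}] * c[7] */
            int32x2_t m3 = vmul_lane_s32(b.val[0], c01, 0);  /* src[{ 6, 9}] * c[0] */
            int32x2_t m4 = vmul_lane_s32(b.val[1], c01, 1);  /* src[{ 7,10}] * c[1] */
            int32x2_t m5 = vmul_lane_s32(b.val[2], c23, 0);  /* src[{ 8,11}] * c[2] */
            int32x2_t m6 = vmul_lane_s32(d.val[0], c23, 1);  /* src[{12,15}] * c[3] */
            int32x2_t m7 = vmul_lane_s32(s16_25,   c45, 0);  /* src[{16,25}] * c[4] */
            m0 = vmla_lane_s32(m0, d.val[2], c23, 0);        /* += src[{14,17}] * c[2] */
            m1 = vmla_lane_s32(m1, e.val[0], c23, 1);        /* += src[{18,21}] * c[3] */
            m2 = vmla_lane_s32(m2, e.val[1], c45, 0);        /* += src[{19,22}] * c[4] */
            m3 = vmla_lane_s32(m3, e.val[2], c45, 1);        /* += src[{20,23}] * c[5] */
            m4 = vmla_lane_s32(m4, f.val[0], c67, 0);        /* += src[{24,27}] * c[6] */
            m5 = vmla_lane_s32(m5, s37_28,   c67, 1);        /* += src[{37,28}] * c[7] */
            m6 = vmla_lane_s32(m6, f.val[2], c45, 1);        /* += src[{26,29}] * c[5] */
            m7 = vmla_lane_s32(m7, g.val[0], c67, 0);        /* += src[{30,33}] * c[6] */

            /* Sum all values and extract the scalar result. */
            int32x2_t s = vadd_s32(vadd_s32(vadd_s32(m0, m1), vadd_s32(m2, m3)),
                                   vadd_s32(vadd_s32(m4, m5), vadd_s32(m6, m7)));
            s = vpadd_s32(s, s);
            return vget_lane_s32(s, 0);
        }

    As in the assembly, the only data movement beyond the de-interleaving loads is the two VEXT and one VTRN fix-ups.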


    hth
    s.