
Implementation in NEON of non-uniform address jumps

  • Note: This was originally posted on 28th June 2012 at http://forums.arm.com

    Assuming that "jump=8" is a constant, there is no benefit in performing the non-contiguous random loads. What you appear to be trying to compute is the sum of:

    src[ 0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,33,34,37]
    * c[ 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 6, 7, 7]
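
    In plain C, that target computation is just the following (a minimal reference sketch; sumfunc_ref, idx and coef are illustrative names, with the tables copied from the mapping above):

        #include <stdint.h>

        /* Reference implementation: sum of src[idx[i]] * c[coef[i]]. */
        static const uint8_t idx[32] = {
             0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,
            19,20,21,22,23,24,25,26,27,28,29,30,31,33,34,37 };
        static const uint8_t coef[32] = {
             0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3,
             4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 6, 7, 7 };

        int sumfunc_ref(const int *c, const int *src)
        {
            int sum = 0;
            for (int i = 0; i < 32; i++)
                sum += src[idx[i]] * c[coef[i]];
            return sum;
        }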


    The only memory locations you don't use are src[1,2,5,32,35,36] out of the array src[0..37], so loading them and ignoring them is likely faster than avoiding the loads.

    Using VLD3, you can automatically pair most of the locations that use the same coefficients, e.g. sum[0,3,6] all use coefficient[0], and a few VEXT/VTRN operations can correct the rest. After that you can perform multiplies and multiply-accumulates on pairs via each scalar coefficient, and then sum everything up at the end. The result is something like:

    int sumfunc (int *c, int *src):              

        // r0 = c, r1 = src
        VLD1.32  {d0,d1,d2,d3},[r0]  // c[{0,1},{2,3},{4,5},{6,7}]
        VLD3.32  {d4,d5,d6},[r1]!    // src[{ 0, 3},{ -, 4},{ -, -}]
        // d8-d15 left untouched to avoid the ABI-required preserve/restore
        VLD3.32  {d17,d18,d19},[r1]! // src[{ 6, 9},{ 7,10},{ 8,11}]
        VLD3.32  {d20,d21,d22},[r1]! // src[{12,15},{13,16},{14,17}]
        VLD3.32  {d23,d24,d25},[r1]! // src[{18,21},{19,22},{20,23}]
        VLD3.32  {d26,d27,d28},[r1]! // src[{24,27},{25,28},{26,29}]
        VLD3.32  {d29,d30,d31},[r1]! // src[{30,33},{31,34},{ -, -}]
        VLD1.32  {d7},[r1]           // src[{ -,37}]
        VEXT.8   d5,d5,d21,#4        // d5  = src[{ 4,13}]
        VEXT.8   d21,d21,d27,#4      // d21 = src[{16,25}]
        VTRN.32  d7,d27              // d27 = src[{37,28}]
        VMUL.I32 d4,d4,d0[0]         // src[{ 0, 3}] * c[0]
        VMUL.I32 d5,d5,d0[1]         // src[{ 4,13}] * c[1]
        VMUL.I32 d16,d30,d3[1]       // src[{31,34}] * c[7]
        VMUL.I32 d17,d17,d0[0]       // src[{ 6, 9}] * c[0]
        VMUL.I32 d18,d18,d0[1]       // src[{ 7,10}] * c[1]
        VMUL.I32 d19,d19,d1[0]       // src[{ 8,11}] * c[2]
        VMUL.I32 d20,d20,d1[1]       // src[{12,15}] * c[3]
        VMUL.I32 d21,d21,d2[0]       // src[{16,25}] * c[4]
        VMLA.I32 d4,d22,d1[0]        // += src[{14,17}] * c[2]
        VMLA.I32 d5,d23,d1[1]        // += src[{18,21}] * c[3]
        VMLA.I32 d16,d24,d2[0]       // += src[{19,22}] * c[4]
        VMLA.I32 d17,d25,d2[1]       // += src[{20,23}] * c[5]
        VMLA.I32 d18,d26,d3[0]       // += src[{24,27}] * c[6]
        VMLA.I32 d19,d27,d3[1]       // += src[{37,28}] * c[7]
        VMLA.I32 d20,d28,d2[1]       // += src[{26,29}] * c[5]
        VMLA.I32 d21,d29,d3[0]       // += src[{30,33}] * c[6]
        VADD.I32 q2,q2,q8            // Sum all values
        VADD.I32 q3,q9,q10
        VADD.I32 q0,q2,q3
        VADD.I32 d0,d0,d1
        VPADD.I32 d0,d0,d0           // Final sum in d0[0] (s0)
        VMOV.32  r0,d0[0]            // Move result to return value
        BX       lr                  // Return
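
    If you would rather stay in C, a rough equivalent using NEON intrinsics is sketched below (untested; sumfunc_neon and the variable names are illustrative assumptions, and the compiler's register allocation may differ from the hand-written version above):

        #include <arm_neon.h>

        int sumfunc_neon(const int32_t *c, const int32_t *src)
        {
            int32x2_t c01 = vld1_s32(c + 0);      /* c[{0,1}] */
            int32x2_t c23 = vld1_s32(c + 2);      /* c[{2,3}] */
            int32x2_t c45 = vld1_s32(c + 4);      /* c[{4,5}] */
            int32x2_t c67 = vld1_s32(c + 6);      /* c[{6,7}] */

            /* De-interleaving loads, as with VLD3 above. */
            int32x2x3_t a = vld3_s32(src +  0);   /* src[{0,3}],{1,4},{2,5}       */
            int32x2x3_t b = vld3_s32(src +  6);   /* src[{6,9}],{7,10},{8,11}     */
            int32x2x3_t d = vld3_s32(src + 12);   /* src[{12,15}],{13,16},{14,17} */
            int32x2x3_t e = vld3_s32(src + 18);   /* src[{18,21}],{19,22},{20,23} */
            int32x2x3_t f = vld3_s32(src + 24);   /* src[{24,27}],{25,28},{26,29} */
            int32x2x3_t g = vld3_s32(src + 30);   /* src[{30,33}],{31,34},{32,35} */
            int32x2_t   h = vld1_s32(src + 36);   /* src[{36,37}] */

            /* VEXT/VTRN fix-ups for the pairs VLD3 doesn't produce directly. */
            int32x2_t s4_13  = vext_s32(a.val[1], d.val[1], 1);   /* src[{ 4,13}] */
            int32x2_t s16_25 = vext_s32(d.val[1], f.val[1], 1);   /* src[{16,25}] */
            int32x2_t s37_28 = vtrn_s32(h, f.val[1]).val[1];      /* src[{37,28}] */

            /* Multiplies and multiply-accumulates by scalar coefficients. */
            int32x2_t m0 = vmul_lane_s32(a.val[0], c01, 0);  /* src[{ 0, 3}] * c[0] */
            int32x2_t m1 = vmul_lane_s32(s4_13,    c01, 1);  /* src[{ 4,13}] * c[1] */
            int32x2_t m2 = vmul_lane_s32(g.val[1], c67, 1);  /* src[{31,34}] * c[7] */
            int32x2_t m3 = vmul_lane_s32(b.val[0], c01, 0);  /* src[{ 6, 9}] * c[0] */
            int32x2_t m4 = vmul_lane_s32(b.val[1], c01, 1);  /* src[{ 7,10}] * c[1] */
            int32x2_t m5 = vmul_lane_s32(b.val[2], c23, 0);  /* src[{ 8,11}] * c[2] */
            int32x2_t m6 = vmul_lane_s32(d.val[0], c23, 1);  /* src[{12,15}] * c[3] */
            int32x2_t m7 = vmul_lane_s32(s16_25,   c45, 0);  /* src[{16,25}] * c[4] */
            m0 = vmla_lane_s32(m0, d.val[2], c23, 0);        /* += src[{14,17}] * c[2] */
            m1 = vmla_lane_s32(m1, e.val[0], c23, 1);        /* += src[{18,21}] * c[3] */
            m2 = vmla_lane_s32(m2, e.val[1], c45, 0);        /* += src[{19,22}] * c[4] */
            m3 = vmla_lane_s32(m3, e.val[2], c45, 1);        /* += src[{20,23}] * c[5] */
            m4 = vmla_lane_s32(m4, f.val[0], c67, 0);        /* += src[{24,27}] * c[6] */
            m5 = vmla_lane_s32(m5, s37_28,   c67, 1);        /* += src[{37,28}] * c[7] */
            m6 = vmla_lane_s32(m6, f.val[2], c45, 1);        /* += src[{26,29}] * c[5] */
            m7 = vmla_lane_s32(m7, g.val[0], c67, 0);        /* += src[{30,33}] * c[6] */

            /* Sum all values and extract the scalar result. */
            int32x2_t s = vadd_s32(vadd_s32(vadd_s32(m0, m1), vadd_s32(m2, m3)),
                                   vadd_s32(vadd_s32(m4, m5), vadd_s32(m6, m7)));
            s = vpadd_s32(s, s);
            return vget_lane_s32(s, 0);
        }

    As in the assembly, the only data movement beyond the de-interleaving loads is the two VEXT and one VTRN fix-ups.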


    hth
    s.