Implementation in NEON of non-uniform address jumps

  • Note: This was originally posted on 28th June 2012 at http://forums.arm.com

    Assuming that "jump=8" is a constant, then there is no benefit in performing the non-contiguous random loads. What you appear to be trying to compute is the sum of:

    src[ 0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,33,34,37]
    * c[ 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 6, 7, 7]


    The only memory locations you don't use are src[1,2,5,32,35,36] out of the array src[0..37], at which point loading them in and ignoring them is likely faster than avoiding loading them.
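
As a point of reference, the sum described above can be written as a scalar loop over two index tables. This is only a sketch: the names `sumfunc_ref`, `SRC_IDX` and `C_IDX` are made up for illustration, and the tables are copied directly from the two index rows listed in the post. It assumes `c` holds 8 values and `src` holds at least 38.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index tables copied from the two rows in the post: term k is
 * src[SRC_IDX[k]] * c[C_IDX[k]]. The names are illustrative only. */
static const uint8_t SRC_IDX[32] = {
     0,  3,  4,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 37
};
static const uint8_t C_IDX[32] = {
    0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3,
    4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 6, 7, 7
};

/* Scalar reference: c has 8 entries, src has at least 38 entries. */
int32_t sumfunc_ref(const int32_t *c, const int32_t *src)
{
    assert(c != NULL && src != NULL);
    int32_t sum = 0;
    for (size_t k = 0; k < 32; k++)
        sum += src[SRC_IDX[k]] * c[C_IDX[k]];
    return sum;
}
```

With every coefficient set to 1 and src[i] = i, this returns 592, which is the sum of the 32 indices that are actually used. A reference like this is handy for checking a vectorised version against known inputs.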

    Using VLD3, you can automatically pair most of the locations that use the same coefficient, e.g. src[0,3,6] all use coefficient[0], and a few VEXT/VTRN instructions can correct the rest. After that you can perform multiplies and multiply-accumulates on pairs by each scalar coefficient, and then sum up at the end. The result is something like:

    int sumfunc (int *c, int *src):

        // r0 = c, r1 = src
        VLD1.32  {d0,d1,d2,d3},[r0]  // c[{0,1},{2,3},{4,5},{6,7}]
        VLD3.32  {d4,d5,d6},[r1]!    // src[{ 0, 3},{ -, 4},{ -, -}]
        // d8-to-d15 left intact to avoid ABI required preserve and restore
        VLD3.32  {d17,d18,d19},[r1]! // src[{ 6, 9},{ 7,10},{ 8,11}]
        VLD3.32  {d20,d21,d22},[r1]! // src[{12,15},{13,16},{14,17}]
        VLD3.32  {d23,d24,d25},[r1]! // src[{18,21},{19,22},{20,23}]
        VLD3.32  {d26,d27,d28},[r1]! // src[{24,27},{25,28},{26,29}]
        VLD3.32  {d29,d30,d31},[r1]! // src[{30,33},{31,34},{ -, -}]
        VLD1.32  {d7},[r1]           // src[{ -,37}]
        VEXT.8   d5,d5,d21,#4        // d5  = src[{ 4,13}]
        VEXT.8   d21,d21,d27,#4      // d21 = src[{16,25}]
        VTRN.32  d7,d27              // d27 = src[{37,28}]
        VMUL.I32 d4,d4,d0[0]         // src[{ 0, 3}] * c[0]
        VMUL.I32 d5,d5,d0[1]         // src[{ 4,13}] * c[1]
        VMUL.I32 d16,d30,d3[1]       // src[{31,34}] * c[7]
        VMUL.I32 d17,d17,d0[0]       // src[{ 6, 9}] * c[0]
        VMUL.I32 d18,d18,d0[1]       // ...
        VMUL.I32 d19,d19,d1[0]
        VMUL.I32 d20,d20,d1[1]
        VMUL.I32 d21,d21,d2[0]
        VMLA.I32 d4,d22,d1[0]        // += src[{14,17}] * c[2]
        VMLA.I32 d5,d23,d1[1]        // += src[{18,21}] * c[3]
        VMLA.I32 d16,d24,d2[0]       // ...
        VMLA.I32 d17,d25,d2[1]
        VMLA.I32 d18,d26,d3[0]
        VMLA.I32 d19,d27,d3[1]
        VMLA.I32 d20,d28,d2[1]
        VMLA.I32 d21,d29,d3[0]
        VADD.I32 q2,q2,q8            // Sum all values
        VADD.I32 q3,q9,q10
        VADD.I32 q0,q2,q3
        VADD.I32 d0,d0,d1
        VPADD.I32 d0,d0,d0           // Final sum to s0
        VMOV.32  r0,d0[0]            // Move result to return value
        BX       lr                  // Return
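
The lane shuffling in the listing can be followed more easily with a small portable C model. The sketch below is not NEON code: `dreg`, `vld3_32`, `vext_4`, `vtrn_32`, `mla` and `sumfunc_model` are made-up names that imitate the effect of the corresponding instructions on two 32-bit lanes, and the register numbers simply mirror the assembly. It assumes small inputs so that the 32-bit sums do not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 64-bit D register viewed as two 32-bit lanes. */
typedef struct { int32_t lane[2]; } dreg;

/* VLD3.32 {a,b,c},[p]! : de-interleave six consecutive words, advance p. */
static const int32_t *vld3_32(const int32_t *p, dreg *a, dreg *b, dreg *c)
{
    for (int i = 0; i < 2; i++) {
        a->lane[i] = p[3 * i];
        b->lane[i] = p[3 * i + 1];
        c->lane[i] = p[3 * i + 2];
    }
    return p + 6;
}

/* VEXT.8 d,n,m,#4 : top word of n followed by bottom word of m. */
static dreg vext_4(dreg n, dreg m)
{
    dreg r = { { n.lane[1], m.lane[0] } };
    return r;
}

/* VTRN.32 a,b : exchange a[1] with b[0]. */
static void vtrn_32(dreg *a, dreg *b)
{
    int32_t t = a->lane[1];
    a->lane[1] = b->lane[0];
    b->lane[0] = t;
}

/* VMLA.I32 acc,n,scalar (VMUL is the same with acc starting at zero). */
static dreg mla(dreg acc, dreg n, int32_t s)
{
    acc.lane[0] += n.lane[0] * s;
    acc.lane[1] += n.lane[1] * s;
    return acc;
}

int32_t sumfunc_model(const int32_t *c, const int32_t *src)
{
    assert(c != NULL && src != NULL);
    dreg d[32] = { { { 0, 0 } } };
    const dreg zero = { { 0, 0 } };
    const int32_t *p = src;

    p = vld3_32(p, &d[4],  &d[5],  &d[6]);
    p = vld3_32(p, &d[17], &d[18], &d[19]);
    p = vld3_32(p, &d[20], &d[21], &d[22]);
    p = vld3_32(p, &d[23], &d[24], &d[25]);
    p = vld3_32(p, &d[26], &d[27], &d[28]);
    p = vld3_32(p, &d[29], &d[30], &d[31]);
    d[7].lane[0] = p[0];                  /* VLD1.32 {d7},[r1] */
    d[7].lane[1] = p[1];

    d[5]  = vext_4(d[5],  d[21]);         /* src[{ 4,13}] */
    d[21] = vext_4(d[21], d[27]);         /* src[{16,25}] */
    vtrn_32(&d[7], &d[27]);               /* d27 = src[{37,28}] */

    d[4]  = mla(zero, d[4],  c[0]);
    d[5]  = mla(zero, d[5],  c[1]);
    d[16] = mla(zero, d[30], c[7]);
    d[17] = mla(zero, d[17], c[0]);
    d[18] = mla(zero, d[18], c[1]);
    d[19] = mla(zero, d[19], c[2]);
    d[20] = mla(zero, d[20], c[3]);
    d[21] = mla(zero, d[21], c[4]);

    d[4]  = mla(d[4],  d[22], c[2]);
    d[5]  = mla(d[5],  d[23], c[3]);
    d[16] = mla(d[16], d[24], c[4]);
    d[17] = mla(d[17], d[25], c[5]);
    d[18] = mla(d[18], d[26], c[6]);
    d[19] = mla(d[19], d[27], c[7]);
    d[20] = mla(d[20], d[28], c[5]);
    d[21] = mla(d[21], d[29], c[6]);

    /* The VADD/VPADD tail adds every lane of d4, d5 and d16..d21. */
    static const int acc[8] = { 4, 5, 16, 17, 18, 19, 20, 21 };
    int32_t sum = 0;
    for (size_t k = 0; k < 8; k++)
        sum += d[acc[k]].lane[0] + d[acc[k]].lane[1];
    return sum;
}
```

Feeding the model src[i] = i with a single non-zero coefficient shows which source elements each coefficient ends up multiplying, which makes it a quick check on the lane comments in the assembly.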


    hth
    s.
  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    Okay, and the second half of D0[0] can be used by doing this:


    ADD r2,r2,#2

    VLD1.16 {D0[0]},[r2]


    Right?
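
For what it is worth, the 16-bit lanes of d0 can be pictured with a small C model. This is a sketch rather than NEON code: the union `dreg_view` and the helper `vld1_16_lane` are made-up names, and the overlay only matches the register layout on a little-endian host. In the register itself, the upper half of the 32-bit lane d0[0] is the 16-bit lane d0[1], so a second VLD1.16 into d0[0] would overwrite the first value rather than sit beside it.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative view of one D register (hypothetical name).
 * h[] are the four .16 lanes, w[] the two .32 lanes; w[0] plays the role
 * of s0. The overlay matches the register only on a little-endian host. */
typedef union {
    uint16_t h[4];
    uint32_t w[2];
} dreg_view;

/* Model of VLD1.16 {d[lane]},[p]: replace one 16-bit lane, keep the others. */
static void vld1_16_lane(dreg_view *d, int lane, const uint16_t *p)
{
    assert(lane >= 0 && lane < 4);
    d->h[lane] = *p;
}
```

Loading 0x1111 into lane 0 and 0x2222 into lane 1 leaves 0x22221111 in the low 32 bits, whereas loading both into lane 0 leaves only 0x2222.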
  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    I was facing a problem implementing something like this:
    VLD1.32 {d0[0]},[r2]

    I need to load a single 16-bit element into s0:
    VLD1.16 {s0},[r2]
    The above line gives me an error. It says only doubles or quads may be loaded at a time.
  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    Question: How can we rename the ARM register [%2] to r2?

  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    You used this command in your previous post:

        VLD1.32 { d0[0] },[r2]    // j==0, offset 0
        ADD  r2,r2,#64

    I am trying to do something like this:

        VLD1.32 { d0[0] },[%2]    // [%2] is pointing to src
        ADD  %2,%2,#64            // This is NOT WORKING!! ---EDIT: IT WORKED


      : "r"( n ), "r"( res ), "r"( src ),"r"( c ) //INPUT data
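
One thing worth checking in a snippet like this: src is listed as an input-only operand ("r"), and GCC assumes the asm does not modify input-only operands, so an ADD that writes to %2 may only work by luck. A sketch of an alternative, assuming GCC or Clang targeting 32-bit ARM with NEON enabled, is to declare the pointer as a read-write operand ("+r") and give it a symbolic name, so the template can say %[src] instead of %2. The helper name `load_first_and_skip` and the portable fallback branch are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Loads one 32-bit word from *psrc, then advances *psrc by 64 bytes.
 * Hypothetical helper, for illustration only. */
static int32_t load_first_and_skip(const int32_t **psrc)
{
    assert(psrc != NULL && *psrc != NULL);
    const int32_t *src = *psrc;
    int32_t out;
#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_NEON)
    __asm__ volatile(
        "vld1.32 {d0[0]}, [%[src]]   \n\t"  /* load lane 0 from src        */
        "add     %[src], %[src], #64 \n\t"  /* src is read and written     */
        "vmov.32 %[out], d0[0]       \n\t"  /* copy lane to core register  */
        : [src] "+r"(src), [out] "=r"(out)  /* "+r" marks a read-write operand */
        :
        : "d0", "memory");
#else
    /* Portable fallback with the same effect, for non-ARM builds. */
    out = *src;
    src += 64 / sizeof *src;
#endif
    *psrc = src;
    return out;
}
```

Symbolic names do not rename a register; they only let the template refer to an operand by name, and the compiler still chooses which register is used.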


  • Note: This was originally posted on 9th July 2012 at http://forums.arm.com

    Would writing the program in intrinsics rather than asm be better (or at least allow the whole program to work)?
  • Note: This was originally posted on 9th July 2012 at http://forums.arm.com

    What is the maximum number of registers that may ideally be used?

    I'm asking this because I inserted a NEON version of a nested loop in my program which used %0, %1, %2 up to %11.
    This program worked correctly as a standalone, but when put in a program consisting of thousands of lines of code the compiler simply overlooked the asm.

    Thanks.

  • Note: This was originally posted on 6th July 2012 at http://forums.arm.com

    Well, I tried a bit and found that no more than 12 variables are allowed to be assigned registers.


    Is this correct?
    If yes, what is the reason for exactly 12 being allowed?
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com


    Are you using "-mfloat-abi=softfp -mfpu=neon" on the GCC command line?

    Yes

    Also, you really might want to consider using the intrinsics instead of inline asm.

    What is the reason for this problem then?
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com

    I was writing NEON code and got the following error:


    In function 'neon_filter':
    interpol.c:176: error: impossible constraint in 'asm'


    What could be the possible reasons?

    [Are there too many I/O variables?

    : "+r"(col),"+r"(sum),"+r"(src),"+r"(dst),"=w"( c ),"+r"(cStride),"+r"(maxVal),"=w"(shift),"+r"(offset)//0,1,2,3,4,5,6,7,8
    : "r"(width),"r"(isLast)//9,10
    : "q0", "q1", "q2","q3", "q4"
    ]
  • Note: This was originally posted on 1st July 2012 at http://forums.arm.com

    Thanks


    So D0 may now be addressed and filled like this:

    VLD1.16 {D0[0]},[r2]
    ADD r2,r2,#2
    VLD1.16 {D0[1]},[r2]
    ADD r2,r2,#2
    VLD1.16 {D0[2]},[r2]
    ADD r2,r2,#2
    VLD1.16 {D0[3]},[r2]  
    ?
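
If the four halfwords are adjacent in memory, as the #2 increments suggest, filling the lanes one at a time should leave the same register contents as a single VLD1.16 {d0},[r2]. The sketch below checks that in C. It uses the NEON intrinsics from arm_neon.h when the compiler targets NEON and plain C otherwise; the function names `load_lane_by_lane` and `load_contiguous` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Fill four 16-bit lanes one at a time, as in the post. */
static void load_lane_by_lane(const int16_t *p, int16_t out[4])
{
    assert(p != NULL && out != NULL);
#if defined(__ARM_NEON)
    int16x4_t d = vdup_n_s16(0);
    d = vld1_lane_s16(p + 0, d, 0);   /* VLD1.16 {d0[0]},[r2]                */
    d = vld1_lane_s16(p + 1, d, 1);   /* ADD r2,r2,#2 ; VLD1.16 {d0[1]},[r2] */
    d = vld1_lane_s16(p + 2, d, 2);
    d = vld1_lane_s16(p + 3, d, 3);
    vst1_s16(out, d);
#else
    for (int lane = 0; lane < 4; lane++)
        out[lane] = p[lane];
#endif
}

/* One contiguous load of the same four halfwords. */
static void load_contiguous(const int16_t *p, int16_t out[4])
{
    assert(p != NULL && out != NULL);
#if defined(__ARM_NEON)
    vst1_s16(out, vld1_s16(p));       /* VLD1.16 {d0},[r2] */
#else
    memcpy(out, p, 4 * sizeof *p);
#endif
}
```

The lane-by-lane form is mainly useful when the elements are not adjacent; for adjacent data the single load does the same job in one instruction.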


    Yes, and I do need to work on the algorithm.
  • Note: This was originally posted on 17th July 2012 at http://forums.arm.com

    Yes, maxVal is 255.
    Do we use vqshrn in this case?
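
Whether VQSHRN fits depends on the signedness of the intermediate values. As a rough scalar model: VQSHRN.S16 saturates to the signed range -128..127, so it cannot produce 255, whereas VQSHRUN.S16 shifts a signed value and saturates to 0..255. The function names below are illustrative, and the 16-bit intermediate width is an assumption, since the thread does not state it.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right (floor division by 2^shift), written without
 * relying on implementation-defined shifts of negative values. */
static int asr(int x, int shift)
{
    return x >= 0 ? x >> shift : -((-x + (1 << shift) - 1) >> shift);
}

static int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* One lane of VQSHRN.S16 #shift: signed in, signed 8-bit out. */
static int8_t qshrn_s16(int16_t x, int shift)
{
    assert(shift >= 1 && shift <= 8);
    return (int8_t)clamp(asr(x, shift), -128, 127);
}

/* One lane of VQSHRUN.S16 #shift: signed in, unsigned 8-bit out. */
static uint8_t qshrun_s16(int16_t x, int shift)
{
    assert(shift >= 1 && shift <= 8);
    return (uint8_t)clamp(asr(x, shift), 0, 255);
}
```

For an input of 4080 and a shift of 4, the signed form saturates at 127 while the unsigned form yields 255, so the unsigned-saturating variant looks like the closer match for a 0..255 clamp.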