
Implementation in NEON of non-uniform address jumps

  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    You used these instructions in your previous post:

        VLD1.32  {d0[0]},[r2]    // j==0, offset 0
        ADD      r2,r2,#64

    I am trying to do something like this:

    VLD1.32  {d0[0]},[%2]    // [%2] is pointing to src
    ADD      %2,%2,#64       // This is NOT WORKING!! --- EDIT: IT WORKED


      : "r"( n ), "r"( res ), "r"( src ),"r"( c ) //INPUT data


  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    Question: how can we force the compiler to place the operand %2 in a specific ARM register such as r2?
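
    For what it's worth, one standard way to do this with GCC (not shown in the thread) is an explicit register variable, which pins a local variable to a named register so the assembly can refer to r2 directly; the function and variable names here are illustrative:

        #include <stdint.h>

        int32_t load_via_r2(int32_t *src)
        {
            /* GCC explicit register variable: p is kept in r2. */
            register int32_t *p __asm__("r2") = src;
            int32_t out;
            __asm__ volatile (
                "VLD1.32  {d0[0]}, [r2]  \n\t"   /* r2 holds p here     */
                "VMOV.32  %0, d0[0]      \n\t"   /* lane 0 out to 'out' */
                : "=r"(out)
                : "r"(p)                         /* keeps p live in r2  */
                : "d0"
            );
            return out;
        }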

  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    I was facing a problem implementing something like this:
    VLD1.32 {d0[0]},[r2]

    I need to load a single 16 bit element to s0:
    VLD1.16 {s0},[r2]
    The above line gives me an error. It says only doubleword or quadword registers may be loaded at a time.
  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    Okay, and the second 16-bit half of D0[0] can be utilized by doing:


    ADD r2,r2,#2

    VLD1.16 {D0[1]},[r2]    // lane 1: the upper 16-bit half


    Right?
  • Note: This was originally posted on 28th June 2012 at http://forums.arm.com

    Assuming that "jump=8" is a constant, there is no benefit in performing the non-contiguous random loads. What you appear to be trying to compute is the sum of:

    src[ 0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,33,34,37]
    * c[ 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 6, 7, 7]


    The only memory locations you don't use are src[1,2,5,32,35,36] out of the array src[0..37], at which point loading them in and ignoring them is likely faster than avoiding loading them.
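
    (For reference, the same computation in plain C, with the two index arrays transcribed from above; the name sumfunc_ref is just for illustration:)

        static const int si[32] = { 0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,
                                   17,18,19,20,21,22,23,24,25,26,27,28,29,30,
                                   31,33,34,37 };
        static const int ci[32] = { 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4,
                                    2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6,
                                    7, 6, 7, 7 };

        int sumfunc_ref(const int *c, const int *src)
        {
            int sum = 0;
            for (int i = 0; i < 32; i++)
                sum += src[si[i]] * c[ci[i]];
            return sum;
        }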

    Using VLD3, you can automatically pair most of the locations using the same coefficients, e.g. sum[0,3,6] all use coefficient[0], and a few VEXT/VTRN can correct the rest. After that you can perform multiplies and multiply-accumulates on pairs via each scalar coefficient, and then sum up at the end. The result is something like:

        // int sumfunc(int *c, int *src)
        sumfunc:

        // r0 = c, r1 = src
        VLD1.32  {d0,d1,d2,d3},[r0]  // c[{0,1},{2,3},{4,5},{6,7}]
        VLD3.32  {d4,d5,d6},[r1]!    // src[{ 0, 3},{ -, 4},{ -, -}]
        // d8-d15 left intact to avoid the ABI-required preserve and restore
        VLD3.32  {d17,d18,d19},[r1]! // src[{ 6, 9},{ 7,10},{ 8,11}]
        VLD3.32  {d20,d21,d22},[r1]! // src[{12,15},{13,16},{14,17}]
        VLD3.32  {d23,d24,d25},[r1]! // src[{18,21},{19,22},{20,23}]
        VLD3.32  {d26,d27,d28},[r1]! // src[{24,27},{25,28},{26,29}]
        VLD3.32  {d29,d30,d31},[r1]! // src[{30,33},{31,34},{ -, -}]
        VLD1.32  {d7},[r1]           // src[{ -,37}]
        VEXT.8   d5,d5,d21,#4        // d5  = src[{ 4,13}]
        VEXT.8   d21,d21,d27,#4      // d21 = src[{16,25}]
        VTRN.32  d7,d27              // d27 = src[{37,28}]
        VMUL.I32 d4,d4,d0[0]         // src[{ 0, 3}] * c[0]
        VMUL.I32 d5,d5,d0[1]         // src[{ 4,13}] * c[1]
        VMUL.I32 d16,d30,d3[1]       // src[{31,34}] * c[7]
        VMUL.I32 d17,d17,d0[0]       // src[{ 6, 9}] * c[0]
        VMUL.I32 d18,d18,d0[1]       // ...
        VMUL.I32 d19,d19,d1[0]
        VMUL.I32 d20,d20,d1[1]
        VMUL.I32 d21,d21,d2[0]
        VMLA.I32 d4,d22,d1[0]        // += src[{14,17}] * c[2]
        VMLA.I32 d5,d23,d1[1]        // += src[{18,21}] * c[3]
        VMLA.I32 d16,d24,d2[0]       // ...
        VMLA.I32 d17,d25,d2[1]
        VMLA.I32 d18,d26,d3[0]
        VMLA.I32 d19,d27,d3[1]
        VMLA.I32 d20,d28,d2[1]
        VMLA.I32 d21,d29,d3[0]
        VADD.I32 q2,q2,q8            // Sum all values
        VADD.I32 q3,q9,q10
        VADD.I32 q0,q2,q3
        VADD.I32 d0,d0,d1
        VPADD.I32 d0,d0,d0           // Final sum to s0
        VMOV.32  r0,d0[0]            // Move result to return value
        BX       lr                  // Return


    hth
    s.
  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    "VLD1.16 {D0[0]},[r2]" will load the bottom 16-bits of D0 (which are the same as the bottom 16-bits of S0).

    hth
    s.
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com

    Yes, though the 2-byte increment between 16-bit loads comes for free with post-increment addressing:

      VLD1.16  {D0[0]},[r2]!
      VLD1.16  {D0[1]},[r2]!
      VLD1.16  {D0[2]},[r2]!
      VLD1.16  {D0[3]},[r2]!

    Which in itself would be better implemented as a single load of all four 16-bit lanes:

      VLD1.16  {D0},[r2]!

    hth
    s.
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com

    Are you using "-mfloat-abi=softfp -mfpu=neon" on the GCC command line?
    Also, you really might want to consider using the intrinsics instead of inline asm.
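
    To give a flavour of the intrinsics, here is a small fragment in the spirit of the VMUL/VMLA sequence earlier in the thread: vector loads, multiply and multiply-accumulate by a scalar coefficient, then a pairwise final sum. The function name and the flat 8-element input are illustrative only:

        #include <arm_neon.h>

        int32_t dot8(const int32_t *src, int32_t c0, int32_t c1)
        {
            int32x4_t a   = vld1q_s32(src);       /* like VLD1.32 {d..}  */
            int32x4_t b   = vld1q_s32(src + 4);
            int32x4_t acc = vmulq_n_s32(a, c0);   /* VMUL.I32 by scalar  */
            acc = vmlaq_n_s32(acc, b, c1);        /* VMLA.I32 by scalar  */
            int32x2_t s   = vadd_s32(vget_low_s32(acc),
                                     vget_high_s32(acc));
            s = vpadd_s32(s, s);                  /* VPADD.I32 final sum */
            return vget_lane_s32(s, 0);           /* VMOV.32 r0,d0[0]    */
        }

    The compiler then handles register allocation, scheduling and the ABI for you, while still emitting much the same instructions.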

    hth
    s.
  • Note: This was originally posted on 6th July 2012 at http://forums.arm.com

    12 is a lot of simultaneous registers to ask the compiler for, given that it only ever had 14 to start with once the stack pointer and program counter are deducted.
    A further two registers may already be permanently allocated to a stack limit, frame pointer, global table pointer or other purpose, depending on the platform requirements.

    You really should consider whether what you are doing is appropriate for inline assembly, versus a naked function, a standalone assembly file, or the intrinsics.
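
    As a sketch of the naked-function route (the function name is illustrative): the body must be pure assembly, arguments arrive per the AAPCS in r0/r1, the compiler emits no prologue or epilogue, and it competes for no registers inside:

        #include <stdint.h>

        __attribute__((naked)) int32_t first_lane(const int32_t *src)
        {
            __asm__ volatile (
                "VLD1.32  {d0[0]}, [r0]  \n\t"   /* r0 = src per the AAPCS */
                "VMOV.32  r0, d0[0]      \n\t"   /* return value in r0     */
                "BX       lr             \n\t"
            );
        }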

    hth
    s.
  • Note: This was originally posted on 16th July 2012 at http://forums.arm.com

    It is reasonably unlikely that the compiler is doing something it shouldn't be at "-O3"; the more likely scenario is that:

    1. for inline assembly, you aren't describing the required/expected side-effects, so the compiler is [correctly] assuming it can optimize the code out (e.g. if your inline Neon code writes all of its results back to memory, are you declaring memory as having been modified, and/or the assembly as being "volatile"?); see the sketch after this list.

    2. for C code, you are relying on behaviour which the C standard does not guarantee to be preserved, and which can thus legally be optimized out (e.g. relying on the values of variables which have fallen out of scope, or relying on the exact number of memory accesses to objects not declared as volatile).
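
    A minimal sketch of point 1, assuming hypothetical pointers src and dst: the statement is marked "volatile" and "memory" is declared clobbered, so the compiler must keep the asm and must not assume memory is unchanged across it:

        #include <stdint.h>

        void copy4(int32_t *dst, const int32_t *src)   /* hypothetical */
        {
            __asm__ volatile (
                "VLD1.32  {d0,d1}, [%1]  \n\t"   /* read 4 words from src */
                "VST1.32  {d0,d1}, [%0]  \n\t"   /* write them to dst     */
                :
                : "r"(dst), "r"(src)
                : "d0", "d1", "memory"           /* results land in memory */
            );
        }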

    hth
    s.