Implementation in NEON of non-uniform address jumps

  • Note: This was originally posted on 25th July 2012 at http://forums.arm.com


    A maxVal of 255 may well map onto a single instruction.

    Could you please explain how?
    The variable Sum to be clipped is of type short int and it has to be kept in the range [0,255].

    Thanks.
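
    One way a clip to [0,255] can collapse into a single instruction is a saturating narrow: NEON's VQMOVUN.S16 converts signed 16-bit lanes to unsigned 8-bit lanes and saturates to 0..255 on the way. The sketch below is an assumption about what the quoted reply had in mind, not something this thread confirms. The helper name clip_s16_to_u8 is hypothetical, and a scalar fallback is included so the code also builds without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: clip eight signed 16-bit sums to [0,255]. */
static void clip_s16_to_u8(const int16_t *sum, uint8_t *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t v = vld1q_s16(sum);   /* load 8 x int16 */
    uint8x8_t r = vqmovun_s16(v);   /* saturating narrow: <0 -> 0, >255 -> 255 */
    vst1_u8(out, r);                /* store 8 x uint8 */
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < 8; ++i) {
        int16_t s = sum[i];
        out[i] = (uint8_t)(s < 0 ? 0 : (s > 255 ? 255 : s));
    }
#endif
}
```

    This only applies because maxVal is exactly 255; any other upper bound would need an explicit min/max pair.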
  • Note: This was originally posted on 11th July 2012 at http://forums.arm.com

    Ignore this.
  • Note: This was originally posted on 12th July 2012 at http://forums.arm.com

    This is what the ARM guide says:

    5.12.5. VLDn and VSTn (multiple n-element structures)
    Vector Load multiple n-element structures. It loads multiple n-element structures from memory into one or more NEON registers, with de-interleaving (unless n == 1). Every element of each register is loaded.

    Vector Store multiple n-element structures. It stores multiple n-element structures to memory from one or more NEON registers, with interleaving (unless n == 1). Every element of each register is stored.


    Syntax
    Vopn{cond}.datatype list, [Rn{@align}]{!}
    Vopn{cond}.datatype list, [Rn{@align}], Rm
    where:

    op must be either LD or ST.

    n must be one of 1, 2, 3, or 4.

    cond is an optional condition code (see Condition codes).

    datatype: see Table 5.14 for options.

    list specifies the NEON register list. See Table 5.14 for options.

    Rn is the ARM register containing the base address. Rn cannot be r15.

    align specifies an optional alignment. See Table 5.14 for options.

    ! If ! is present, Rn is updated to (Rn + the number of bytes transferred by the instruction). The update occurs after all the loads or stores have taken place.

    Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to (Rn + Rm) after the address is used to access memory. Rm cannot be r13 or r15.


    Hence a command like
    "vld1.16   q1, [%4] ,%5                      \n\t"
    should work (where %4 points to an array and %5 holds an integer value).
    However, I get this error:
    Error: Neon quad precision register expected -- `vld1.16 q1,[ip],r4'
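
    The syntax quoted from the guide takes a register list, and register lists are normally written in braces, so writing the operand as {q1} (or {d2, d3}) instead of a bare q1 is worth trying. That is an assumption based on the quoted syntax, not a fix confirmed in this thread. What the register post-index form does can be modelled in plain C; the helper name load8_postindex is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of "vld1.16 {q1}, [Rn], Rm":
 * load eight 16-bit elements from *rn, then advance *rn by rm BYTES. */
static void load8_postindex(const uint8_t **rn, ptrdiff_t rm, int16_t q[8])
{
    memcpy(q, *rn, 8 * sizeof(int16_t)); /* the load uses the old address */
    *rn += rm;                           /* Rn = Rn + Rm after the access */
}
```

    Because Rm is a register, the step can change on every iteration, which is what a non-uniform address jump needs.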
  • Note: This was originally posted on 13th July 2012 at http://forums.arm.com

    Thank you for that answer.
    We have now discovered that the debug version of the program containing the ASM works correctly;
    however, the release version gets messed up by some "O3" compiler optimizations.
    Any solutions?
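
    Inline asm that only breaks at -O3 often has an incomplete contract with the compiler: registers that are modified but not listed as clobbers, memory that is written without a "memory" clobber, or a missing volatile. Whether any of these is the cause here cannot be told from the thread. The sketch below shows a block that declares everything it touches; add8_s16 is a hypothetical example, with a C fallback for non-ARM builds.

```c
#include <stdint.h>

/* Hypothetical example: dst[i] = a[i] + b[i] for eight 16-bit elements. */
static void add8_s16(int16_t *dst, const int16_t *a, const int16_t *b)
{
#if defined(__GNUC__) && defined(__arm__) && \
    (defined(__ARM_NEON) || defined(__ARM_NEON__))
    __asm__ volatile(
        "vld1.16  {q0}, [%[a]]    \n\t"
        "vld1.16  {q1}, [%[b]]    \n\t"
        "vadd.i16 q0, q0, q1      \n\t"
        "vst1.16  {q0}, [%[dst]]  \n\t"
        : /* no register outputs; the result goes to memory */
        : [dst] "r"(dst), [a] "r"(a), [b] "r"(b)
        : "q0", "q1",  /* NEON registers the block overwrites */
          "memory");   /* the block reads and writes memory */
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; ++i)
        dst[i] = (int16_t)(a[i] + b[i]);
#endif
}
```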

  • Note: This was originally posted on 17th July 2012 at http://forums.arm.com

    Thank You !!

    The clipping now looks like:


    "cmp        %10,#1      \n\t" //if(isLast)
    "bne        3f          \n\t"
    "vmin.s32   d4,d4,d13   \n\t"
    "vmax.s32   d4,d4,d12   \n\t"
    "3:                     \n\t"

    //d13 contains maxVal(255)
    //d12 contains 0


    Time consumed by this portion of the code has dropped from 223ms to 18ms
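
    The same min/max clamp can be written with NEON intrinsics, which leaves register allocation to the compiler. This is a sketch rather than the poster's code: the helper name clamp2_s32 is hypothetical, it handles the two 32-bit lanes that a D register holds, and a scalar fallback is included for builds without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: clamp two 32-bit lanes to [0, max_val],
 * but only when is_last is set, mirroring the cmp/bne above. */
static void clamp2_s32(int32_t v[2], int32_t max_val, int is_last)
{
    if (!is_last)
        return;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int32x2_t x = vld1_s32(v);
    x = vmin_s32(x, vdup_n_s32(max_val)); /* like vmin.s32 d4,d4,d13 */
    x = vmax_s32(x, vdup_n_s32(0));       /* like vmax.s32 d4,d4,d12 */
    vst1_s32(v, x);
#else
    for (int i = 0; i < 2; ++i) {
        if (v[i] > max_val) v[i] = max_val;
        if (v[i] < 0)       v[i] = 0;
    }
#endif
}
```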
  • Note: This was originally posted on 17th July 2012 at http://forums.arm.com

    Yes, maxVal is 255.
    Do we use vqshrn in this case?
  • Note: This was originally posted on 1st July 2012 at http://forums.arm.com

    Thanks


    So D0 may now be addressed and filled by:

    VLD1.16 {D0[0]},[r2]
    ADD r2,r2,#2
    VLD1.16 {D0[1]},[r2]
    ADD r2,r2,#2
    VLD1.16 {D0[2]},[r2]
    ADD r2,r2,#2
    VLD1.16 {D0[3]},[r2]  
    ?


    Yes, and I do need to work on the algorithm.
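
    The four lane loads above step r2 by a fixed 2 bytes, which a single VLD1.16 {D0},[r2] would also cover. Lane loads pay off when each element comes from a different, non-uniform offset. The sketch below shows that case with the vld1_lane_s16 intrinsic; the helper name gather4_s16 and the offset array (counted in elements, not bytes) are hypothetical, and a scalar fallback is included for builds without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = base[offs[i]] for four 16-bit elements. */
static void gather4_s16(const int16_t *base, const int offs[4], int16_t out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x4_t d0 = vdup_n_s16(0);
    /* The lane index must be a compile-time constant, so no loop here. */
    d0 = vld1_lane_s16(base + offs[0], d0, 0);
    d0 = vld1_lane_s16(base + offs[1], d0, 1);
    d0 = vld1_lane_s16(base + offs[2], d0, 2);
    d0 = vld1_lane_s16(base + offs[3], d0, 3);
    vst1_s16(out, d0);
#else
    for (int i = 0; i < 4; ++i)
        out[i] = base[offs[i]];
#endif
}
```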
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com

    I was writing NEON code and got the following error:


    In function 'neon_filter':
    interpol.c:176: error: impossible constraint in 'asm'


    What could be the possible reasons?

    [Do I have too many I/O variables?

    : "+r"(col),"+r"(sum),"+r"(src),"+r"(dst),"=w"( c ),"+r"(cStride),"+r"(maxVal),"=w"(shift),"+r"(offset)//0,1,2,3,4,5,6,7,8
    : "r"(width),"r"(isLast)//9,10
    : "q0", "q1", "q2","q3", "q4"
    ]
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com


    Are you using "-mfloat-abi=softfp -mfpu=neon" on the GCC command line?

    Yes

    Also, you really might want to consider using the intrinsics instead of inline asm.

    What is the reason for this problem then?
  • Note: This was originally posted on 6th July 2012 at http://forums.arm.com

    I experimented a bit and found that no more than 12 variables can be assigned to registers.


    Is this correct?
    If so, why are exactly 12 allowed?
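
    ARM has 16 core registers, of which r13 (sp) and r15 (pc) cannot be handed out as operands, and the compiler needs some of the rest for itself, so a ceiling of around 12 register operands is plausible; that reading is an assumption, not something confirmed in this thread. One way to stay under such a limit is to pass a single pointer to a parameter block and load the fields inside the asm. The struct filter_params and the function load_and_sum_params are hypothetical, and the sum exists only to give an observable result.

```c
#include <stdint.h>

/* Hypothetical parameter block; three int32_t fields sit at offsets 0, 4, 8. */
typedef struct {
    int32_t width;
    int32_t is_last;
    int32_t max_val;
} filter_params;

/* Uses ONE input register operand (p) instead of one per parameter. */
static int32_t load_and_sum_params(const filter_params *p)
{
#if defined(__GNUC__) && defined(__arm__)
    int32_t acc, tmp;
    __asm__ volatile(
        "ldr  %[acc], [%[p], #0]      \n\t" /* acc  = p->width   */
        "ldr  %[tmp], [%[p], #4]      \n\t" /* tmp  = p->is_last */
        "add  %[acc], %[acc], %[tmp]  \n\t"
        "ldr  %[tmp], [%[p], #8]      \n\t" /* tmp  = p->max_val */
        "add  %[acc], %[acc], %[tmp]  \n\t"
        : [acc] "=&r"(acc), [tmp] "=&r"(tmp) /* early-clobber: written before p is last read */
        : [p] "r"(p)
        : "memory");
    return acc;
#else
    /* Portable fallback with the same result. */
    return p->width + p->is_last + p->max_val;
#endif
}
```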
  • Note: This was originally posted on 9th July 2012 at http://forums.arm.com

    Ideally, what is the maximum number of registers that may be used?

    I'm asking this because I inserted a NEON version of a nested loop into my program, which used %0, %1, %2 up to %11.
    This program worked correctly as a standalone, but when put in a program consisting of thousands of lines of code the compiler simply overlooked the asm.

    Thanks.

  • Note: This was originally posted on 9th July 2012 at http://forums.arm.com

    Would writing the program in intrinsics rather than asm be better (or at least allow the whole program to work)?