Implementation in NEON of non uniform address jumps

  • Note: This was originally posted on 17th July 2012 at http://forums.arm.com

    ...worked after using volatile before all variables...

    This probably isn't the correct solution (I suspect you just needed to tell GCC that your assembly code modified variables in memory).
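
    As a rough sketch of what "telling GCC" could look like, GCC's extended inline assembly accepts a "memory" clobber, which says the assembly may read or write memory the compiler cannot see. The function name and the empty instruction string below are placeholders (assumptions, not the original poster's code); the real NEON instructions would go in the template string.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a routine whose body is inline assembly that
 * writes through `dst`. The template is left empty so the sketch builds on
 * any GCC/Clang target; it emits no instructions. */
static void asm_routine_placeholder(int16_t *dst) {
    assert(dst != 0);
    __asm__ volatile(
        ""          /* NEON instructions would go here */
        :           /* no register outputs */
        : "r"(dst)  /* input: the pointer the assembly uses */
        : "memory"  /* clobber: the assembly may read or write memory */
    );
}
```

    With the clobber in place the compiler should no longer keep values from memory cached in registers across the asm statement, which is the effect that marking every variable volatile was approximating.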

    this "clipping" takes up almost half the time of execution of the loop

    The likely problem here is that you are moving values back and forth between the Neon and main register file. On most implementations there are significant performance penalties for doing so, so avoid it wherever possible.

    In this particular case you can perform the clipping using Neon instructions (if you are lucky with your choice of maxVal, you may be able to convert the previous shift and clipping to a single VQSHRN instruction). For example:

      // Move constants of zero and maxVal into Neon registers
      VMOV.I16 d0,#0
      VMOV.I16 d1,#maxVal
      ...
      // Perform clipping
      VMAX.S16 d4,d4,d0  // Choose largest of zero and value
      VMIN.S16 d4,d4,d1  // Choose smallest of new value and maxVal
      ...
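
    For checking results, a plain C reference of what the VMAX/VMIN pair does to each lane may help. This is only a scalar model under the assumption of four signed 16-bit lanes in a D register; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4  /* a D register holds four 16-bit lanes */

/* Scalar model of VMAX.S16 against zero followed by VMIN.S16 against maxVal. */
static void clip_s16x4(int16_t v[LANES], int16_t max_val) {
    assert(max_val >= 0);
    for (int i = 0; i < LANES; ++i) {
        int16_t x = v[i];
        if (x < 0) x = 0;              /* VMAX.S16: larger of zero and value */
        if (x > max_val) x = max_val;  /* VMIN.S16: smaller of value and maxVal */
        v[i] = x;
    }
}
```

    For example, clipping the lanes {-5, 0, 100, 300} with a maxVal of 255 should give {0, 0, 100, 255}.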


    hth
    s.
  • Note: This was originally posted on 17th July 2012 at http://forums.arm.com

    A maxVal of 255 may well map onto a single instruction.

    hth
    s.
  • Note: This was originally posted on 25th July 2012 at http://forums.arm.com

    If you have a vector of "signed short" (int16_t) and wish to convert to a vector of "unsigned char" (uint8_t) with corresponding saturation into the range 0 to 255 then you can use "VQMOVUN.s16". Alternatively, you could choose to simultaneously perform an optional shift by using "VQSHRUN.s16".
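
    A per-lane C model of the two instructions, as a sketch for testing against (the function names are invented here, and the shift range of 1 to 8 is the immediate range for a 16-bit source as I understand it):

```c
#include <assert.h>
#include <stdint.h>

/* Model of one lane of VQMOVUN.S16: signed 16-bit to unsigned 8-bit,
 * saturating into 0..255. */
static uint8_t sat_u8_from_s16(int16_t x) {
    if (x < 0) return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

/* Model of one lane of VQSHRUN.S16 #shift: shift right (truncating),
 * then saturate into 0..255. */
static uint8_t shr_sat_u8_from_s16(int16_t x, unsigned shift) {
    assert(shift >= 1 && shift <= 8);
    if (x < 0) return 0;  /* a negative value stays negative after the shift */
    int32_t shifted = (int32_t)x >> shift;
    if (shifted > 255) return 255;
    return (uint8_t)shifted;
}
```

    So 300 saturates to 255 and -3 to 0, while 1000 shifted right by 2 gives 250 and 1024 shifted right by 2 (256) saturates to 255.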

    hth
    s.
  • Note: This was originally posted on 9th July 2012 at http://forums.arm.com

    In my experience unless you are writing only a couple of lines you are far better off using a full-blown dedicated assembler file, which you assemble rather than trying to wedge something through the C compiler. For any non-trivial piece of NEON the overheads of branching into the separate translation unit tend to be inconsequential.

    Iso
  • Note: This was originally posted on 11th July 2012 at http://forums.arm.com

    Would writing the program in intrinsics rather than asm be better (at least allow the whole program to work)?

    Using assembler is fine - but use a whole file of assembler and put it through "as" rather than trying to use the inline assembler through "gcc". The full-blown assembler seems rather less fragile in my experience.

    I don't really use the intrinsics, so no strong opinion there!
  • Note: This was originally posted on 13th July 2012 at http://forums.arm.com

    Without details of what goes wrong it is very hard to provide any specific help. In general "O3" is very aggressive and assumes you have coded to the C standard (this allows it to make more assumptions and so optimize more). If any of your code is "off spec" then you'll start getting issues.

    I would suggest one of:
    * Don't use O3 if you have issues
    * Use "objdump" (GCC) or "fromelf" (RVCT) to dump the disassembly and see if you can spot the issue.
    * Try some smaller C files containing each function and compile most with O2 and then test each file individually with O3. This can help narrow it down.
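
    To illustrate what "off spec" can mean, one common case (an assumption for illustration, not necessarily the problem in this thread) is reading an object through a pointer of an unrelated type, which breaks the C aliasing rules and tends to show up only at higher optimization levels. A sketch of the well-defined alternative using memcpy, assuming a 32-bit IEEE 754 float:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Off spec: *(uint32_t *)&f reads a float through an incompatible pointer
 * type, so an aggressive optimizer is allowed to reorder or drop the access.
 * In spec: copy the bytes instead. */
static uint32_t float_bits(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);  /* assumes a 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

    Compilers typically turn a small fixed-size memcpy like this into a plain register move, so there is normally no cost to doing it the standard way.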
  • Note: This was originally posted on 1st July 2012 at http://forums.arm.com

    Okay and the second half of D0[0] can be utilized by doing {snip}. Right?


    No. Remember this is a load-store architecture. The address is where the data comes from, the register specifier sets where the data gets stored to. You've changed the address, but not the register spec, so you've just overwritten the bottom 16-bits again.

    The first question is what do you mean by "the second half of D0[0]"? D0 is a double word register (8 bytes/64-bits). It is split into a number of lanes, depending on vector element size. In sim's example you are using a 16-bit element size, so 4 lanes per register. In this case you set the bottom 16-bits because you specified lane 0 (D0[0]). But this isn't half of anything - it sets ALL 16-bits of D0[0], and as there are four 16-bit lanes this is setting one quarter of the whole D register.

    If you want to load the second 16-bits of the register then you want D0[1].
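
    The lane behaviour can be modelled in plain C as a sketch, treating the D register as a 64-bit value in which lane n occupies bits 16n to 16n+15. The type and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 64-bit D register viewed as four 16-bit lanes. */
typedef uint64_t dreg_t;

/* Models a single-lane 16-bit load: the address says where the data comes
 * from, the lane says which quarter of the register receives it. The other
 * three lanes are left untouched. */
static dreg_t load_lane_u16(dreg_t d, unsigned lane, const uint16_t *addr) {
    assert(lane < 4);
    unsigned shift = 16u * lane;
    dreg_t mask = (dreg_t)0xFFFFu << shift;
    return (d & ~mask) | ((dreg_t)*addr << shift);
}
```

    Loading twice into lane 0 from two different addresses leaves only the second value in the bottom 16 bits; loading the second value into lane 1 instead places it in bits 16 to 31 and keeps the first.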

    As I mentioned in one of your other posts, if you end up doing a lot of scalar loads you are really defeating the point of using a vector engine, so if you can, restructure your algorithm so that you do not need to do this.

    Iso