Implementation in NEON of non uniform address jumps

  • Note: This was originally posted on 1st July 2012 at http://forums.arm.com

    Okay and the second half of D0[0] can be utilized by doing {snip}. Right?


    No. Remember this is a load-store architecture. The address is where the data comes from, the register specifier sets where the data gets stored to. You've changed the address, but not the register spec, so you've just overwritten the bottom 16-bits again.

    The first question is what do you mean by "the second half of D0[0]"? D0 is a double word register (8 bytes/64-bits). It is split into a number of lanes, depending on vector element size. In sim's example you are using 16-bit elements, so there are 4 lanes per register. In this case you set the bottom 16-bits because you specified lane 0 (D0[0]). But this isn't half of anything - it sets ALL 16-bits of D0[0], and as there are 4 16-bit lanes this is setting one quarter of the whole D register.

    If you want to load the second 16-bits of the register then you want D0[1].
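
As a rough illustration of the lane behaviour described above, here is a minimal scalar sketch in portable C. It is not NEON code: the `d_reg_u16` type and the helper names are hypothetical, made up for this example to model a D register as four 16-bit lanes.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 64-bit D register viewed as four 16-bit lanes. */
typedef struct {
    uint16_t lane[4];
} d_reg_u16;

/* Models "VLD1.16 {Dn[idx]},[addr]": writes one whole 16-bit lane and
   leaves the other three lanes untouched. */
void vld1_16_lane(d_reg_u16 *d, unsigned idx, const uint16_t *addr)
{
    d->lane[idx & 3u] = *addr;
}

/* Changing only the address: the second load overwrites lane 0 again. */
d_reg_u16 load_same_lane_twice(const uint16_t *src)
{
    d_reg_u16 d = {{0, 0, 0, 0}};
    vld1_16_lane(&d, 0, &src[0]);
    vld1_16_lane(&d, 0, &src[1]);
    return d;
}

/* Changing the lane specifier as well: the second value lands in lane 1. */
d_reg_u16 load_two_lanes(const uint16_t *src)
{
    d_reg_u16 d = {{0, 0, 0, 0}};
    vld1_16_lane(&d, 0, &src[0]);
    vld1_16_lane(&d, 1, &src[1]);
    return d;
}
```

With a source of {0x1111, 0x2222}, the first helper leaves only 0x2222 in lane 0, while the second keeps 0x1111 in lane 0 and puts 0x2222 in lane 1.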

    As I mentioned in one of your other posts, if you end up doing a lot of scalar loads you are really defeating the point of using a vector engine, so restructure your algorithm to avoid them if you can.

    Iso
  • Note: This was originally posted on 13th July 2012 at http://forums.arm.com

    Without details of what goes wrong it is very hard to provide any specific help. In general "O3" is very aggressive and assumes you have coded to the C standard (this allows it to make more assumptions and so optimize more). If any of your code is "off spec" then you'll start getting issues.

    I would suggest one of:
    * Don't use O3 if you have issues
    * Look at using "objdump" (GCC) or "fromelf" (RVCT) to dump the disassembly to see if you can spot the issue.
    * Try some smaller C files containing each function and compile most with O2 and then test each file individually with O3. This can help narrow it down.
  • Note: This was originally posted on 11th July 2012 at http://forums.arm.com

    Would writing the program in intrinsics rather than asm be better (at least allow the whole program to work)?

    Using assembler is fine - but use a whole file of assembler and put it through "as" rather than trying to use the inline assembler through "gcc". The full blown assembler seems rather less fragile in my experience.

    I don't really use the intrinsics, so no strong opinion there!
  • Note: This was originally posted on 9th July 2012 at http://forums.arm.com

    In my experience unless you are writing only a couple of lines you are far better off using a full-blown dedicated assembler file, which you assemble rather than trying to wedge something through the C compiler. For any non-trivial piece of NEON the overheads of branching into the separate translation unit tend to be inconsequential.

    Iso
  • Note: This was originally posted on 25th July 2012 at http://forums.arm.com

    If you have a vector of  "signed short" (int16_t) and wish to convert to a vector of "unsigned char" (uint8_t) with corresponding saturation into the range 0 to 255 then you can use "VQMOVUN.s16". Alternatively, you could choose to simultaneously perform an optional shift by using "VQSHRUN.s16".
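
To make the saturation concrete, here is a minimal per-lane sketch in portable C of what those two instructions compute. The function names `qmovun_s16` and `qshrun_s16` are hypothetical, chosen for this example; the real instructions apply the same operation to every lane of the vector at once.

```c
#include <stdint.h>

/* Scalar model of one lane of VQMOVUN.S16:
   signed 16-bit input, unsigned 8-bit output saturated to 0..255. */
uint8_t qmovun_s16(int16_t x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

/* Scalar model of one lane of VQSHRUN.S16 with an immediate shift:
   truncating right shift, then the same saturation to 0..255.
   For 16-bit inputs the instruction accepts shifts of 1..8. */
uint8_t qshrun_s16(int16_t x, unsigned shift)
{
    /* A negative value stays negative after an arithmetic right shift,
       so it always saturates to zero. Returning early also avoids
       right-shifting a negative value in C. */
    if (x < 0) return 0;

    int shifted = x >> shift;
    return (shifted > 255) ? 255 : (uint8_t)shifted;
}
```

For instance, `qmovun_s16(300)` saturates to 255, and `qshrun_s16(400, 2)` yields 100.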

    hth
    s.
  • Note: This was originally posted on 17th July 2012 at http://forums.arm.com

    A maxVal of 255 may well map onto a single instruction.

    hth
    s.
  • Note: This was originally posted on 17th July 2012 at http://forums.arm.com

    ...worked after using volatile before all variables...

    This probably isn't the correct solution (I suspect you just needed to tell GCC that your assembly code modified variables in memory).

    this "clipping" takes up almost half the time of execution of the loop

    The likely problem here is that you are moving values back and forth between the Neon and main register file. On most implementations there are significant performance penalties for doing so, so this should be avoided wherever possible.

    In this particular case you can perform the clipping using Neon instructions (if you are lucky with your choice of maxVal, you may be able to convert the previous shift and clipping to a single VQSHRUN instruction). For example:

      // Move constants of zero and maxVal into Neon registers
      VMOV.I16 d0,#0
      VMOV.I16 d1,#maxVal
      ...
      // Perform clipping
      VMAX.S16 d4,d4,d0  // Choose largest of zero and value
      VMIN.S16 d4,d4,d1  // Choose smallest of new value and maxVal
      ...
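
For checking results against the assembly above, a scalar reference in portable C can be handy. This is only a sketch: the function name `clip_s16` and its parameters are hypothetical, and it processes one element per iteration where the NEON version handles four 16-bit lanes per D register.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the VMAX.S16 / VMIN.S16 pair:
   clamps every element of v into the range 0..max_val. */
void clip_s16(int16_t *v, size_t n, int16_t max_val)
{
    for (size_t i = 0; i < n; ++i) {
        int16_t x = v[i];
        x = (x > 0) ? x : 0;              /* VMAX.S16 with the zero constant   */
        x = (x < max_val) ? x : max_val;  /* VMIN.S16 with the maxVal constant */
        v[i] = x;
    }
}
```

Clipping {-7, 0, 100, 4000} with a max_val of 1023 gives {0, 0, 100, 1023}.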


    hth
    s.
  • Note: This was originally posted on 16th July 2012 at http://forums.arm.com

    It is reasonably unlikely that the compiler is doing something it shouldn't be at "-O3"; the more likely scenario is that:

    1. for inline assembly, you aren't describing the required/expected side-effects, so the compiler is [correctly] assuming it can optimize the code out (e.g. if your inline Neon code is writing all its results back to memory, are you declaring memory as having been modified, and/or the assembly as being "volatile"?).

    2. for C code, you are relying on something which the C standard does not require to be invariant, and thus can legally be optimized out (e.g. relying on the values of variables which fall out of scope, or relying on the explicit number of memory accesses for things not declared as volatile).
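
For point 1, a minimal sketch of how the side effects can be declared with GCC extended inline assembly. The function name `neon_kernel_placeholder` is hypothetical, and the assembly template is deliberately left empty so that the snippet builds on any GCC-compatible target; real code would put its NEON instructions there.

```c
#include <stdint.h>

/* Sketch of an inline-assembly wrapper whose (omitted) instructions
   would read from src and write their results to dst. */
void neon_kernel_placeholder(int16_t *dst, const int16_t *src)
{
    __asm__ volatile(
        ""                      /* NEON instructions would go here            */
        :                       /* no register outputs                        */
        : "r"(dst), "r"(src)    /* both pointers are passed in registers      */
        : "memory"              /* memory is accessed behind the compiler's   */
                                /* back, so cached values must be discarded   */
    );
}
```

The "volatile" qualifier stops the compiler from deleting the statement even though it has no register outputs, and the "memory" clobber tells it that memory contents may have changed.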

    hth
    s.
  • Note: This was originally posted on 6th July 2012 at http://forums.arm.com

    12 is a lot of simultaneous registers to ask the compiler for given that it only ever really had 14 to start with after the stack-pointer and program-counter were deducted.
    The additional 2 registers may already be permanently allocated for stack limit, frame pointer, global table pointer or other purposes depending on the platform requirements.

    You really should consider whether what you are doing is appropriate for inline assembly vs either a naked function / standalone assembly file or using the intrinsics.

    hth
    s.
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com

    Are you using "-mfloat-abi=softfp -mfpu=neon" on the GCC command line?
    Also, you really might want to consider using the intrinsics instead of inline asm.

    hth
    s.
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com

    Yes, though an increment of 2 bytes when loading 16-bit values could be implemented as:

      VLD1.16  {D0[0]},[r2]!
      VLD1.16  {D0[1]},[r2]!
      VLD1.16  {D0[2]},[r2]!
      VLD1.16  {D0[3]},[r2]!

    Which in itself would be better implemented as:

      VLD1.16  {D0},[r2]!
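
The equivalence of the two forms can be checked with a small scalar model in portable C. This is a sketch only: `d_reg_u16` and both helper names are hypothetical, and the returned pointer stands in for the post-incremented r2.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a D register as four 16-bit lanes. */
typedef struct {
    uint16_t lane[4];
} d_reg_u16;

/* Models four "VLD1.16 {D0[i]},[r2]!" loads: one lane per load,
   with the address advancing by 2 bytes each time. */
const uint16_t *load_lanes_postinc(d_reg_u16 *d, const uint16_t *r2)
{
    for (unsigned i = 0; i < 4; ++i)
        d->lane[i] = *r2++;
    return r2;
}

/* Models a single "VLD1.16 {D0},[r2]!": all four lanes in one load,
   with the address advancing by 8 bytes. */
const uint16_t *load_vector_postinc(d_reg_u16 *d, const uint16_t *r2)
{
    memcpy(d->lane, r2, sizeof d->lane);
    return r2 + 4;
}
```

Both helpers fill the lanes with the same four values and leave the pointer 8 bytes further on.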

    hth
    s.
  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    "VLD1.16 {D0[0]},[r2]" will load the bottom 16-bits of D0 (which are the same as the bottom 16-bits of S0).

    hth
    s.