This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Neon suitable for max() on char[]?

  • Note: This was originally posted on 9th September 2011 at http://forums.arm.com

    It turns out you can simply use vcgt with vbit to do the "branching part":


    // Compare the 16-byte maximum with the global maximum
    vcgt.u8 d3, d0, d2 // d3[:] = (d0[:] > d2[:]) ?0xff :0x00
    // Update the global maximum if the 16-byte maximum is bigger
    vbit d2, d0, d3  // d2[:] = (d3[:] == 0xff) ?d0[:] :d2[:]

    where:
    • d0 holds the maximum value in all lanes (from the vpmax earlier),
    • d3 holds a "greater than" flag set to all-one when d0 > d2,
    • d2 holds the "current maximum" value (in all lanes).
    Turns out to be a whole lot faster than the reference C code, even with a small section of arm code following the 16-byte/iteration neon loop that determines the precise position of the maximum in the 16-byte read buffer.

    jpap
  • Note: This was originally posted on 9th September 2011 at http://forums.arm.com

    I'm a bit tight on time at the moment to provide a timing analysis, though it would make for a nice blog post or a good exercise for the reader. ;-)  I'll post back here when I've got further results to share.

    You mentioned some "other approaches"; if you've got any references I might also be able to compare to those.

    jpap
  • Note: This was originally posted on 9th September 2011 at http://forums.arm.com

    Could you the post the code and/or performance numbers (preferably in cycles/byte)? I'm curious how it'd compare with some other approaches.
  • Note: This was originally posted on 9th September 2011 at http://forums.arm.com


    I'm a bit tight on time at the moment to provide a timing analysis, though it would make for a nice blog post or a good exercise for the reader. ;-)  I'll post back here when I've got further results to share.

    You mentioned some "other approaches"; if you've got any references I might also be able to compare to those.

    jpap


    Exercise for the reader sounds fun.

    I don't have any references (and I haven't tested anything yet) but I'm basically thinking to slice up the array into 16 interleaved arrays, and find the max byte + indexes of each slice vertically instead of pair-wise. You would store a single byte for each index, which well let you work over 256 * 16 iterations or 4KB of data. Then you would perform horizontal maximums/comparisons to find the maximum over 4KB.  You'd have to expand out the indexes in order to recover the lower 4 bits, which are implied by their position in the vector. For arrays over 4KB you'd split it into up to 4KB chunks and combine the maximums afterwards.

    It might make sense to perform some multiple of 16 per iteration instead of 16 to hide latency if necessary (from what I can see you can probably get away with not doing it), or to make the loop space greater than 4KB. You can actually skip computing the maxima altogether if you just keep around the indexes, and when you're done load up the values from the indexes. But this is more cumbersome because of moving from NEON registers to ARM to perform the loads, potentially 16 (or more if you go wider). This wouldn't be a clear win for small arrays, and for large enough arrays you'd be bandwidth limited so the extra vmax instruction per iteration won't hurt you.

    The reason why this is preferable over the pair-wise max version is that those instructions form a serial dependency chain so you eat a lot in latency, and because they get less efficient with each succession. It'd be great if NEON had a horizontal max instruction, but unfortunately it doesn't (and that wouldn't help you with finding the index anyway).

    Inner loop would look something like this:


    @ q0: index_replace_mask
    @ q1: data
    @ q2: data_max
    @ q3: indexes
    @ q4: c_0x01
    @ q5: indexes_max

    @ r0: byte_data
    @ r1: count

    0:

    vcgt.u8 q0, q1, q2
    pld [ r0, #256 ]

    vmax.u8 q2, q1, q2
    vld1.u8 { q1 }, [ r0 ]

    vbit.u8 q5, q3, q0
    subs r1, r1, #1

    vadd.u8 q3, q3, q4
    bne 0b