Neon suitable for max() on char[]?

  • Note: This was originally posted on 9th September 2011 at http://forums.arm.com

    It turns out you can simply use vcgt with vbit to do the "branching part":


    // Compare the 16-byte maximum with the global maximum
    vcgt.u8 d3, d0, d2  // d3[:] = (d0[:] > d2[:]) ? 0xff : 0x00
    // Update the global maximum if the 16-byte maximum is bigger
    vbit d2, d0, d3     // d2[:] = (d3[:] == 0xff) ? d0[:] : d2[:]

    where:
    • d0 holds the maximum value in all lanes (from the vpmax earlier),
    • d3 holds a "greater than" flag set to all-one when d0 > d2,
    • d2 holds the "current maximum" value (in all lanes).
    This turns out to be a whole lot faster than the reference C code, even though the NEON loop (16 bytes per iteration) is followed by a small section of ARM code that determines the precise position of the maximum within the 16-byte read buffer.
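
For anyone who wants to check the select logic off-target, here is a rough scalar model in plain C of what the two instructions above do per lane. It is only a sketch of the lane semantics, not the NEON code itself: the `d_reg` type and the `model_vcgt_u8`, `model_vbit`, `splat` and `update_global_max` names are made up for illustration and are not part of any ARM header.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* a 64-bit D register holds eight u8 lanes */

/* Hypothetical stand-in for a D register (illustration only). */
typedef struct { uint8_t lane[LANES]; } d_reg;

/* Fill every lane with the same value, as d0 and d2 are described above. */
d_reg splat(uint8_t v) {
    d_reg r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = v;
    return r;
}

/* Models vcgt.u8 dst, a, b: a lane becomes 0xff if a > b, else 0x00. */
d_reg model_vcgt_u8(d_reg a, d_reg b) {
    d_reg m;
    for (int i = 0; i < LANES; i++)
        m.lane[i] = (a.lane[i] > b.lane[i]) ? 0xff : 0x00;
    return m;
}

/* Models vbit dst, src, mask: src bits replace dst bits where mask is 1. */
d_reg model_vbit(d_reg dst, d_reg src, d_reg mask) {
    for (int i = 0; i < LANES; i++) {
        uint8_t keep = (uint8_t)(dst.lane[i] & (uint8_t)~mask.lane[i]);
        uint8_t take = (uint8_t)(src.lane[i] & mask.lane[i]);
        dst.lane[i] = (uint8_t)(keep | take);
    }
    return dst;
}

/* One iteration: d2 = global maximum, d0 = block maximum (both splatted). */
d_reg update_global_max(d_reg d2, d_reg d0) {
    d_reg d3 = model_vcgt_u8(d0, d2);   /* "greater than" flag per lane */
    d_reg out = model_vbit(d2, d0, d3); /* branch-free select */
    /* Sanity check: the result is never below either input. */
    for (int i = 0; i < LANES; i++)
        assert(out.lane[i] >= d0.lane[i] && out.lane[i] >= d2.lane[i]);
    return out;
}
```

Calling `update_global_max(splat(10), splat(200))` leaves 200 in every lane, and feeding a smaller block maximum afterwards leaves the global maximum unchanged. Note that in this model the flag in d3 is kept separately, which is presumably what makes it useful for the position-finding code that follows the loop.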

    jpap