// Compare the 16-byte maximum with the global maximum
vcgt.u8 d3, d0, d2    // d3[:] = (d0[:] > d2[:]) ? 0xff : 0x00
// Update the global maximum if the 16-byte maximum is bigger
vbit d2, d0, d3       // d2[:] = (d3[:] == 0xff) ? d0[:] : d2[:]

I'm a bit tight on time at the moment to provide a timing analysis, though it would make for a nice blog post or a good exercise for the reader. ;-) I'll post back here when I've got further results to share.

You mentioned some "other approaches"; if you've got any references, I might also be able to compare to those.

jpap

@ q0: index_replace_mask
@ q1: data
@ q2: data_max
@ q3: indexes
@ q4: c_0x01
@ q5: indexes_max
@ r0: byte_data
@ r1: count
0:
    vcgt.u8 q0, q1, q2
    pld     [ r0, #256 ]
    vmax.u8 q2, q1, q2
    vld1.u8 { q1 }, [ r0 ]
    vbit.u8 q5, q3, q0
    subs    r1, r1, #1
    vadd.u8 q3, q3, q4
    bne     0b
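
For anyone who wants to check the logic of the loop above without NEON hardware, here is a rough scalar model in C. It is only a sketch under a few assumptions of mine: the function name `argmax_u8_lanes` is made up, the block index is assumed to start at 0, the pointer is assumed to advance by 16 bytes per iteration, and the final cross-lane reduction (which the loop above does not show) is my own guess at one way to finish. It also ignores the software pipelining, where the compare works on the block loaded in the previous iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* one q register holds 16 byte lanes */

/* Hypothetical helper: position of the first maximum byte in data[0..count).
 * Assumes count is a nonzero multiple of 16 and at most 256 blocks,
 * because the per-lane block index is only 8 bits wide. */
size_t argmax_u8_lanes(const uint8_t *data, size_t count)
{
    assert(count > 0 && count % LANES == 0 && count / LANES <= 256);

    uint8_t data_max[LANES] = {0};    /* q2: running maximum per lane */
    uint8_t indexes_max[LANES] = {0}; /* q5: block index of that maximum */
    uint8_t index = 0;                /* q3: current block index (assumed to start at 0) */

    for (size_t block = 0; block < count / LANES; block++) {
        const uint8_t *d = data + block * LANES; /* q1: current 16 bytes */
        for (int lane = 0; lane < LANES; lane++) {
            /* vcgt.u8: all-ones mask where the new byte beats the maximum */
            uint8_t mask = (d[lane] > data_max[lane]) ? 0xff : 0x00;
            /* vmax.u8: update the running maximum */
            if (d[lane] > data_max[lane])
                data_max[lane] = d[lane];
            /* vbit: take the new index where the mask is set, else keep the old one */
            indexes_max[lane] =
                (uint8_t)((index & mask) | (indexes_max[lane] & ~mask));
        }
        index++; /* vadd.u8 with c_0x01 */
    }

    /* Cross-lane reduction (not part of the loop above): pick the largest
     * value, breaking ties in favour of the earliest byte position. */
    uint8_t best = data_max[0];
    size_t best_pos = (size_t)indexes_max[0] * LANES;
    for (int lane = 1; lane < LANES; lane++) {
        size_t pos = (size_t)indexes_max[lane] * LANES + (size_t)lane;
        if (data_max[lane] > best ||
            (data_max[lane] == best && pos < best_pos)) {
            best = data_max[lane];
            best_pos = pos;
        }
    }
    return best_pos;
}
```

Because the compare is strict, each lane keeps the first block where its maximum occurred, so the reduction only has to break ties between lanes. The 8-bit index is also why this model caps the input at 256 blocks (4096 bytes); longer inputs would need a wider index or an outer loop.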