I'm a bit tight on time at the moment to provide a timing analysis, though it would make for a nice blog post or a good exercise for the reader. ;-) I'll post back here when I've got further results to share.

You mentioned some "other approaches"; if you've got any references, I might also be able to compare to those.

jpap
@ q0: index_replace_mask
@ q1: data
@ q2: data_max
@ q3: indexes
@ q4: c_0x01
@ q5: indexes_max
@ r0: byte_data
@ r1: count
0:
    vcgt.u8     q0, q1, q2
    pld         [ r0, #256 ]
    vmax.u8     q2, q1, q2
    vld1.u8     { q1 }, [ r0 ]
    vbit.u8     q5, q3, q0
    subs        r1, r1, #1
    vadd.u8     q3, q3, q4
    bne         0b
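For anyone who would rather read C than NEON, here is a rough scalar model of what the loop above appears to compute: each of the 16 byte lanes keeps its own running maximum (data_max) and the block index at which that maximum was last raised (indexes_max). This is only a sketch under a few assumptions that the listing does not show: the registers are taken to start at zero, count is taken to be the number of 16-byte blocks, and the final cross-lane reduction (argmax_u8) is a hypothetical helper, since that step is not part of the posted loop. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* one q register holds 16 unsigned bytes */

/* Scalar model of the loop body. Assumes data_max, indexes_max and the
 * index counter all start at zero (the setup code is not in the listing). */
static void lane_argmax_u8(const uint8_t *byte_data, size_t count,
                           uint8_t data_max[LANES],
                           uint8_t indexes_max[LANES])
{
    /* The per-lane index is 8 bits wide, so it would wrap past 256 blocks. */
    assert(count <= 256);

    for (size_t lane = 0; lane < LANES; lane++) {
        data_max[lane] = 0;
        indexes_max[lane] = 0;
    }

    for (size_t block = 0; block < count; block++) {      /* subs / bne    */
        for (size_t lane = 0; lane < LANES; lane++) {
            uint8_t data = byte_data[block * LANES + lane]; /* vld1.u8     */
            if (data > data_max[lane]) {                    /* vcgt.u8     */
                data_max[lane] = data;                      /* vmax.u8     */
                indexes_max[lane] = (uint8_t)block;         /* vbit.u8     */
            }
        }
        /* vadd.u8 q3, q3, q4 corresponds to block++ here. */
    }
}

/* Hypothetical final step, not shown in the listing: fold the 16 lanes
 * into the byte offset of the first occurrence of the overall maximum. */
static size_t argmax_u8(const uint8_t *byte_data, size_t count)
{
    uint8_t data_max[LANES];
    uint8_t indexes_max[LANES];
    lane_argmax_u8(byte_data, count, data_max, indexes_max);

    size_t best_pos = 0;
    uint8_t best_val = 0;
    int have_best = 0;
    for (size_t lane = 0; lane < LANES; lane++) {
        size_t pos = (size_t)indexes_max[lane] * LANES + lane;
        if (!have_best || data_max[lane] > best_val ||
            (data_max[lane] == best_val && pos < best_pos)) {
            best_val = data_max[lane];
            best_pos = pos;
            have_best = 1;
        }
    }
    return best_pos;
}
```

Because the comparison is a strict greater-than, each lane records the first block in which its maximum appears, so ties inside a lane resolve to the earliest block. The 8-bit index is why the sketch refuses more than 256 blocks; a longer input would need a wider counter or an outer loop.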