vmovl.u8 Qn, Dn     @ convert bytes to halfwords
vmul.u16 Qn, Qn, Qx @ Qx contains 8 x 257 (0x0101 in every halfword lane)
You might get away without the multiply, and use a scratch register rather than one holding the constant, by shifting the widened value left and ORing it with the unshifted version. That might be faster (untested theory), but it is technically one instruction longer, so it depends on the pipeline: shift and OR should be "simple" operations compared with a MUL, but YMMV.
vmovl.u8 q1, Dn
vshll.u8 q2, Dn, #8
vorr     q1, q1, q2
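To sanity-check the arithmetic behind these variants off-target, here is a minimal scalar sketch in portable C. It models a single 16-bit lane only, not the NEON instructions themselves. The function names (`dup_mul`, `dup_shift_or`, `dup_vsli`, `count_mismatches`, `self_check`) are made up for illustration, and the `vsli` model assumes source and destination are the same widened register.

```c
#include <assert.h>
#include <stdint.h>

/* One lane after widening a byte and multiplying by 257 (0x0101). */
uint16_t dup_mul(uint8_t b)
{
    return (uint16_t)(b * 257u);
}

/* One lane after widening, shifting a second copy left by 8, and ORing. */
uint16_t dup_shift_or(uint8_t b)
{
    uint16_t lo = b;                    /* widened value: 0x00bb */
    uint16_t hi = (uint16_t)(b << 8);   /* shifted copy:  0xbb00 */
    return (uint16_t)(hi | lo);
}

/* One lane of shift-left-and-insert by 8 with the widened value as both
 * source and destination: the low 8 bits of the destination are kept. */
uint16_t dup_vsli(uint8_t b)
{
    uint16_t d = b;
    return (uint16_t)((uint16_t)(d << 8) | (d & 0x00FFu));
}

/* Counts byte values for which the three models disagree. */
int count_mismatches(void)
{
    int bad = 0;
    for (unsigned v = 0; v < 256; v++) {
        uint8_t b = (uint8_t)v;
        if (dup_mul(b) != dup_shift_or(b) || dup_mul(b) != dup_vsli(b))
            bad++;
    }
    return bad;
}

/* Aborts if any of the 256 byte values disagrees. */
void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

This only shows that the three formulations agree on the result for every byte value. It says nothing about which one is faster on a given core.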
vmovl.u8 q1, Dn
vsli.u16 q1, q1, #8
vmov d1, d0
vext.8 d1, d0, d0, #0
// d0 = [ a b c d e f g h ]
vmov d1, d0
vzip.8 d0, d1
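For anyone who wants to see what the copy-then-zip sequence produces, here is a rough scalar model in portable C. It assumes the usual interleave semantics of a byte-wise zip on two 8-byte registers (element i of each input lands at positions 2i and 2i+1 of the combined result). The helper names (`vzip8_model`, `duplicate_bytes`, `check_duplicate_bytes`) are hypothetical and not part of any NEON API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a byte-wise zip of two 8-byte registers, in place:
 * afterwards d0 holds the interleaved low halves, d1 the high halves. */
void vzip8_model(uint8_t d0[8], uint8_t d1[8])
{
    uint8_t out[16];
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = d0[i];
        out[2 * i + 1] = d1[i];
    }
    memcpy(d0, out, 8);
    memcpy(d1, out + 8, 8);
}

/* Copy the register, then zip it with its copy: every byte appears twice. */
void duplicate_bytes(const uint8_t in[8], uint8_t out[16])
{
    uint8_t d0[8], d1[8];
    memcpy(d0, in, 8);
    memcpy(d1, d0, 8);        /* models: vmov d1, d0 */
    vzip8_model(d0, d1);      /* models: vzip.8 d0, d1 */
    memcpy(out, d0, 8);
    memcpy(out + 8, d1, 8);
}

/* Checks the example from the comment: [a b c d e f g h] doubled. */
void check_duplicate_bytes(void)
{
    const uint8_t in[8] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
    uint8_t out[16];
    duplicate_bytes(in, out);
    assert(memcmp(out, "aabbccddeeffgghh", 16) == 0);
}
```

In this model d0 ends up as [ a a b b c c d d ] and d1 as [ e e f f g g h h ], so the pair holds the same values as the widened-and-duplicated Q register from the multiply or shift approaches.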
I'm looking for a solution that doesn't use an extra register!