I managed to produce a 32-bit x 32-bit -> 64-bit multiply fragment that takes 18 cycles to complete (it is inlined).
oldmulshift32:
    lsrs r3, r0, #16    // Factor0 hi [16:31]
    uxth r0, r0         // Factor0 lo [0:15]
    uxth r2, r1         // Factor1 lo [0:15]
    lsrs r1, r1, #16    // Factor1…
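
The listing above is cut off, so the following is not a reconstruction of the rest of that routine. It is only a minimal C sketch of the general technique the first four instructions set up: split each factor into 16-bit halves, form the four 16x16 partial products, and add them with the carries propagated by hand. The function name `mul32x32_64` and all variable names are my own, and I am assuming a target whose multiply instruction returns only the low 32 bits of the result.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): 32 x 32 -> 64 multiply
 * built from four 16 x 16 -> 32 partial products. */
static uint64_t mul32x32_64(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu;   /* Factor0 lo [0:15]  */
    uint32_t a_hi = a >> 16;       /* Factor0 hi [16:31] */
    uint32_t b_lo = b & 0xFFFFu;   /* Factor1 lo [0:15]  */
    uint32_t b_hi = b >> 16;       /* Factor1 hi [16:31] */

    /* Each partial product fits in 32 bits because both inputs are < 2^16. */
    uint32_t ll = a_lo * b_lo;     /* weight 2^0  */
    uint32_t lh = a_lo * b_hi;     /* weight 2^16 */
    uint32_t hl = a_hi * b_lo;     /* weight 2^16 */
    uint32_t hh = a_hi * b_hi;     /* weight 2^32 */

    /* Sum everything that lands in bits [16:31]; at most 3 * 0xFFFF,
     * so this cannot overflow 32 bits. Its upper half is the carry. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    return ((uint64_t)hi << 32) | lo;
}
```

Splitting the two middle products into halves before adding them avoids needing a carry flag, which C does not expose. A hand-written assembly version would more likely add the middle terms directly and use add-with-carry, which is presumably where the cycle savings come from.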
