I managed to produce an inline 32-bit x 32-bit -> 64-bit multiply code fragment that takes 18 cycles to complete.

oldmulshift32:
    lsrs r3,r0,#16  // Factor0 hi [16:31]
    uxth r0,r0      // Factor0 lo [0:15]
    uxth r2,r1      // Factor1 lo [0:15]
    lsrs r1,r1,#16  // Factor1 hi [16:31]
    movs r4,r0      // Copy Factor0 lo [0:15]
    muls r4,r2      // Factor0 lo * Factor1 lo
    muls r0,r1      // Factor0 lo * Factor1 hi
    muls r2,r3      // Factor1 lo * Factor0 hi
    muls r1,r3      // Factor1 hi * Factor0 hi
    adds r0,r2      // (Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)
    movs r2,#0      //
    adcs r2,r2      // C --> bit 0 (r2 contains $00000000 or $00000001)
    lsls r2,r2,#16  // C --> bit 16 (r2 contains $00000000 or $00010000)
    lsrs r3,r0,#16  // Extract partial result [bits 16-31]
    lsls r0,r0,#16  // Extract partial result [bits 0-15]
    adds r0,r4      // Result [bits 0-31], sets C
    adcs r2,r3      // Partial [bits 16-47] + C
    adds r1,r2      // Result [bits 32-63]
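For anyone following along in a higher-level language, the arithmetic the fragment performs can be sketched in portable C (the function and variable names are my own, chosen to mirror the register usage above):

```c
#include <stdint.h>

/* Sketch of the 18-cycle routine: four 16x16 partial products,
   with the carry out of the cross-product sum recovered explicitly
   (the adcs/lsls sequence in the assembly). */
uint64_t mulshift32_sketch(uint32_t f0, uint32_t f1)
{
    uint32_t f0_lo = f0 & 0xFFFFu, f0_hi = f0 >> 16;
    uint32_t f1_lo = f1 & 0xFFFFu, f1_hi = f1 >> 16;

    uint32_t lo   = f0_lo * f1_lo;   /* r4: bits 0-31 of result  */
    uint32_t mid1 = f0_lo * f1_hi;   /* r0: bits 16-47           */
    uint32_t mid2 = f1_lo * f0_hi;   /* r2: bits 16-47           */
    uint32_t hi   = f0_hi * f1_hi;   /* r1: bits 32-63           */

    uint32_t mid   = mid1 + mid2;            /* adds r0,r2        */
    uint32_t carry = (mid < mid1) ? 1u : 0u; /* the C flag the
                                                code extracts     */

    return ((uint64_t)hi    << 32)   /* high partial             */
         + ((uint64_t)carry << 48)   /* carry lands at bit 48    */
         + ((uint64_t)mid   << 16)   /* cross products           */
         +  (uint64_t)lo;            /* low partial              */
}
```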

Happily, Jens Bauer was able to provide me with an alternative that takes only 17 cycles to execute.

mulshift32:
    uxth r2,r0      // [1] Extract Factor0 lo [0-15]
    lsrs r0,r0,#16  // [1] Extract Factor0 hi [16-31]
    lsrs r3,r1,#16  // [1] Extract Factor1 hi [16-31]
    uxth r1,r1      // [1] Extract Factor1 lo [0-15]
    movs r4,r1      // [1] Copy Factor1 lo [0-15]
    muls r1,r2      // [1] Factor1 lo * Factor0 lo
    muls r4,r0      // [1] Factor1 lo * Factor0 hi
    muls r0,r3      // [1] Factor0 hi * Factor1 hi (before r3 is reused)
    muls r3,r2      // [1] Factor1 hi * Factor0 lo
    lsls r2,r4,#16  // [1] (Factor1 lo * Factor0 hi) << 16
    lsrs r4,r4,#16  // [1] (Factor1 lo * Factor0 hi) >> 16
    adds r1,r2      // [1] (Factor1 lo * Factor0 lo) + ((Factor1 lo * Factor0 hi) << 16)
    adcs r0,r4      // [1] (Factor0 hi * Factor1 hi) + ((Factor1 lo * Factor0 hi) >> 16) + C
    lsls r2,r3,#16  // [1] (Factor1 hi * Factor0 lo) << 16
    lsrs r3,r3,#16  // [1] (Factor1 hi * Factor0 lo) >> 16
    adds r1,r2      // [1] Result lo += (Factor1 hi * Factor0 lo) << 16
    adcs r0,r3      // [1] Result hi += ((Factor1 hi * Factor0 lo) >> 16) + C
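The trick in this version, as I understand it, is that each cross product is split into a shifted-low and shifted-high half and folded in with adds/adcs pairs, so the carry never has to be extracted into a register at all. A rough C equivalent (again with my own names):

```c
#include <stdint.h>

/* Sketch of the 17-cycle variant: each 16x16 cross product is
   split into a (<<16) and (>>16) half; the carry out of the low
   addition is absorbed by the following add-with-carry. */
uint64_t mulshift32_v2_sketch(uint32_t f0, uint32_t f1)
{
    uint32_t f0_lo = f0 & 0xFFFFu, f0_hi = f0 >> 16;
    uint32_t f1_lo = f1 & 0xFFFFu, f1_hi = f1 >> 16;

    uint32_t lo = f1_lo * f0_lo;   /* r1 */
    uint32_t a  = f1_lo * f0_hi;   /* r4 */
    uint32_t hi = f0_hi * f1_hi;   /* r0 */
    uint32_t b  = f1_hi * f0_lo;   /* r3 */

    /* adds r1,(a<<16); adcs r0,(a>>16) */
    uint32_t t = lo + (a << 16);
    hi += (a >> 16) + (t < lo);    /* adcs folds the carry for free */
    lo = t;

    /* adds r1,(b<<16); adcs r0,(b>>16) */
    t = lo + (b << 16);
    hi += (b >> 16) + (t < lo);
    lo = t;

    return ((uint64_t)hi << 32) | lo;
}
```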

My question therefore encompasses any alternative code fragments that take fewer than 17 cycles. You will note that my code uses 3 cycles to move the value of the C flag to bit 16, i.e. the carry out of the cross-product addition. This step seems the most likely candidate for a speed-up, which is why I included it.

Now, recently researchers have found new methods of multiplying very large numbers. The Karatsuba method requires only 3 MULS instructions, but it needs so many extra logic operations that in practice it ends up slower. A further problem is that the quick-reference guide to Thumb does not explain the action of the C flag very well. Now I DO have a solution to this. If bit 6 or 7 is set in the shift/rotate amount, C is set appropriately. Since I got my Cortex-M0+ based SoC I have wondered why the values in bits 6 & 7 were included in the shift/rotate instructions... and this MAY be why. Rotating by 64 or 128 will leave the original value in the register, BUT the N & C flags will be set where appropriate.
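For anyone curious, here is roughly what Karatsuba looks like on 16-bit halves: only three multiplies, but note the extra additions and the 17-bit intermediate sums that need extra headroom on a 32-bit machine. This is a C sketch of the idea (names are mine), not tuned code:

```c
#include <stdint.h>

/* Karatsuba on 16-bit halves: 3 multiplies instead of 4, at the
   cost of extra additions and subtractions. */
uint64_t karatsuba32(uint32_t f0, uint32_t f1)
{
    uint32_t x1 = f0 >> 16, x0 = f0 & 0xFFFFu;
    uint32_t y1 = f1 >> 16, y0 = f1 & 0xFFFFu;

    uint32_t z2 = x1 * y1;   /* high partial, mul #1 */
    uint32_t z0 = x0 * y0;   /* low partial,  mul #2 */

    /* (x1 + x0) and (y1 + y0) can each be 17 bits wide, so mul #3
       needs 64-bit headroom here - on a Cortex-M0+ handling that
       overflow costs the extra logic operations mentioned above. */
    uint64_t z1 = (uint64_t)(x1 + x0) * (y1 + y0) - z2 - z0;

    return ((uint64_t)z2 << 32) + (z1 << 16) + z0;
}
```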

I have concluded that Sophie Wilson is a genius... but isn't good at communicating the underlying logic. I think she assumes that we are all as bright as she is, and since it's 'obvious', she does not mention it.

Many thanks.

X

PS I'm aiming to write the code in Thumb4 or earlier so that it is compatible with the maximum number of devices. It certainly IS a large project, but both the Chinese & Indian governments have shown great interest, and with ARM helping me along, I will give it my best shot.