I managed to produce a 32-bit x 32-bit --> 64-bit multiply code fragment that takes 18 cycles to complete (it is in-line).
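For reference, the fragment implements the standard schoolbook decomposition. Writing Factor0 = a_hi*2^16 + a_lo and Factor1 = b_hi*2^16 + b_lo:

    Factor0 * Factor1 = a_lo*b_lo
                      + ((a_lo*b_hi + a_hi*b_lo) << 16)
                      + ((a_hi*b_hi) << 32)

The subtlety is the middle sum: adding the two cross products can carry out of bit 31, and that carry belongs at bit 48 of the 64-bit result (bit 16 of the high word), which is exactly what the movs/adcs/lsls sequence below captures.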
oldmulshift32:
    lsrs r3,r0,#16     // Factor0 hi [16:31]
    uxth r0,r0         // Factor0 lo [0:15]
    uxth r2,r1         // Factor1 lo [0:15]
    lsrs r1,r1,#16     // Factor1 hi [16:31]
    movs r4,r0         // copy Factor0 lo [0:15]
    muls r4,r2         // Factor0 lo * Factor1 lo
    muls r0,r1         // Factor0 lo * Factor1 hi
    muls r2,r3         // Factor1 lo * Factor0 hi
    muls r1,r3         // Factor1 hi * Factor0 hi
    adds r0,r2         // (Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)
    movs r2,#0         // clear r2 (does not touch C)
    adcs r2,r2         // C --> bit 0 (r2 contains $00000000 or $00000001)
    lsls r2,r2,#16     // C --> bit 16 (r2 contains $00000000 or $00010000)
    lsrs r3,r0,#16     // Extract partial result [bits 16-31]
    lsls r0,r0,#16     // Extract partial result [bits 0-15]
    adds r0,r4         // Result [bits 0-31], carry out in C
    adcs r2,r3         // Partial [bits 16-47] + C
    adds r1,r2         // Result [bits 32-63]
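As posted, the fragment is meant to be in-lined, and note that it corrupts r4, which is callee-saved under the AAPCS. If it were instead assembled as a subroutine (with a bx lr appended, an assumption on my part), a minimal calling sketch would be:

    push {r4, lr}      // r4 is callee-saved under the AAPCS
    // r0 = Factor0, r1 = Factor1
    bl oldmulshift32
    // r0 = result bits [0:31], r1 = result bits [32:63]
    pop {r4, pc}       // restore r4 and return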
Happily, Jens Bauer was able to provide me with an alternative that takes only 17 cycles to execute.
mulshift32:
    uxth r2,r0         /* [1] Extract Factor 0 lo [0-15] */
    lsrs r0,r0,#16     /* [1] Extract Factor 0 hi [16-31] */
    lsrs r3,r1,#16     /* [1] Extract Factor 1 hi [16-31] */
    uxth r1,r1         /* [1] Extract Factor 1 lo [0-15] */
    movs r4,r1         /* [1] Copy Factor 1 lo [0-15] */
    muls r1,r2         /* [1] Factor 1 lo x Factor 0 lo */
    muls r4,r0         /* [1] Factor 1 lo x Factor 0 hi */
    muls r3,r2         /* [1] Factor 1 hi x Factor 0 lo */
    muls r0,r3         /* [1] Factor 0 hi x Factor 1 hi */
    lsls r2,r4,#16     /* [1] (Factor 1 lo x Factor 0 hi) << 16 */
    lsrs r4,r4,#16     /* [1] (Factor 1 lo x Factor 0 hi) >> 16 */
    adds r1,r2         /* [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x Factor 0 hi) << 16) */
    adcs r0,r4         /* [1] (Factor 0 hi x Factor 1 hi) + ((Factor 1 lo x Factor 0 hi) >> 16) + C */
    lsls r2,r3,#16     /* [1] (Factor 1 hi x Factor 0 lo) << 16 */
    lsrs r3,r3,#16     /* [1] (Factor 1 hi x Factor 0 lo) >> 16 */
    adds r1,r2         /* [1] low word += (Factor 1 hi x Factor 0 lo) << 16 */
    adcs r0,r3         /* [1] high word += (Factor 1 hi x Factor 0 lo) >> 16, plus C */

My question therefore encompasses any alternative code fragments that take fewer than 17 cycles. You will note that my code uses 3 cycles to move the value of the C flag to bit 16, i.e. a carry. That sequence seems the most likely candidate for speeding up, which is why I pointed it out.

Now, researchers have recently found new methods of multiplying very large numbers. The Karatsuba method only requires 3 MULS instructions, but it requires so many logic operations that in practice they make it slower.

The problem is that the quick-reference guide to Thumb does not explain the action of C very well. Now I DO have a solution to this: if bit 6 or bit 7 is set in the shift/rotate amount, C is set appropriately. Since I got my Cortex-M0+ based SoC I have wondered why the values in bits 6 & 7 were included in the shift/rotate... and this MAY be why. Rotating by 64 or 128 will leave the original value in the register, BUT the N & C flags will be set where appropriate.

I have concluded that Sophie Wilson is a genius... but isn't good at communicating the underlying logic. I think she assumes that we are all as bright as her and, since it's 'obvious', she does not mention it.

Many thanks.

PS: I'm aiming to write the code in Thumb4 or earlier so that it is compatible with the maximum number of devices. It certainly IS a large project, but both the Chinese & Indian governments have shown great interest, and with ARM helping me along, I will give it my best shot.
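For what it's worth, the 3-MULS approach is the Karatsuba identity, which rebuilds the two cross products from one multiply of the half-sums:

    a_lo*b_hi + a_hi*b_lo = (a_lo + a_hi)*(b_lo + b_hi) - a_lo*b_lo - a_hi*b_hi

The catch on a Cortex-M0+ is that (a_lo + a_hi) and (b_lo + b_hi) are 17-bit values, so their product can overflow the 32-bit MULS result and needs correction code; that is where the extra logic operations come from.

To illustrate the rotate observation, here is a minimal sketch, assuming the ARMv6-M semantics for register-specified RORS (a shift amount that is a non-zero multiple of 32 leaves the register unchanged and copies bit 31 into C):

    movs r2,#64        // bits [4:0] clear, bits [7:0] non-zero
    rors r0,r2         // r0 unchanged; C = bit 31 of r0; N & Z updated from r0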
So swap the order of lines 6 <--> 7 (muls r3,r2 and muls r0,r3)?
    lsls r2,r3,#16     /* [1] (Factor 1 hi x Factor 0 lo) << 16 */
    lsrs r3,r3,#16     /* [1] (Factor 1 hi x Factor 0 lo) >> 16 */
    adds r1,r2         /* [1] low word += (Factor 1 hi x Factor 0 lo) << 16 */
    adcs r0,r3         /* [1] high word += (Factor 1 hi x Factor 0 lo) >> 16, plus C */
I took your code as-is. I have found a few really useful tricks for Thumb programming, which I will post to you when complete. Running code with the PSP set up means that when an exception/interrupt occurs, the CPU swaps to the MSP. At that point, stack the registers which will be corrupted, do the work, unstack and return...
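For what it's worth, a minimal sketch of that pattern, assuming a Cortex-M0+ where exception entry automatically stacks r0-r3, r12, lr, pc and xPSR (onto the PSP in this setup, while the handler runs on the MSP); the handler name is hypothetical:

MyHandler:
    push {r4-r7, lr}   // save the callee-saved registers the handler will corrupt
                       // (lr holds EXC_RETURN at this point)
    // ... do the work ...
    pop {r4-r7, pc}    // unstack and return via EXC_RETURN

Only the registers the handler itself corrupts need stacking; the caller-saved set is already on the stack.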
Sean Dunlevy said: So swap the order of lines 6 <--> 7 (muls r3,r2 and muls r0,r3)?
Yes!

    muls r3,r2 => r3 = r3*r2
    muls r0,r3 => r0 = r0*r3*r2
Guess that's not what you want.
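With those two instructions swapped, r3 still holds Factor 1 hi when it is needed:

    muls r0,r3      /* [1] Factor 0 hi x Factor 1 hi (r3 still intact) */
    muls r3,r2      /* [1] Factor 1 hi x Factor 0 lo */

The cycle count is unchanged; only the ordering matters.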