I managed to produce a 32-bit x 32-bit -->64 bit code fragment that took 18 cycles to complete (it is in-line).
oldmulshift32: lsrs r3,r0,#16 //Factor0 hi [16:31] uxth r0,r0 //Factor0 lo [0:15] uxth r2,r1 //Factor1 lo [0:15] lsrs r1,r1,#16 //Factor1 hi [16:31]
movs r4,r0 //copy Factor0 [0:15]
muls r4,r2 //Factor0 lo * Factor1 lo muls r0,r1 //Factor0 lo * Factor1 hi muls r2,r3 //Factor1 lo * Factor0 hi muls r1,r3 //Factor1 hi * Factor0 hi
adds r0,r2 //(Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)
movs r2,#0 // adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000) lsls r2,r2,#16 //
lsrs r3,r0,#16 //Extract partial result [bits 16-31] lsls r0,r0,#16 //Extract partial result [bits 0-15]
adds r0,r4 //Result [bits 0-31] + C adcs r2,r3 //Partial [bits 16-47] adds r1,r2 //Results [bit 32-63]
Happily Jens Bauer was able to provide me with an alternative that takes only 17 cycles to execute.
mulshift32: uxth r2,r0 /* [1] Extract Factor 0 lo [0-15] lsrs r0,r0,#16 /* [1] Extract Factor 0 hi [16-31] lsrs r3,r1,#16 /* [1] Extract Factor 1 hi [31-16] uxth r1,r1 /* [1] Extract Factor 1 lo [0-15]
movs r4,r1 /* [1] Copy Factor 1 lo [0-15]
muls r1,r2 /* [1] Factor 1 lo x Factor 0 lo muls r4,r0 /* [1] Factor 1 lo x Factor 0 hi muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi
lsls r2,r4,#16 /* [1] (Factor 1 lo x Factor 0 hi) << 16 lsrs r4,r4,#16 /* [1] (Factor 1 lo x Factor 0 hi) >> 16 adds r1,r2 /* [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x factor 0 hi) << 16)) adcs r0,r4 /* [1] (Factor 0 hi x factor 1 hi) + ((factor 1 lo x factor 0 hi) >> 16))
lsls r2,r3,#16 /* [1] (Factor 1 hi x Factor 0 lo) << 16 lsrs r3,r3,#16 // [1] (Factor 1 hi x Factor 0 lo) >> 16 adds r1,r2 // [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x Factor 0 hi) >> 16) adcs r0,r3 // [1] (Factor 0 hi x Factor 1 hi) + ((Factor 1 hi + Factor 1 lo) >> 16)My question therefore encompasses any alternative code fragments that take less than 17 cycles. You will note that my code uses 3 cycles to move the value in the C flag to bit 16 i.e. a carry. This bit seems the most likely to be speeded up which is why I included it.Now, recently researchers have found a new method of multiplying very large algorithms but he Karasbura method only requires 3 MULS instructions but it requires so many logic operations that in practice make it slower. The problem is that the quick-reference guide to Thumb does not explain the action of C very well. Now I DO have a solution to this. If bits 6 or 7 are set in the shift/rotate instruction, C is sett appropriately. Since I got my Cortex M0+ based SOC I have wondered why the values in bits 6 & 7 were included in the shift/rotate.... and this MAY be why. Rotating by 64 or 128 will leave the original value in the register BUT the N & C flags will be set where appropriate.I have concluded that Sophie Wilson is a genius... but isn't good in communicating the underlying logic. I think she assumes that we are all as bright as her and since it's 'obvious'. she does not mention it.Many thanks.XPS I'm aiming to write the code in Thumb4 or earlier so that it is compatible with the maximum number f devices. It certainly IS a large project but both the Chinese & Indian governments have shown great interest and with ARM helping me along, I will give it my best shot.
Actually, it was me and there is a bug:
muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi => muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo
First version will destroy r3.
So swap order line6<--->7?
lsls r2,r3,#16 /* [1] (Factor 1 hi x Factor 0 lo) << 16 lsrs r3,r3,#16 // [1] (Factor 1 hi x Factor 0 lo) >> 16 adds r1,r2 // [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x Factor 0 hi) >> 16) adcs r0,r3 // [1] (Factor 0 hi x Factor 1 hi) + ((Factor 1 hi + Factor 1 lo) >> 16)
I took your code as is. I have found a few really useful tricks for Thumb programming which I will post to you when complete. Running code with the PSP setup means when an exception/interrupt occurs, the CPU swaps to use MSP. At that point, stack the registers which are corrupted, do the work, unstack and return...
Sean Dunlevy said:So swap order line6<--->7?
Yes!muls r3,r2 => r3 = r3*r2muls r0,r3 => r0 = r0*r3*r2
Guess, not what you want.
I do apologise Bastian. I cannot think where I got the other (i,e, wrong) name.
I have learned an important rule for the design of an MP3 decoder. Most if not all presume that the DSPs it was originally written for presumes 15 general-purpose registers and a stack pointer. Well, as you know (SP being r15), it's easy to store LR o the stack but then IP has to be stored in RAM. It's still only a few dozen lines of code to ensure routines are using the PSP (rather than MSP) so interrupts/exceptions use MSP. Of course, r0-r7 can be stored in a single instruction and in my case, since the only task the interrupts carries out is to deal with the double-buffering of digital stream.As you know, I am using a bottom-up development technique because if the DCT, IDCT & Convolution loops end up using 90% of the CPU bandwidth, the project isn't possible. Joseph Yiu seems to think that it IS possible but hinted that it would have to be written in assembly language but since Thumb was developed in 1994, the possible user base is massive. I also discovered that almost all USB sticks use M0/M0+ based CPUs to manage the DMA.Many thanks,Sean
Sean,
how close are you to the 90%? I guess it is closely related to the bit-rate/compression, right? I guess for an audio book 32KBit/s should be enough.
BTW: I was thinking forwards/backwards and in circles, but see no way of either squeezing another cycle out of the mul32x32 routine or having a multiply which delivers only the high-half for the result.But from the former code you did post, it seems, you are using the full 64bit of the result and do the rounding at the end.
The question is, do you use full 32bit or does the input data have less valid bits? Means: Do you need a general multiply or will a special one work?