This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

32-bit x 32-bit --->64-bit multiply

I managed to produce a 32-bit x 32-bit -->64 bit code fragment that took 18 cycles to complete (it is in-line).

oldmulshift32:
lsrs r3,r0,#16 //Factor0 hi [16:31]
uxth r0,r0 //Factor0 lo [0:15]
uxth r2,r1 //Factor1 lo [0:15]
lsrs r1,r1,#16 //Factor1 hi [16:31]

movs r4,r0 //copy Factor0 [0:15]

muls r4,r2 //Factor0 lo * Factor1 lo
muls r0,r1 //Factor0 lo * Factor1 hi
muls r2,r3 //Factor1 lo * Factor0 hi
muls r1,r3 //Factor1 hi * Factor0 hi

adds r0,r2 //(Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)

movs r2,#0 //
adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000)
lsls r2,r2,#16 //

lsrs r3,r0,#16 //Extract partial result [bits 16-31]
lsls r0,r0,#16 //Extract partial result [bits 0-15]

adds r0,r4 //Result [bits 0-31] + C
adcs r2,r3 //Partial [bits 16-47]
adds r1,r2 //Results [bit 32-63]

Happily Jens Bauer was able to provide me with an alternative that takes only 17 cycles to execute.

mulshift32:
uxth r2,r0 /* [1] Extract Factor 0 lo [0-15]
lsrs r0,r0,#16 /* [1] Extract Factor 0 hi [16-31]
lsrs r3,r1,#16 /* [1] Extract Factor 1 hi [31-16]
uxth r1,r1 /* [1] Extract Factor 1 lo [0-15]

movs r4,r1 /* [1] Copy Factor 1 lo [0-15]

muls r1,r2 /* [1] Factor 1 lo x Factor 0 lo
muls r4,r0 /* [1] Factor 1 lo x Factor 0 hi
muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo
muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi

lsls r2,r4,#16 /* [1] (Factor 1 lo x Factor 0 hi) << 16
lsrs r4,r4,#16 /* [1] (Factor 1 lo x Factor 0 hi) >> 16
adds r1,r2 /* [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x factor 0 hi) << 16))
adcs r0,r4 /* [1] (Factor 0 hi x factor 1 hi) + ((factor 1 lo x factor 0 hi) >> 16))

lsls r2,r3,#16 /* [1] (Factor 1 hi x Factor 0 lo) << 16
lsrs r3,r3,#16 // [1] (Factor 1 hi x Factor 0 lo) >> 16
adds r1,r2 // [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x Factor 0 hi) >> 16)
adcs r0,r3 // [1] (Factor 0 hi x Factor 1 hi) + ((Factor 1 hi + Factor 1 lo) >> 16)

My question therefore encompasses any alternative code fragments that take less than 17 cycles. You will note that my code uses 3 cycles to move the value in the C flag to bit 16 i.e. a carry. This bit seems the most likely to be speeded up which is why I included it.

Now, recently researchers have found a new method of multiplying very large algorithms but he Karasbura method only requires 3 MULS instructions but it requires so many logic operations that in practice make it slower. The problem is that the quick-reference guide to Thumb does not explain the action of C very well. Now I DO have a solution to this. If bits 6 or 7 are set in the shift/rotate instruction, C is sett appropriately. Since I got my Cortex M0+ based SOC I have wondered why the values in bits 6 & 7 were included in the shift/rotate.... and this MAY be why. Rotating by 64 or 128 will leave the original value in the register BUT the N & C flags will be set where appropriate.

I have concluded that Sophie Wilson is a genius... but isn't good in communicating the underlying logic. I think she assumes that we are all as bright as her and since it's 'obvious'. she does not mention it.

Many thanks.
X

PS I'm aiming to write the code in Thumb4 or earlier so that it is compatible with the maximum number f devices. It certainly IS a large project but both the Chinese & Indian governments have shown great interest and with ARM helping me along, I will give it my best shot.

Parents

0 42Bastian Schick over 3 years ago

Actually, it was me and there is a bug:

muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo
muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi

=>

muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi
muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo

First version will destroy r3.

Reply

0 42Bastian Schick over 3 years ago

Actually, it was me and there is a bug:

muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo
muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi

=>

muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi
muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo

First version will destroy r3.

Children

+1 Sean Dunlevy over 3 years ago in reply to 42Bastian Schick

So swap order line6<--->7?

mulshift32:
uxth r2,r0 /* [1] Extract Factor 0 lo [0-15]
lsrs r0,r0,#16 /* [1] Extract Factor 0 hi [16-31]
lsrs r3,r1,#16 /* [1] Extract Factor 1 hi [31-16]
uxth r1,r1 /* [1] Extract Factor 1 lo [0-15]

movs r4,r1 /* [1] Copy Factor 1 lo [0-15]

muls r1,r2 /* [1] Factor 1 lo x Factor 0 lo
muls r4,r0 /* [1] Factor 1 lo x Factor 0 hi
muls r3,r2 /* [1] Factor 1 hi x Factor 0 lo
muls r0,r3 /* [1] Factor 0 hi x Factor 1 hi

lsls r2,r4,#16 /* [1] (Factor 1 lo x Factor 0 hi) << 16
lsrs r4,r4,#16 /* [1] (Factor 1 lo x Factor 0 hi) >> 16
adds r1,r2 /* [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x factor 0 hi) << 16))
adcs r0,r4 /* [1] (Factor 0 hi x factor 1 hi) + ((factor 1 lo x factor 0 hi) >> 16))

lsls r2,r3,#16 /* [1] (Factor 1 hi x Factor 0 lo) << 16
lsrs r3,r3,#16 // [1] (Factor 1 hi x Factor 0 lo) >> 16
adds r1,r2 // [1] (Factor 1 lo x Factor 0 lo) + ((Factor 1 lo x Factor 0 hi) >> 16)
adcs r0,r3 // [1] (Factor 0 hi x Factor 1 hi) + ((Factor 1 hi + Factor 1 lo) >> 16)

I took your code as is. I have found a few really useful tricks for Thumb programming which I will post to you when complete. Running code with the PSP setup means when an exception/interrupt occurs, the CPU swaps to use MSP. At that point, stack the registers which are corrupted, do the work, unstack and return...
Cancel
Up 0 Down

Cancel
0 42Bastian Schick over 3 years ago in reply to Sean Dunlevy

Sean Dunlevy said:
So swap order line6<--->7?

Yes!
muls r3,r2 => r3 = r3*r2
muls r0,r3 => r0 = r0*r3*r2

Guess, not what you want.
Cancel
Up 0 Down

Cancel
0 Sean Dunlevy over 3 years ago in reply to 42Bastian Schick

I do apologise Bastian. I cannot think where I got the other (i,e, wrong) name.

I have learned an important rule for the design of an MP3 decoder. Most if not all presume that the DSPs it was originally written for presumes 15 general-purpose registers and a stack pointer. Well, as you know (SP being r15), it's easy to store LR o the stack but then IP has to be stored in RAM. It's still only a few dozen lines of code to ensure routines are using the PSP (rather than MSP) so interrupts/exceptions use MSP. Of course, r0-r7 can be stored in a single instruction and in my case, since the only task the interrupts carries out is to deal with the double-buffering of digital stream.

As you know, I am using a bottom-up development technique because if the DCT, IDCT & Convolution loops end up using 90% of the CPU bandwidth, the project isn't possible. Joseph Yiu seems to think that it IS possible but hinted that it would have to be written in assembly language but since Thumb was developed in 1994, the possible user base is massive. I also discovered that almost all USB sticks use M0/M0+ based CPUs to manage the DMA.

Many thanks,
Sean
Cancel
Up 0 Down

Cancel
0 42Bastian Schick over 3 years ago in reply to Sean Dunlevy

Sean,

how close are you to the 90%? I guess it is closely related to the bit-rate/compression, right? I guess for an audio book 32KBit/s should be enough.

BTW: I was thinking forwards/backwards and in circles, but see no way of either squeezing another cycle out of the mul32x32 routine or having a multiply which delivers only the high-half for the result.
But from the former code you did post, it seems, you are using the full 64bit of the result and do the rounding at the end.

The question is, do you use full 32bit or does the input data have less valid bits? Means: Do you need a general multiply or will a special one work?
Cancel
Up 0 Down

Cancel