
Cortex M0/M0+/M1 32-bit x 32-bit --->64-bit signed multiply

I have spent about 2 months trying to find a faster way of multiplying two 32-bit numbers to give a 64-bit result. It is truly driving me mad because it FEELS like there is a faster solution. I should add that this example is for a 100% assembly-language product, so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32 bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but, once again, I have a suspicion that the extra Thumb instructions of ARMv6-M may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.


lsrs r3,r0,#16                //Factor0 hi [16:31]
uxth r0,r0                      //Factor0 lo [0:15]

uxth r2,r1                      //Factor1 lo [0:15]
lsrs r1,r1,#16                //Factor1 hi [16:31]

mov r4,r0                      //Factor0 lo * Factor1 lo
muls r4,r2                     //

muls r0,r1                     //Factor0 lo * Factor1 hi

muls r2,r3                     //Factor1 lo * Factor0 hi

muls r1,r3                     //Factor1 hi * Factor0 hi

adds r0,r2                     //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)


movs r2,#0                     //Clear r2 (MOVS #imm leaves C unchanged)
adcs r2,r2                      //C --> bit 16 (r2 contains $00000000 or $00010000)
lsls r2,r2,#16                 //

lsrs r3,r0,#16                 //Extract partial result [bits 16-31]

lsls r0,r0,#16                 //Extract partial result [bits 0-15]

adds r0,r4                     //Result [bits 0-31] + C
adcs r2,r3                     //Partial [bits 16-47]
adds r1,r2                     //Results [bit 32-63]
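
For anyone who wants to check the listing on a host machine, here is a minimal C sketch of the same four-partial-product scheme, with each step annotated with the instruction it mirrors. The function names `umul32x32_model` and `smul32x32_model` are made up for illustration. The listing as written treats both factors as unsigned; the signed wrapper uses the usual high-word correction (subtract one factor from the high word when the other is negative), which is an assumption added here and not part of the listing.

```c
#include <stdint.h>

/* Unsigned 32x32 -> 64 model of the listing: four 16x16 partial
   products, combined with the same carry handling. */
uint64_t umul32x32_model(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16;        /* lsrs r3,r0,#16 */
    uint32_t a_lo = a & 0xFFFFu;    /* uxth r0,r0     */
    uint32_t b_lo = b & 0xFFFFu;    /* uxth r2,r1     */
    uint32_t b_hi = b >> 16;        /* lsrs r1,r1,#16 */

    uint32_t lo_lo = a_lo * b_lo;   /* mov r4,r0 ; muls r4,r2 */
    uint32_t lo_hi = a_lo * b_hi;   /* muls r0,r1 */
    uint32_t hi_lo = b_lo * a_hi;   /* muls r2,r3 */
    uint32_t hi_hi = b_hi * a_hi;   /* muls r1,r3 */

    uint32_t mid = lo_hi + hi_lo;               /* adds r0,r2 */
    uint32_t mid_carry = (mid < lo_hi) ? 1u : 0u; /* C flag out of the add */

    uint32_t upper = mid_carry << 16;  /* movs r2,#0 ; adcs r2,r2 ; lsls r2,r2,#16 */
    uint32_t mid_hi = mid >> 16;       /* lsrs r3,r0,#16 */
    uint32_t lo = mid << 16;           /* lsls r0,r0,#16 */

    lo += lo_lo;                                 /* adds r0,r4 */
    uint32_t lo_carry = (lo < lo_lo) ? 1u : 0u;  /* C flag out of the add */
    upper += mid_hi + lo_carry;                  /* adcs r2,r3 */
    uint32_t hi = hi_hi + upper;                 /* adds r1,r2 */

    return ((uint64_t)hi << 32) | lo;            /* r1:r0 */
}

/* Signed variant (assumption, not in the listing): fix up the high
   word of the unsigned product. The final cast relies on the usual
   two's-complement conversion. */
int64_t smul32x32_model(int32_t a, int32_t b)
{
    uint64_t p = umul32x32_model((uint32_t)a, (uint32_t)b);
    uint32_t hi = (uint32_t)(p >> 32);
    if (a < 0) hi -= (uint32_t)b;
    if (b < 0) hi -= (uint32_t)a;
    return (int64_t)(((uint64_t)hi << 32) | (uint32_t)p);
}
```

Comparing `umul32x32_model(a, b)` against `(uint64_t)a * b` over random inputs is a quick way to confirm that any reordering of the carry handling still produces the same result.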


As you can see, dealing with C and shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books) and he recommended that I post it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to set up 32-bit values in 32 bits of code with no literal pool. Here are a couple of examples:

MOVS Rd,#<immediate>
LSLS Rd,Rd

and

MOVS Rd,#<immediate>
RORS Rd,Rd

So if someone is working on the GNU toolchain for the Cortex-M0/M0+/M1, then maybe it's worth the time to work out the 'magic numbers'?
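
As a rough illustration of how the 'magic numbers' could be worked out, here is a small C sketch that models the two sequences on a host machine and searches the 8-bit immediates for one that yields a given constant. It assumes the architectural behaviour of the register-shift forms: the amount is taken from Rs[7:0], LSLS by 32 or more gives 0, and RORS depends only on the amount modulo 32. The helper names are hypothetical.

```c
#include <stdint.h>

/* LSLS Rd,Rs: shift amount is Rs[7:0]; 32 or more yields 0. */
uint32_t lsl_by_reg(uint32_t value, uint32_t rs)
{
    uint32_t n = rs & 0xFFu;
    return (n >= 32u) ? 0u : (value << n);
}

/* RORS Rd,Rs: amount is Rs[7:0]; the result only depends on it mod 32. */
uint32_t ror_by_reg(uint32_t value, uint32_t rs)
{
    uint32_t n = rs & 0x1Fu;
    return (n == 0u) ? value : ((value >> n) | (value << (32u - n)));
}

/* Smallest imm8 such that MOVS Rd,#imm ; LSLS Rd,Rd leaves target
   in Rd, or -1 if no 8-bit immediate works. */
int find_lsl_magic(uint32_t target)
{
    for (uint32_t imm = 0; imm <= 255u; ++imm) {
        if (lsl_by_reg(imm, imm) == target) {
            return (int)imm;
        }
    }
    return -1;
}

/* Same search for MOVS Rd,#imm ; RORS Rd,Rd. */
int find_ror_magic(uint32_t target)
{
    for (uint32_t imm = 0; imm <= 255u; ++imm) {
        if (ror_by_reg(imm, imm) == target) {
            return (int)imm;
        }
    }
    return -1;
}
```

Since LSLS by 32 or more clears the register, only immediates 0 to 31 give non-zero results in the LSLS form, so the RORS form covers the larger share of the 256 candidates.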

My last point concerns the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs[7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single register, but I have yet to find an example.
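
To make the idea concrete, here is one possible reading of it as a C model: four shift counts are packed one per byte, and REV / REV16 are used to bring the wanted byte down into bits [7:0], where the register-shift form picks it up. This is only a sketch of the speculation above, not a documented rationale for the Rs[7:0] behaviour; the packed value and the function names are invented for the example.

```c
#include <stdint.h>

/* REV: reverse the four bytes of a word. */
uint32_t rev(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* REV16: swap the two bytes inside each halfword. */
uint32_t rev16(uint32_t x)
{
    return ((x & 0x00FF00FFu) << 8) | ((x & 0xFF00FF00u) >> 8);
}

/* LSLS Rd,Rs: shift amount is Rs[7:0]; 32 or more yields 0. */
uint32_t lsl_by_reg(uint32_t value, uint32_t rs)
{
    uint32_t n = rs & 0xFFu;
    return (n >= 32u) ? 0u : (value << n);
}

/* Shift value left by the count stored in byte `index` (0..3) of
   packed, moving that byte into bits [7:0] with REV / REV16 only. */
uint32_t lsl_by_packed(uint32_t value, uint32_t packed, int index)
{
    uint32_t rs = packed;                 /* index 0: byte 0 already in place */
    if (index == 1) rs = rev16(packed);   /* byte 1 -> bits [7:0] */
    if (index == 2) rs = rev16(rev(packed)); /* byte 2 -> bits [7:0] */
    if (index == 3) rs = rev(packed);     /* byte 3 -> bits [7:0] */
    return lsl_by_reg(value, rs);
}
```

With `packed = 0x0C080403`, indices 0 to 3 select shifts of 3, 4, 8 and 12 bits respectively; the upper bytes are simply ignored by the shift because only Rs[7:0] is used.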


Many thanks for your time,
Sean

PS: On the plus side, since interrupts have a separate set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful, and if your system doesn't have DMA then it's a very efficient way of moving data.
