I have been spent about 2 months trying to find a faster way of multiplying 2 32-bit numbers giving a 64-bit result. It is truly driving me mad because it FEELS like their is a faster solution. I should add that this example is for a 100% assembly-language product so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32-bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but once again, I have a suspicion that the extra instructions of Tv6 may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.
lsrs r3,r0,#16 //Factor0 hi [16:31] uxth r0,r0 //Factor0 lo [0:15] uxth r2,r1 //Factor1 lo [0:15] lsrs r1,r1,#16 //Factor1 hi [16:31]
mov r4,r0 //Factor0 lo * Factor1 lo muls r4,r2 //
muls r0,r1 //Factor0 lo * Factor1 hi
muls r2,r3 //Factor1 lo * Factor0 hi
muls r1,r3 //Factor1 hi * Factor0 hi
adds r0,r2 //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)
mov r2,#0 // adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000) lsls r2,r2,#16 //
lsrs r3,r0,#16 //Extract partial result [bits 16-31]
lsls r0,r0,#16 //Extract partial result [bits 0-15]
adds r0,r4 //Result [bits 0-31] + C adcs r2,r3 //Partial [bits 16-47] adds r1,r2 //Results [bit 32-63]As you can see, dealing with C & shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books and he recommended that I place it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to setup 32-bit values in 32-bits with no literal pool. Here are a couple of examples:MOV Rd, #<immediate>LSLS Rd,RdandMOV Rd,#<immediate>RORS Rd,RdSo if someone is working on GNU for the Cortex M0/M0+/M1 then maybe it's worth the time to work out the 'magic numbers''?My last point regards the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs [7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single resister but I have yet to find an example.Many thanks for your time,SeanPS on the plus side, since interrupts have a separates set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful and if your system doesn't have DMA then it's a very efficient way of moving data.
Dear Bastian, I stopped writing computer games in 2003 so all I can say is wow. Thumb is a more rewarding instruction set than I had previously believed. I cannot remember if I mentioned it but I use the bottom 256-elements (8,16 and 32 bit) of ROM ($00000000) as a 'zero page' since the address can be set up in a single cycle.I don't suppose you know of a way to find the top 32-bits of the result (used by MULSHIFT32) because my best effort was only 1 instruction less than the standard 32-bit x 32-bit ---> 64-bit. I suppose nobody has considered ADPCM on an M0+ core but I'm hoping it is of utility in the MBed OS.Many thanks,Sean