We are running a survey to help us improve the experience for all of our members. If you see the survey appear, please take the time to tell us about your experience if you can.
I have been spent about 2 months trying to find a faster way of multiplying 2 32-bit numbers giving a 64-bit result. It is truly driving me mad because it FEELS like their is a faster solution. I should add that this example is for a 100% assembly-language product so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32-bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but once again, I have a suspicion that the extra instructions of Tv6 may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.
lsrs r3,r0,#16 //Factor0 hi [16:31] uxth r0,r0 //Factor0 lo [0:15] uxth r2,r1 //Factor1 lo [0:15] lsrs r1,r1,#16 //Factor1 hi [16:31]
mov r4,r0 //Factor0 lo * Factor1 lo muls r4,r2 //
muls r0,r1 //Factor0 lo * Factor1 hi
muls r2,r3 //Factor1 lo * Factor0 hi
muls r1,r3 //Factor1 hi * Factor0 hi
adds r0,r2 //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)
mov r2,#0 // adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000) lsls r2,r2,#16 //
lsrs r3,r0,#16 //Extract partial result [bits 16-31]
lsls r0,r0,#16 //Extract partial result [bits 0-15]
adds r0,r4 //Result [bits 0-31] + C adcs r2,r3 //Partial [bits 16-47] adds r1,r2 //Results [bit 32-63]As you can see, dealing with C & shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books and he recommended that I place it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to setup 32-bit values in 32-bits with no literal pool. Here are a couple of examples:MOV Rd, #<immediate>LSLS Rd,RdandMOV Rd,#<immediate>RORS Rd,RdSo if someone is working on GNU for the Cortex M0/M0+/M1 then maybe it's worth the time to work out the 'magic numbers''?My last point regards the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs [7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single resister but I have yet to find an example.Many thanks for your time,SeanPS on the plus side, since interrupts have a separates set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful and if your system doesn't have DMA then it's a very efficient way of moving data.
Dear Bastian, I stopped writing computer games in 2003 so all I can say is wow. Thumb is a more rewarding instruction set than I had previously believed. I cannot remember if I mentioned it but I use the bottom 256-elements (8,16 and 32 bit) of ROM ($00000000) as a 'zero page' since the address can be set up in a single cycle.I don't suppose you know of a way to find the top 32-bits of the result (used by MULSHIFT32) because my best effort was only 1 instruction less than the standard 32-bit x 32-bit ---> 64-bit. I suppose nobody has considered ADPCM on an M0+ core but I'm hoping it is of utility in the MBed OS.Many thanks,Sean