I have been spent about 2 months trying to find a faster way of multiplying 2 32-bit numbers giving a 64-bit result. It is truly driving me mad because it FEELS like their is a faster solution. I should add that this example is for a 100% assembly-language product so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32-bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but once again, I have a suspicion that the extra instructions of Tv6 may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.
lsrs r3,r0,#16 //Factor0 hi [16:31] uxth r0,r0 //Factor0 lo [0:15] uxth r2,r1 //Factor1 lo [0:15] lsrs r1,r1,#16 //Factor1 hi [16:31]
mov r4,r0 //Factor0 lo * Factor1 lo muls r4,r2 //
muls r0,r1 //Factor0 lo * Factor1 hi
muls r2,r3 //Factor1 lo * Factor0 hi
muls r1,r3 //Factor1 hi * Factor0 hi
adds r0,r2 //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)
mov r2,#0 // adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000) lsls r2,r2,#16 //
lsrs r3,r0,#16 //Extract partial result [bits 16-31]
lsls r0,r0,#16 //Extract partial result [bits 0-15]
adds r0,r4 //Result [bits 0-31] + C adcs r2,r3 //Partial [bits 16-47] adds r1,r2 //Results [bit 32-63]As you can see, dealing with C & shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books and he recommended that I place it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to setup 32-bit values in 32-bits with no literal pool. Here are a couple of examples:MOV Rd, #<immediate>LSLS Rd,RdandMOV Rd,#<immediate>RORS Rd,RdSo if someone is working on GNU for the Cortex M0/M0+/M1 then maybe it's worth the time to work out the 'magic numbers''?My last point regards the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs [7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single resister but I have yet to find an example.Many thanks for your time,SeanPS on the plus side, since interrupts have a separates set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful and if your system doesn't have DMA then it's a very efficient way of moving data.
I have only just realised that from v6T the MULS R1,R2 returns the bottom 32-bits of R1 × R2, sets Z & N appropriately but leaves C & V alone. I am intrigued to know what benefits this small change could have. Could it overcome the problem with the use of Karatsuba multiplication?In about half of multiplies used in an MP3 decoder only uses the top 32 bits of the result and although it seems silly, reducing MULSHIFT32 from 17 to 16 cycles would have a large effect in processing.It's this level of detail that makes the impossible just-about-possible.Hacker's Delight offers this C solution that returns the top 32 bits:int mulhs(int u, int v){ unsigned u0, v0, w0; int u1, v1, w1, w2, t; u0 = u & 0xFFFF; u1 = u >> 16; v0 = v & 0xFFFF; v1 = v >> 16; w0 = u0*v0; t = u1*v0 + (w0 >> 16); w1 = t & 0xFFFF; w2 = t >> 16; w1 = u0*v1 + w1; return u1*v1 + w2 + (w1 >> 16);}
When I mentioned saving a cycle on the CLZ, Bastian Schick, I meant that if the lookup table's base address is below 0x00000100, it can be set up as an 8-bit immediate. If all of the IRQs are used, you can only use from 0x000000c0 but I am likely to only need 2 or 3 so I can use from 0x00000050 and with the De Bruijn sequences only needing 32 bytes, I could potentially get several small (but frequently used) LUTs in there. I mean, the actual table can extend from 0x000000ff onwards.
I do appreciate that it's only a tiny difference but if by some magic someone finds a faster MULSHIFT32 but it uses a large table, that would be ideal to save a cycle.