I have been spent about 2 months trying to find a faster way of multiplying 2 32-bit numbers giving a 64-bit result. It is truly driving me mad because it FEELS like their is a faster solution. I should add that this example is for a 100% assembly-language product so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32-bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but once again, I have a suspicion that the extra instructions of Tv6 may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.
lsrs r3,r0,#16 //Factor0 hi [16:31] uxth r0,r0 //Factor0 lo [0:15] uxth r2,r1 //Factor1 lo [0:15] lsrs r1,r1,#16 //Factor1 hi [16:31]
mov r4,r0 //Factor0 lo * Factor1 lo muls r4,r2 //
muls r0,r1 //Factor0 lo * Factor1 hi
muls r2,r3 //Factor1 lo * Factor0 hi
muls r1,r3 //Factor1 hi * Factor0 hi
adds r0,r2 //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)
mov r2,#0 // adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000) lsls r2,r2,#16 //
lsrs r3,r0,#16 //Extract partial result [bits 16-31]
lsls r0,r0,#16 //Extract partial result [bits 0-15]
adds r0,r4 //Result [bits 0-31] + C adcs r2,r3 //Partial [bits 16-47] adds r1,r2 //Results [bit 32-63]As you can see, dealing with C & shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books and he recommended that I place it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to setup 32-bit values in 32-bits with no literal pool. Here are a couple of examples:MOV Rd, #<immediate>LSLS Rd,RdandMOV Rd,#<immediate>RORS Rd,RdSo if someone is working on GNU for the Cortex M0/M0+/M1 then maybe it's worth the time to work out the 'magic numbers''?My last point regards the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs [7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single resister but I have yet to find an example.Many thanks for your time,SeanPS on the plus side, since interrupts have a separates set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful and if your system doesn't have DMA then it's a very efficient way of moving data.
Sean Dunlevy said:I have written a faster CLZ (count leading zeros) by placing the lookup table inside the first 255 elements of ROM (zero-page like)
Didn't you say you use dbruijn algo for CLZ? I see no benefit of using the ZP.
This is what I use (of course, muls must be single-cycle, else classic CLZ is quicker):
negs r1, r0 // 1 mask least set bit (and test for 0) ands r0, r1 // 1 ldr r1,so_dbruijn // 2 muls r0,r1 // 1 lsrs r0,r0,#27 // 1 adr r1,so_table // 1 ldrb r1,[r1,r0] // 2
....
so_dbruijn: LONG 0x077CB531so_table: BYTE 0,8,28,2,29,14,24,3,30,22,20,15,25,17,4,8,31 BYTE 27,13,23,21,19,16, 7,26,12,18,6,11,5,10,9
Total 9 cycles.
Dear Bastian, I stopped writing computer games in 2003 so all I can say is wow. Thumb is a more rewarding instruction set than I had previously believed. I cannot remember if I mentioned it but I use the bottom 256-elements (8,16 and 32 bit) of ROM ($00000000) as a 'zero page' since the address can be set up in a single cycle.I don't suppose you know of a way to find the top 32-bits of the result (used by MULSHIFT32) because my best effort was only 1 instruction less than the standard 32-bit x 32-bit ---> 64-bit. I suppose nobody has considered ADPCM on an M0+ core but I'm hoping it is of utility in the MBed OS.Many thanks,Sean
This is taken from 'Hacker's Delight' 8-2 'High-Order Half of 64-Bit Product'int mulhs(int u, int v) {unsigned u0, v0, w0;int u1, v1, w1, w2, t;u0 = u & 0xFFFF; u1 = u >> 16;v0 = v & 0xFFFF; v1 = v >> 16;w0 = u0*v0;t = u1*v0 + (w0 >> 16);w1 = t & 0xFFFF;w2 = t >> 16;w1 = u0*v1 + w1;return u1*v1 + w2 + (w1 >> 16);}
Which I believe turns into M0+ assembly language thus:uxth r2,r1 //factor 1 lo movs r3,r2
uxth r4,r0 //factor 0 lo
muls r3,r4 //f1lo * f0lo asrs r0,#16 //factor 1 hi
muls r2,r0 //f1l0 * f0hi
asrs r1,#16 //factor 1 hi
muls r0,r1 //f0hi *f1hi
asrs r3,#16
muls r1,r4 //f0hi * f1hi
adds r3,r2 uxth r2,r3 adds r1,r2 asrs r3,#16 asrs r1,#16
adds r3,r0 adds r0,r1,r3As I noted before, it's only 1 cycle faster than the version producing the full 64-bit product. I have to admit that I think the design of MULS is very clever and I wish their was some way to find the FLOOR using 1 multiply and then work out the remainder to add. It is perplexing indeed. I haven't found a faster one which doesn't mean that their isn't one!
My 32x32=64 version takes also only 17 cycles!
uxth r2,r0 // b lsrs r0,r0,#16 // a lsrs r3,r1,#16 // c uxth r1,r1 // d movs r4,r1 // d muls r1,r2 // bd muls r4,r0 // ad muls r0,r3 // ac muls r3,r2 // bc lsls r2,r4,#16 lsrs r4,r4,#16 adds r1,r2 adcs r0,r4 lsls r2,r3,#16 lsrs r3,r3,#16 adds r1,r2 adcs r0,r3
Your 32-bit x 32-bit -->64 bit is an interesting example of how optimized Thumb is significantly different from other scaler-CPUs. Realizing that the MULS is just 1 cycle is important (unless you have the 32-cycle MULS when other kinds of multiply are faster).Everywhere I look in the C++ files (with bits of 80x86 in-line), MULSHIFT32 confronts me. a 32-bit x 32-bit -->64 bit of which only 32-63 are wanted.I've even toyed with finding bits the MULS that won't fit into 32 bits, shifting down appropriately. Do I need anything below 32 to get answer. You stoked my optimization button and if my 64 kb/s mono MP3 decoder will work on a 40MHz SAMD21G18 then that's energy saving right there! Since the goal is 12 hours from a single (rechargable) AA battery then great.Audiobooks for <$5 so every child (and adult) in the developing world can share books. You don't want to see the mess that the Bluetooh, SD controller & backlit LED as well as MP3 decode makes. In the end - 1 bit of silicon. No money, fame or networking - just free knowledge to all.....
I have only just realised that from v6T the MULS R1,R2 returns the bottom 32-bits of R1 × R2, sets Z & N appropriately but leaves C & V alone. I am intrigued to know what benefits this small change could have. Could it overcome the problem with the use of Karatsuba multiplication?In about half of multiplies used in an MP3 decoder only uses the top 32 bits of the result and although it seems silly, reducing MULSHIFT32 from 17 to 16 cycles would have a large effect in processing.It's this level of detail that makes the impossible just-about-possible.Hacker's Delight offers this C solution that returns the top 32 bits:int mulhs(int u, int v){ unsigned u0, v0, w0; int u1, v1, w1, w2, t; u0 = u & 0xFFFF; u1 = u >> 16; v0 = v & 0xFFFF; v1 = v >> 16; w0 = u0*v0; t = u1*v0 + (w0 >> 16); w1 = t & 0xFFFF; w2 = t >> 16; w1 = u0*v1 + w1; return u1*v1 + w2 + (w1 >> 16);}
When I mentioned saving a cycle on the CLZ, Bastian Schick, I meant that if the lookup table's base address is below 0x00000100, it can be set up as an 8-bit immediate. If all of the IRQs are used, you can only use from 0x000000c0 but I am likely to only need 2 or 3 so I can use from 0x00000050 and with the De Bruijn sequences only needing 32 bytes, I could potentially get several small (but frequently used) LUTs in there. I mean, the actual table can extend from 0x000000ff onwards.
I do appreciate that it's only a tiny difference but if by some magic someone finds a faster MULSHIFT32 but it uses a large table, that would be ideal to save a cycle.