I have been spent about 2 months trying to find a faster way of multiplying 2 32-bit numbers giving a 64-bit result. It is truly driving me mad because it FEELS like their is a faster solution. I should add that this example is for a 100% assembly-language product so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32-bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but once again, I have a suspicion that the extra instructions of Tv6 may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.
lsrs r3,r0,#16 //Factor0 hi [16:31] uxth r0,r0 //Factor0 lo [0:15] uxth r2,r1 //Factor1 lo [0:15] lsrs r1,r1,#16 //Factor1 hi [16:31]
mov r4,r0 //Factor0 lo * Factor1 lo muls r4,r2 //
muls r0,r1 //Factor0 lo * Factor1 hi
muls r2,r3 //Factor1 lo * Factor0 hi
muls r1,r3 //Factor1 hi * Factor0 hi
adds r0,r2 //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)
mov r2,#0 // adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000) lsls r2,r2,#16 //
lsrs r3,r0,#16 //Extract partial result [bits 16-31]
lsls r0,r0,#16 //Extract partial result [bits 0-15]
adds r0,r4 //Result [bits 0-31] + C adcs r2,r3 //Partial [bits 16-47] adds r1,r2 //Results [bit 32-63]As you can see, dealing with C & shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books and he recommended that I place it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to setup 32-bit values in 32-bits with no literal pool. Here are a couple of examples:MOV Rd, #<immediate>LSLS Rd,RdandMOV Rd,#<immediate>RORS Rd,RdSo if someone is working on GNU for the Cortex M0/M0+/M1 then maybe it's worth the time to work out the 'magic numbers''?My last point regards the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs [7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single resister but I have yet to find an example.Many thanks for your time,SeanPS on the plus side, since interrupts have a separates set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful and if your system doesn't have DMA then it's a very efficient way of moving data.
You can save one cycle (in the carry part):
// ab*cd // ac // ad // bc // bd // ------ mul32x32: mov r12,r4 // for EABI uxth r2,r0 // b lsrs r0,r0,#16 // a lsrs r3,r1,#16 // c uxth r1,r1 // d movs r4,r1 // d muls r1,r2 // bd muls r4,r0 // ad muls r0,r3 // ac muls r3,r2 // bc lsls r2,r4,#16 lsrs r4,r4,#16 adds r1,r2 adcs r0,r4 lsls r2,r3,#16 lsrs r3,r3,#16 adds r1,r2 adcs r0,r3 // for EABI mov r4,r12 bx lr
3 instructions needed for EABI compliance (saving of r4 and bx lr for return)
Err..... I am correct in thinking that those last 4 instructions need modding? I have written a faster CLZ (count leading zeros) by placing the lookup table inside the first 255 elements of ROM (zero-page like) but I also need to figure the best Mulshift32 (32-bit x 32-bit --->top 32-bits of 64-bit result).Basically I'm trying to get 64kb/s mono MP3 & 32 kb/s mono ACELP onto the M0+ and then see how slowly I can clock the CPU.The idea is to get the cheapest talking book possible so it's cheaper than paper for schools. Bluetooth OTC and/or USB needed but Class 1 NAND RAM is SOOO cheap. I'm hoping that PragmatIC will produce backscatter ROMs so breakfast cereals (for example) can have a short story placed onto a printed ROM so kids collect. Free to swap but the main thing is to make it affordable which means a LONG battery life as much as anything else,With thanks,Sean
Sean Dunlevy said:I have written a faster CLZ (count leading zeros) by placing the lookup table inside the first 255 elements of ROM (zero-page like)
Didn't you say you use dbruijn algo for CLZ? I see no benefit of using the ZP.
This is what I use (of course, muls must be single-cycle, else classic CLZ is quicker):
negs r1, r0 // 1 mask least set bit (and test for 0) ands r0, r1 // 1 ldr r1,so_dbruijn // 2 muls r0,r1 // 1 lsrs r0,r0,#27 // 1 adr r1,so_table // 1 ldrb r1,[r1,r0] // 2
....
so_dbruijn: LONG 0x077CB531so_table: BYTE 0,8,28,2,29,14,24,3,30,22,20,15,25,17,4,8,31 BYTE 27,13,23,21,19,16, 7,26,12,18,6,11,5,10,9
Total 9 cycles.
Dear Bastian, I stopped writing computer games in 2003 so all I can say is wow. Thumb is a more rewarding instruction set than I had previously believed. I cannot remember if I mentioned it but I use the bottom 256-elements (8,16 and 32 bit) of ROM ($00000000) as a 'zero page' since the address can be set up in a single cycle.I don't suppose you know of a way to find the top 32-bits of the result (used by MULSHIFT32) because my best effort was only 1 instruction less than the standard 32-bit x 32-bit ---> 64-bit. I suppose nobody has considered ADPCM on an M0+ core but I'm hoping it is of utility in the MBed OS.Many thanks,Sean
View all questions in Cortex-M / M-Profile forum