This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Cortex M0/M0+/M1 32-bit x 32-bit --->64-bit signed multiply

I have been spent about 2 months trying to find a faster way of multiplying 2 32-bit numbers giving a 64-bit result. It is truly driving me mad because it FEELS like their is a faster solution. I should add that this example is for a 100% assembly-language product so I do not need to preserve R4-R15. I am also looking for the fastest way to perform a 32-bit x 32-bit multiply that returns the top 32-bits of the result. Again, I can only shave off 1 cycle using the method outlined below. Lastly, I am looking for a faster CLZ (count leading zeros) routine. Jens Bauer has posted a very promising method but once again, I have a suspicion that the extra instructions of Tv6 may offer another route. Those EXTEND and REVERSE instructions LOOK like they may offer a faster method.

lsrs r3,r0,#16 //Factor0 hi [16:31]
uxth r0,r0 //Factor0 lo [0:15]

uxth r2,r1 //Factor1 lo [0:15]
lsrs r1,r1,#16 //Factor1 hi [16:31]

mov r4,r0 //Factor0 lo * Factor1 lo
muls r4,r2 //

muls r0,r1 //Factor0 lo * Factor1 hi

muls r2,r3 //Factor1 lo * Factor0 hi

muls r1,r3 //Factor1 hi * Factor0 hi

adds r0,r2 //(Factor0 lo * factor1 hi) + (Factor1 lo * Factor0 hi)

mov r2,#0 //
adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000)
lsls r2,r2,#16 //

lsrs r3,r0,#16 //Extract partial result [bits 16-31]

lsls r0,r0,#16 //Extract partial result [bits 0-15]

adds r0,r4 //Result [bits 0-31] + C
adcs r2,r3 //Partial [bits 16-47]
adds r1,r2 //Results [bit 32-63]

As you can see, dealing with C & shifting bits 16:47 is the really bad part. I asked Joseph Yiu (the gentleman behind the 'Definitive Guide to the ARM Cortex' series of books and he recommended that I place it within the community. He really is a nice bloke and he was interested in the idea of using a set of 'magic numbers' to setup 32-bit values in 32-bits with no literal pool. Here are a couple of examples:

MOV Rd, #<immediate>
LSLS Rd,Rd

and

MOV Rd,#<immediate>
RORS Rd,Rd

So if someone is working on GNU for the Cortex M0/M0+/M1 then maybe it's worth the time to work out the 'magic numbers''?

My last point regards the shift/rotate instructions. Nobody can explain why the forms of these instructions that shift by another register use Rs [7:0]. NOBODY can explain why this is so. I suppose that the reverse instructions allow 4 different shift values to be stored in a single resister but I have yet to find an example.

Many thanks for your time,
Sean

PS on the plus side, since interrupts have a separates set of registers, it's quite simple to use SP to address memory. It has unique addressing modes and the fantastic LDM instructions. This is so valuable for SWITCH-type statements. Setting up 8 registers in 9 cycles is powerful and if your system doesn't have DMA then it's a very efficient way of moving data.

Top replies

Sean Dunlevy over 6 years ago in reply to 42Bastian Schick +1

Your 32-bit x 32-bit -->64 bit is an interesting example of how optimized Thumb is significantly different from other scaler-CPUs. Realizing that the MULS is just 1 cycle is important (unless you have...

Parents

0 Sean Dunlevy over 6 years ago in reply to 42Bastian Schick

Err..... I am correct in thinking that those last 4 instructions need modding? I have written a faster CLZ (count leading zeros) by placing the lookup table inside the first 255 elements of ROM (zero-page like) but I also need to figure the best Mulshift32 (32-bit x 32-bit --->top 32-bits of 64-bit result).

Basically I'm trying to get 64kb/s mono MP3 & 32 kb/s mono ACELP onto the M0+ and then see how slowly I can clock the CPU.

The idea is to get the cheapest talking book possible so it's cheaper than paper for schools. Bluetooth OTC and/or USB needed but Class 1 NAND RAM is SOOO cheap. I'm hoping that PragmatIC will produce backscatter ROMs so breakfast cereals (for example) can have a short story placed onto a printed ROM so kids collect. Free to swap but the main thing is to make it affordable which means a LONG battery life as much as anything else,

With thanks,
Sean
Cancel
Up 0 Down

Cancel

Reply

0 Sean Dunlevy over 6 years ago in reply to 42Bastian Schick

Err..... I am correct in thinking that those last 4 instructions need modding? I have written a faster CLZ (count leading zeros) by placing the lookup table inside the first 255 elements of ROM (zero-page like) but I also need to figure the best Mulshift32 (32-bit x 32-bit --->top 32-bits of 64-bit result).

Basically I'm trying to get 64kb/s mono MP3 & 32 kb/s mono ACELP onto the M0+ and then see how slowly I can clock the CPU.

The idea is to get the cheapest talking book possible so it's cheaper than paper for schools. Bluetooth OTC and/or USB needed but Class 1 NAND RAM is SOOO cheap. I'm hoping that PragmatIC will produce backscatter ROMs so breakfast cereals (for example) can have a short story placed onto a printed ROM so kids collect. Free to swap but the main thing is to make it affordable which means a LONG battery life as much as anything else,

With thanks,
Sean
Cancel
Up 0 Down

Cancel

Children

+1 42Bastian Schick over 6 years ago in reply to Sean Dunlevy

Sean Dunlevy said:
I have written a faster CLZ (count leading zeros) by placing the lookup table inside the first 255 elements of ROM (zero-page like)

Didn't you say you use dbruijn algo for CLZ? I see no benefit of using the ZP.

This is what I use (of course, muls must be single-cycle, else classic CLZ is quicker):

    negs   r1, r0       // 1 mask least set bit (and test for 0)
   ands   r0, r1       // 1
   ldr   r1,so_dbruijn   // 2
   muls   r0,r1       // 1
   lsrs   r0,r0,#27   // 1
   adr   r1,so_table   // 1
   ldrb   r1,[r1,r0]   // 2

....

so_dbruijn:
   LONG 0x077CB531
so_table:
BYTE 0,8,28,2,29,14,24,3,30,22,20,15,25,17,4,8,31
BYTE 27,13,23,21,19,16, 7,26,12,18,6,11,5,10,9

Total 9 cycles.
Cancel
Up 0 Down

Cancel
0 Sean Dunlevy over 6 years ago in reply to 42Bastian Schick

Dear Bastian,
I stopped writing computer games in 2003 so all I can say is wow. Thumb is a more rewarding instruction set than I had previously believed. I cannot remember if I mentioned it but I use the bottom 256-elements (8,16 and 32 bit) of ROM ($00000000) as a 'zero page' since the address can be set up in a single cycle.

I don't suppose you know of a way to find the top 32-bits of the result (used by MULSHIFT32) because my best effort was only 1 instruction less than the standard 32-bit x 32-bit ---> 64-bit. I suppose nobody has considered ADPCM on an M0+ core but I'm hoping it is of utility in the MBed OS.

Many thanks,
Sean
Cancel
Up 0 Down

Cancel
0 Sean Dunlevy over 6 years ago in reply to 42Bastian Schick

This is taken from 'Hacker's Delight' 8-2 'High-Order Half of 64-Bit Product'

int mulhs(int u, int v) {
unsigned u0, v0, w0;
int u1, v1, w1, w2, t;
u0 = u & 0xFFFF; u1 = u >> 16;
v0 = v & 0xFFFF; v1 = v >> 16;
w0 = u0*v0;
t = u1*v0 + (w0 >> 16);
w1 = t & 0xFFFF;
w2 = t >> 16;
w1 = u0*v1 + w1;
return u1*v1 + w2 + (w1 >> 16);
}

Which I believe turns into M0+ assembly language thus:

uxth r2,r1 //factor 1 lo
movs r3,r2

uxth r4,r0 //factor 0 lo

muls r3,r4 //f1lo * f0lo

asrs r0,#16 //factor 1 hi

muls r2,r0 //f1l0 * f0hi

asrs r1,#16 //factor 1 hi

muls r0,r1 //f0hi *f1hi

asrs r3,#16

muls r1,r4 //f0hi * f1hi

adds r3,r2
uxth r2,r3
adds r1,r2
asrs r3,#16
asrs r1,#16

adds r3,r0
adds r0,r1,r3

As I noted before, it's only 1 cycle faster than the version producing the full 64-bit product. I have to admit that I think the design of MULS is very clever and I wish their was some way to find the FLOOR using 1 multiply and then work out the remainder to add. It is perplexing indeed. I haven't found a faster one which doesn't mean that their isn't one!
Cancel
Up 0 Down

Cancel

0 42Bastian Schick over 6 years ago in reply to Sean Dunlevy

My 32x32=64 version takes also only 17 cycles!

	uxth	r2,r0			// b
	lsrs	r0,r0,#16		// a
	lsrs	r3,r1,#16		// c
	uxth 	r1,r1			// d
	movs	r4,r1			// d

	muls	r1,r2			// bd
	muls	r4,r0			// ad
	muls	r0,r3			// ac
	muls	r3,r2			// bc

	lsls	r2,r4,#16
	lsrs	r4,r4,#16
	adds	r1,r2
	adcs	r0,r4
	lsls	r2,r3,#16
	lsrs	r3,r3,#16
	adds	r1,r2
	adcs	r0,r3

0 Sean Dunlevy over 6 years ago in reply to 42Bastian Schick

Your 32-bit x 32-bit -->64 bit is an interesting example of how optimized Thumb is significantly different from other scaler-CPUs. Realizing that the MULS is just 1 cycle is important (unless you have the 32-cycle MULS when other kinds of multiply are faster).

Everywhere I look in the C++ files (with bits of 80x86 in-line), MULSHIFT32 confronts me. a 32-bit x 32-bit -->64 bit of which only 32-63 are wanted.

I've even toyed with finding bits the MULS that won't fit into 32 bits, shifting down appropriately. Do I need anything below 32 to get answer. You stoked my optimization button and if my 64 kb/s mono MP3 decoder will work on a 40MHz SAMD21G18 then that's energy saving right there! Since the goal is 12 hours from a single (rechargable) AA battery then great.

Audiobooks for <$5 so every child (and adult) in the developing world can share books. You don't want to see the mess that the Bluetooh, SD controller & backlit LED as well as MP3 decode makes. In the end - 1 bit of silicon. No money, fame or networking - just free knowledge to all.....
Cancel
Up +1 Down

Cancel
0 Sean Dunlevy over 4 years ago in reply to Sean Dunlevy

I have only just realised that from v6T the MULS R1,R2 returns the bottom 32-bits of R1 × R2, sets Z & N appropriately but leaves C & V alone. I am intrigued to know what benefits this small change could have. Could it overcome the problem with the use of Karatsuba multiplication?

In about half of multiplies used in an MP3 decoder only uses the top 32 bits of the result and although it seems silly, reducing MULSHIFT32 from 17 to 16 cycles would have a large effect in processing.

It's this level of detail that makes the impossible just-about-possible.

Hacker's Delight offers this C solution that returns the top 32 bits:

int mulhs(int u, int v)
{
unsigned u0, v0, w0;
int u1, v1, w1, w2, t;

u0 = u & 0xFFFF;
u1 = u >> 16;
v0 = v & 0xFFFF;
v1 = v >> 16;
w0 = u0*v0;
t = u1*v0 + (w0 >> 16);
w1 = t & 0xFFFF;
w2 = t >> 16;
w1 = u0*v1 + w1;
return u1*v1 + w2 + (w1 >> 16);
}
Cancel
Up 0 Down

Cancel
0 Sean Dunlevy over 3 years ago in reply to Sean Dunlevy

When I mentioned saving a cycle on the CLZ, Bastian Schick, I meant that if the lookup table's base address is below 0x00000100, it can be set up as an 8-bit immediate. If all of the IRQs are used, you can only use from 0x000000c0 but I am likely to only need 2 or 3 so I can use from 0x00000050 and with the De Bruijn sequences only needing 32 bytes, I could potentially get several small (but frequently used) LUTs in there. I mean, the actual table can extend from 0x000000ff onwards.

I do appreciate that it's only a tiny difference but if by some magic someone finds a faster MULSHIFT32 but it uses a large table, that would be ideal to save a cycle.
Cancel
Up 0 Down

Cancel