MULSHIFT32 in 14 cycles

I have asked just about everyone whether there is a fast way to find the top 32 bits of a 32-bit x 32-bit multiply. There were multiply instructions that returned all 64 bits, but they take 17 or 18 cycles computing bits that are never used, so:

MULSHIFT32:
lsrs r3, r0, #16 //Factor0 hi [16:31]
uxth r0, r0 //Factor0 lo [0:15]
uxth r2, r1 //Factor1 lo [0:15]
lsrs r1, r1, #16 //Factor1 hi [16:31]

muls r0, r1 //Factor0 lo * Factor1 hi
muls r2, r3 //Factor1 lo * Factor0 hi
muls r1, r3 //Factor1 hi * Factor0 hi

adds r0, r2 //(Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)

movs r2, #0 //Clear r2 (MOVS #imm leaves C untouched)
adcs r2, r2 //r2 = 0 + 0 + C
lsls r2, r2, #16 //C --> bit 16 (r2 contains $00000000 or $00010000)

lsrs r3, r0, #16 //Extract partial result [bits 16-31]

adds r2, r3 //Partial [bits 16-47]
adds r1, r2 //Results [bit 32-63]
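
For anyone who wants to check the arithmetic off-target, here is a rough C model of the routine above. The names `mulshift32_ref`, `mulshift32_model` and `within_one` are made up for this sketch; they are not from the post or from Sarah's test. It treats both factors as unsigned, as the uxth/lsrs split does, and it assumes the result is left in r1. One thing it makes visible: the routine never computes lo * lo, so its result can come out one less than the true top word.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: true top 32 bits of the unsigned 64-bit product. */
static uint32_t mulshift32_ref(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Hypothetical C model of the posted Thumb routine, one step per instruction. */
static uint32_t mulshift32_model(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16;        /* lsrs r3, r0, #16 */
    uint32_t a_lo = a & 0xFFFFu;    /* uxth r0, r0      */
    uint32_t b_lo = b & 0xFFFFu;    /* uxth r2, r1      */
    uint32_t b_hi = b >> 16;        /* lsrs r1, r1, #16 */

    uint32_t lh = a_lo * b_hi;      /* muls r0, r1 */
    uint32_t hl = b_lo * a_hi;      /* muls r2, r3 */
    uint32_t hh = b_hi * a_hi;      /* muls r1, r3 */

    uint32_t cross = lh + hl;       /* adds r0, r2 (wraps mod 2^32)  */
    uint32_t carry = cross < lh;    /* the C flag left by that adds  */

    /* movs/adcs/lsls put C in bit 16; lsrs extracts cross[31:16]. */
    uint32_t partial = (carry << 16) + (cross >> 16);
    return hh + partial;            /* adds r1, r2 */
}

/* The model omits a_lo * b_lo, so it may fall one short of the reference.
   Returns that shortfall (0 or 1). */
static int within_one(uint32_t a, uint32_t b)
{
    uint32_t diff = mulshift32_ref(a, b) - mulshift32_model(a, b);
    assert(diff <= 1u);
    return (int)diff;
}
```

For 0xFFFFFFFF * 0xFFFFFFFF the reference gives 0xFFFFFFFE and the model gives 0xFFFFFFFD, so `within_one` returns 1 there. Whether that last bit matters depends on the application.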

Now the problem I have is that I cannot find my copy of the red book (Joseph Yiu's book on programming the M0 & M0+). It currently takes 4 instructions to move C into bit 16 of a register, which looks like it MAY be possible to speed up, so that the two ADDS at the end become a single ADDS Rd, Rn, Rm, since all registers are low.

So, now we are getting somewhere. I should add that my good friend Sarah Avory wrote the logic in C and simply tested it with every possible value to check it was correct. She was also able to save a cycle, which seems tiny by today's standards, but in certain applications MULSHIFT32 is called millions of times a second.

+1 42Bastian Schick over 4 years ago in reply to Sean Dunlevy

I guess ROM/FLASH is the smallest problem ;-)

0 Sean Dunlevy over 4 years ago in reply to 42Bastian Schick

Very annoying that 4 of the 14 instructions are managing C/NB.

Although I get about 3 MIPS on a 256kb/s MP3 stream, it's still a stretch. I will call the code inside NMI so I can use SP (R13) as an extra hi register, but unrolling, while simple, makes BIG code, so it's going to be.... interesting ;-)

0 42Bastian Schick over 4 years ago in reply to Sean Dunlevy

Also :( "RORS" does move the LSB into C, but not C into the MSB.
For this RRX is needed, which does not exist in Armv6-M.

It seems ARM did not foresee that someone would want to do DSP stuff on the CM0 :-)

0 42Bastian Schick over 4 years ago in reply to Sean Dunlevy

You should look into the Armv6-M reference manual. It does not list the immediate version.

Armv7-M RM:

A7.7.116   ROR (immediate)

Encoding T1 ARMv7-M
ROR{S}<c> <Rd>,<Rm>,#<imm5>

A7.7.117   ROR (register)

Encoding T1 All versions of the Thumb instruction set.
RORS <Rdn>,<Rm> Outside IT block.
ROR<c> <Rdn>,<Rm> Inside IT block.

Armv6-M RM:
A6.7.54   ROR (register)

Encoding T1 All versions of the Thumb instruction set.
RORS <Rdn>,<Rm>

0 Sean Dunlevy over 4 years ago in reply to 42Bastian Schick

https://www.keil.com/support/man/docs/armasm/armasm_dom1361289891242.htm

At the moment I am looking at developing MP3 on Raspberry Pi Pico so I can work out how low the clock can go and still complete each frame of audio.

0 42Bastian Schick over 4 years ago in reply to Sean Dunlevy

No, ROR is only with a register :(

So we still need a clever trick to move r0.hi to r0.lo and move the C into r0.hi.

0 Sean Dunlevy over 4 years ago in reply to 42Bastian Schick

MULSHIFT32:
lsrs r3,r0,#16 //Factor0 hi [16:31]
uxth r0,r0 //Factor0 lo [0:15]
uxth r2,r1 //Factor1 lo [0:15]
lsrs r1,r1,#16 //Factor1 hi [16:31]

muls r0,r1 //Factor0 lo * Factor1 hi
muls r2,r3 //Factor1 lo * Factor0 hi
muls r1,r3 //Factor1 hi * Factor0 hi

adds r0,r2 //(Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)

rors r2,r0,#1
lsrs r2,r2,#15

; movs r2,#0 //
; adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000)
; lsls r2,r2,#16 //

; lsrs r3,r0,#16 //Extract partial result [bits 16-31]

adds r2,r3 //Partial [bits 16-47]
adds r1,r2 //Results [bit 32-63]

The online quick reference suggests that the ROR amount must be in a register, but the Keil site states that, like other shifts, it can be an immediate. If so, the above would be a 12-cycle version. Now that all the junk between the ADDS that sets C and the point where that C is needed has been removed, I am wondering if the C can be added to r2 or r3 before the shifts?

So the above is 12 cycles? Can we go faster? I can put up Sarah's C test if you like.
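
Modelling the two-instruction replacement in C shows what it depends on. This is only a sketch with invented function names. It assumes the intent is a rotate by one that pulls C into the top bit (RRX-style, which the replies say Armv6-M does not have), followed by a shift right by 15. Under that assumption the pair gives the same value the four-instruction version builds. With a plain rotate, bit 0 of the sum lands in bit 16 instead and the carry never enters the value.

```c
#include <assert.h>
#include <stdint.h>

/* What the longer sequence builds: C in bit 16, above sum[31:16]. */
static uint32_t merge_wanted(uint32_t sum, unsigned carry)
{
    return ((uint32_t)(carry & 1u) << 16) | (sum >> 16);
}

/* Rotate right by one THROUGH carry, then shift right by 15.
   Hypothetical on Armv6-M: there is no RRX in that instruction set. */
static uint32_t merge_with_rrx(uint32_t sum, unsigned carry)
{
    uint32_t t = (sum >> 1) | ((uint32_t)(carry & 1u) << 31);
    return t >> 15;
}

/* Plain rotate right by one, then shift right by 15.
   Bit 0 of sum ends up in bit 16; the incoming carry is ignored. */
static uint32_t merge_with_ror(uint32_t sum)
{
    uint32_t t = (sum >> 1) | (sum << 31);
    return t >> 15;
}
```

Taking sum = 0xFFFC0002 with C set, `merge_wanted` and `merge_with_rrx` both give 0x1FFFC, while `merge_with_ror` gives 0xFFFC.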

0 Sean Dunlevy over 4 years ago in reply to 42Bastian Schick

She wrote a Lynx game. The PC Engine was the best 6502-based console. Sarah is coding C64 and PS5 at the same time. YES, she is a crazy lady.

0 42Bastian Schick over 4 years ago

Sean Dunlevy said:
Sarah Avory

Ha, she seems to be a 6502 addict :-) Please guide her to the Lynx, she'd love this machine.

0 42Bastian Schick over 4 years ago

Sean Dunlevy said:
movs r2, #0 //
adcs r2, r2 //C --> bit 16 (r2 contains $00000000 or $00010000)
lsls r2, r2, #16 //

lsrs r3, r0, #16 //Extract partial result [bits 16-31]

adds r2, r3 //Partial [bits 16-47]

Hmm, I see the point. If r0.lo could be cleared, then the C could be stored there and a ror #16 could move r0.hi down and the C into bit 16.
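
Sketched in C, the data flow of that idea looks roughly like this. The names `ror32` and `merge_by_ror16` are made up, and the sketch says nothing about cycle counts: how to clear r0.lo on the M0 without disturbing C is exactly the open question.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit rotate right, valid for n in 1..31. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* sum: 32-bit result of the adds; carry: the C flag it produced. */
static uint32_t merge_by_ror16(uint32_t sum, unsigned carry)
{
    uint32_t t = sum & 0xFFFF0000u;   /* low halfword cleared          */
    t |= carry & 1u;                  /* C stored in bit 0             */
    return ror32(t, 16);              /* high half -> [15:0], C -> bit 16 */
}
```

The result is the same (C << 16) + (sum >> 16) value that the movs/adcs/lsls/lsrs/adds sequence produces.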