I have asked just about everyone whether there is a fast way to find the top 32 bits of a 32-bit x 32-bit multiply. There were multiply instructions that returned all 64 bits, but they take 17 or 18 cycles and spend part of that on work that is never used, so:

MULSHIFT32:
    lsrs r3, r0, #16 //Factor0 hi [16:31]
    uxth r0, r0 //Factor0 lo [0:15]
    uxth r2, r1 //Factor1 lo [0:15]
    lsrs r1, r1, #16 //Factor1 hi [16:31]
    muls r0, r1 //Factor0 lo * Factor1 hi
    muls r2, r3 //Factor1 lo * Factor0 hi
    muls r1, r3 //Factor1 hi * Factor0 hi
    adds r0, r2 //(Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)
    movs r2, #0 //
    adcs r2, r2 //C --> bit 16 (r2 contains $00000000 or $00010000)
    lsls r2, r2, #16 //
    lsrs r3, r0, #16 //Extract partial result [bits 16-31]
    adds r2, r3 //Partial [bits 16-47]
    adds r1, r2 //Results [bit 32-63]

Now the problem I have is that I cannot find my copy of the red book (Joseph Yiu's book on programming the M0 & M0+).

It currently takes 4 instructions to move C into bit 16 of a register, which looks like it MAY be possible to speed up, so that rather than two ADDS at the end there is a single ADDS Rd, Rn, Rm, since all registers are low.

So, now we are getting somewhere. I should add that my good friend Sarah Avory wrote the logic in C and simply tested it with every possible value to check it was correct. She was also able to save a cycle, which seems tiny by today's standards, but in certain applications MULSHIFT32 is used millions of times a second.
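For anyone who wants to follow the arithmetic without an assembler, here is a rough C model of the routine as posted. It is only a sketch: this is not Sarah's test code, and the names mulhi_ref and mulshift32_model are made up for illustration. It assumes unsigned operands. Note that the routine never forms the lo x lo product, so the model can come out one lower than the high word of the full 64-bit product.

```c
#include <stdint.h>

/* Reference: high word of the full 64-bit unsigned product. */
uint32_t mulhi_ref(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Hypothetical model of the posted Thumb routine (r0 = a, r1 = b). */
uint32_t mulshift32_model(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16;        /* lsrs r3, r0, #16 */
    uint32_t a_lo = a & 0xFFFFu;    /* uxth r0, r0      */
    uint32_t b_lo = b & 0xFFFFu;    /* uxth r2, r1      */
    uint32_t b_hi = b >> 16;        /* lsrs r1, r1, #16 */

    uint32_t cross0 = a_lo * b_hi;  /* muls r0, r1 */
    uint32_t cross1 = b_lo * a_hi;  /* muls r2, r3 */
    uint32_t top    = b_hi * a_hi;  /* muls r1, r3 */

    uint32_t sum   = cross0 + cross1;   /* adds r0, r2 (may wrap)   */
    uint32_t carry = sum < cross0;      /* the C flag from that add */

    /* movs/adcs/lsls put C at bit 16; lsrs/adds add sum[31:16]. */
    uint32_t mid = (carry << 16) + (sum >> 16);

    return top + mid;               /* adds r1, r2 */
}
```

With both inputs at 0xFFFFFFFF the reference gives 0xFFFFFFFE and the model gives 0xFFFFFFFD, which shows the one-count difference from the dropped lo x lo term.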
Sean Dunlevy said: Sarah Avory
Ha, she seems to be a 6502 addict :-) Please guide her to the Lynx, she'd love this machine.
She wrote a Lynx game. The PC Engine was the best 6502-based console. Sarah is coding C64 and PS5 at the same time. YES, she is a crazy lady.
MULSHIFT32:
    lsrs r3,r0,#16 //Factor0 hi [16:31]
    uxth r0,r0 //Factor0 lo [0:15]
    uxth r2,r1 //Factor1 lo [0:15]
    lsrs r1,r1,#16 //Factor1 hi [16:31]
    muls r0,r1 //Factor0 lo * Factor1 hi
    muls r2,r3 //Factor1 lo * Factor0 hi
    muls r1,r3 //Factor1 hi * Factor0 hi
    adds r0,r2 //(Factor0 lo * Factor1 hi) + (Factor1 lo * Factor0 hi)
    rors r2,r0,#1
    lsrs r2,r2,#15
;   movs r2,#0 //
;   adcs r2,r2 //C --> bit 16 (r2 contains $00000000 or $00010000)
;   lsls r2,r2,#16 //
;   lsrs r3,r0,#16 //Extract partial result [bits 16-31]
    adds r2,r3 //Partial [bits 16-47]
    adds r1,r2 //Results [bit 32-63]

The online quick reference suggests that the amount to ROR must be in a register, but the Keil site states that, like other shifts, it can be an immediate. If so, the above would be a 12-cycle version. Now that all the junk between the ADDS that sets C and the point where that C is needed has been removed, I am wondering whether C can be added to r2 or r3 before the shifts.

So the above is 12 cycles? Can we go faster? I can put up Sarah's C test if you like.
No, ROR is only with a register :(
So we still need a clever trick to move r0.hi to r0.lo and move C into r0.hi.
https://www.keil.com/support/man/docs/armasm/armasm_dom1361289891242.htm

At the moment I am looking at developing MP3 on Raspberry Pi Pico so I can work out how low the clock can go and still complete each frame of audio.
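As a back-of-envelope sketch of that "how low can the clock go" question: assuming MPEG-1 Layer III, each frame carries 1152 samples per channel, so a frame's decode has to finish within 1152 / sample_rate seconds. The function name min_clock_hz and the cycles_per_frame input are hypothetical; the cycle count would have to come from measuring the real decoder.

```c
#include <stdint.h>

#define SAMPLES_PER_FRAME 1152u  /* assumption: MPEG-1 Layer III, per channel */

/* Lowest core clock in Hz (rounded up) that completes one frame's decode
   before the next frame is due. cycles_per_frame is a measured input. */
uint64_t min_clock_hz(uint64_t cycles_per_frame, uint32_t sample_rate_hz)
{
    return (cycles_per_frame * sample_rate_hz + SAMPLES_PER_FRAME - 1u)
           / SAMPLES_PER_FRAME;
}
```

For example, a decoder that needs 115200 cycles per frame at 44100 Hz would need a clock of at least 4410000 Hz under these assumptions, before leaving any headroom for interrupts or I/O.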
You should look into the Armv6-M reference manual. It does not list the immediate version.
Armv7-M RM:
A7.7.116 ROR (immediate)
Encoding T1 ARMv7-M
ROR{S}<c> <Rd>,<Rm>,#<imm5>
A7.7.117 ROR (register)
Encoding T1 All versions of the Thumb instruction set.
RORS <Rdn>,<Rm> Outside IT block.
ROR<c> <Rdn>,<Rm> Inside IT block.
Armv6-M RM:
A6.7.54 ROR (register)
Encoding T1 All versions of the Thumb instruction set.
RORS <Rdn>,<Rm>
Also :( "RORS" does move the LSB into C, but not C into the MSB. For this RRX is needed, which does not exist in Armv6-M.
It seems ARM did not foresee that someone would want to do DSP stuff on the CM0 :-)
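To make the ROR versus RRX point concrete, here is a small C sketch of a one-bit rotate with and without the carry coming in, plus the value the multiply routine actually wants (the 33-bit carry:sum shifted right by 16). The function names are invented for this illustration, and rrx only models an instruction that Armv6-M does not have.

```c
#include <stdint.h>

/* ROR by 1: bit 0 wraps around to bit 31; the incoming carry is ignored. */
uint32_t ror1(uint32_t x)
{
    return (x >> 1) | (x << 31);
}

/* RRX (modelled only, not available on Armv6-M): carry enters at bit 31. */
uint32_t rrx(uint32_t x, uint32_t carry_in)
{
    return ((carry_in & 1u) << 31) | (x >> 1);
}

/* What the routine needs: bits [32:16] of the 33-bit value carry:sum. */
uint32_t carry_sum_shr16(uint32_t sum, uint32_t carry)
{
    return ((carry & 1u) << 16) | (sum >> 16);
}
```

With sum = 0xFFFC0002 and carry = 1, rrx(sum, 1) >> 15 gives 0x1FFFC, the same as carry_sum_shr16, while ror1(sum) >> 15 gives 0xFFFC because the carry never makes it into the register.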
Very annoying that 4 of the 4 instructions are managing C/NB.

Although I get about 3 MIPS on a 256kb/s MP3 stream, it's still a stretch. I will call the code inside NMI so I can use SP (R13) as an extra hi register, but unrolling, while simple, makes BIG code, so it's going to be.... interesting ;-)
I guess ROM/FLASH is the smallest problem ;-)