So in cases such as MP3 decode, the Thumb instruction set allows us to break that 17-cycle record held by 42Bastian Schick many years ago.
mul32x32:
uxth r2,r0 // b (low) lsrs r0,r0,#16 // a (high) lsrs r3,r1,#16 // c (high) uxth r1,r1 // d (low) movs r4,r1 // d
muls r1,r2 // bd muls r4,r0 // ad muls r0,r3 // ac muls r3,r2 // bc
lsls r2,r4,#16 lsrs r4,r4,#16 adds r1,r2 adcs r0,r4 lsls r2,r3,#16 lsrs r3,r3,#16 adds r1,r2 adcs r0,r3
In full disclosure, I finally found a reasonable case for using an AI. I would be interested to know if others find this useful. If your final sample value is a 16-bit number and mixing (of frequencies) uses 32-bit numbers, is this a practical solution for MP2/MP2.5/MP3 decoders written in optimized Thumb? Some trial-and-error (no pun intended) testing will be needed but as well as saving a lot of CPU bandwidth (those decoders typically mix 8-12 frequenciy ranges together and it's only at the last stage where the stream of 32-bit values are finally put into a double-buffered sorage area for the DMA to pass to the DAC.I just felt in my bones that it SHOULD be possible in 16 instructions anyway but I have wasted too much time and many methadoigies to discover that 17 seemed to be the 'hard limit', I still wonder if some now discontinued processor somewhere, a lone programmer found a trick to get it into 16. This happens. I had to teach the next generation on how using 128 bytes of the stack was the optimal way for a C64 sprite-multiplexor to 'sort' the Y values. Wizball was the first place I saw it 'used in anger' but while AI is fine for general high-level solutions, it would need to understand so much about the possible cases to accept an error-rate of the size involved was fine. If after 8 checks you have no slot - the C64 couldn't display the sprite in any case.So it WOULD be lovely to hear from programmers in their dotage just noting 'oh, we found this on in 1986'. Too much amazing code is lost to us.