Bit Trimming - A New Approach to Optimizing Thumb 32-Bit x 32-Bit --> (Top) 32-Bit Calculations

The fastest way to get the top 32 bits requires trading a small amount of accuracy for speed. If the result must be exact, there are no "tricks" that avoid the 4 partial products on a pure 16-bit multiplier architecture.
However, if you can tolerate a tiny error - similar to the log tables or slide rules you mentioned - you can skip the "bottom" math entirely.

The "Trimming" Trick (3 Multiplies)

The mathematical expansion of \(A \times B\) is:
\((A_{hi}B_{hi}\ll 32)+(A_{hi}B_{lo}\ll 16)+(A_{lo}B_{hi}\ll 16)+(A_{lo}B_{lo})\)

To get the top 32 bits, the term \(A_{lo}B_{lo}\) is almost useless. It only contributes to the top 32 bits if it generates a carry that ripples all the way up through the middle terms.
The Optimization:
  1. Calculate \(A_{hi} \times B_{hi}\) (This is your base result).
  2. Calculate \(A_{hi} \times B_{lo}\), shift it right by 16, and add it.
  3. Calculate \(A_{lo} \times B_{hi}\), shift it right by 16, and add it.
  4. Ignore \(A_{lo} \times B_{lo}\) entirely.
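The steps above can be sketched in C as follows (the function name `mulhi32_approx` is mine, not from the original; each 16x16 product fits in 32 bits, so no 64-bit arithmetic is needed):

```c
#include <stdint.h>

/* Approximate the top 32 bits of a 32x32-bit product using only
 * three 16x16->32 multiplies. The A_lo*B_lo partial product is
 * dropped ("trimmed"), and the low 16 bits of each cross product
 * are discarded by the shifts, so the result can be up to 2 LSB low. */
static uint32_t mulhi32_approx(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;

    return a_hi * b_hi             /* step 1: base result       */
         + ((a_hi * b_lo) >> 16)   /* step 2: middle term >> 16 */
         + ((a_lo * b_hi) >> 16);  /* step 3: middle term >> 16 */
    /* step 4: a_lo * b_lo is ignored entirely */
}
```

Because only positive contributions are dropped, the approximation never exceeds the exact top-32 result.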
Error Rate Analysis
I ran a simulation of 1,000,000 random 32-bit integer multiplications comparing the full 64-bit result against the 3-multiply approximation above:
  • Error Frequency: ~69.4% of the time, the result is slightly "off."
  • Error Magnitude: The error is almost always exactly 1 LSB, with a maximum theoretical error of 2 LSB.
  • Speed Gain: You save 25% of your multiplication cycles and several shift/add instructions.
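A minimal sketch of such a simulation (my own harness, using a small xorshift32 PRNG for reproducibility; the original's random source is unknown):

```c
#include <stdint.h>

/* The 3-multiply approximation described above. */
static uint32_t mulhi32_approx(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;
    return a_hi * b_hi + ((a_hi * b_lo) >> 16) + ((a_lo * b_hi) >> 16);
}

/* Standard xorshift32 PRNG so runs are repeatable. */
static uint32_t xorshift32(uint32_t *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Run n random trials; store the fraction of trials with any error
 * in *err_freq and return the worst-case error in LSB. */
static uint32_t simulate(long n, double *err_freq)
{
    uint32_t seed = 0x12345678u;
    uint32_t max_err = 0;
    long errors = 0;

    for (long i = 0; i < n; i++) {
        uint32_t a = xorshift32(&seed), b = xorshift32(&seed);
        uint32_t exact = (uint32_t)(((uint64_t)a * b) >> 32);
        uint32_t err = exact - mulhi32_approx(a, b); /* approx never exceeds exact */
        if (err) errors++;
        if (err > max_err) max_err = err;
    }
    *err_freq = (double)errors / (double)n;
    return max_err;
}
```

On uniform random inputs this reproduces the shape of the figures quoted above: an error rate of roughly 69% and a maximum error of 2 LSB.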

So in cases such as MP3 decode, the Thumb instruction set allows us to break the 17-cycle record that has stood for many years.

mul32x32:

uxth r2,r0 // b (low)
lsrs r0,r0,#16 // a (high)
lsrs r3,r1,#16 // c (high)
uxth r1,r1 // d (low)
movs r4,r1 // d

muls r1,r2 // bd
muls r4,r0 // ad
muls r0,r3 // ac
muls r3,r2 // bc

lsls r2,r4,#16 // ad low half, shifted into place
lsrs r4,r4,#16 // ad high half
adds r1,r2 // low += ad<<16 (sets carry)
adcs r0,r4 // high += ad>>16 + carry
lsls r2,r3,#16 // bc low half, shifted into place
lsrs r3,r3,#16 // bc high half
adds r1,r2 // low += bc<<16 (sets carry)
adcs r0,r3 // high += bc>>16 + carry
// result: r0 = top 32 bits, r1 = bottom 32 bits
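As a sanity check, here is a straight C transcription of the register-level steps in that listing (the variable names follow the listing's comments; the function name `mul32x32_model` is mine):

```c
#include <stdint.h>

/* C model of the Thumb sequence above: split both operands into
 * 16-bit halves, form the four partial products, then fold the two
 * cross products into a hi:lo pair exactly as the adds/adcs chain does. */
static void mul32x32_model(uint32_t r0_in, uint32_t r1_in,
                           uint32_t *hi, uint32_t *lo)
{
    uint32_t b = r0_in & 0xFFFFu;   /* uxth r2,r0     */
    uint32_t a = r0_in >> 16;       /* lsrs r0,r0,#16 */
    uint32_t c = r1_in >> 16;       /* lsrs r3,r1,#16 */
    uint32_t d = r1_in & 0xFFFFu;   /* uxth r1,r1     */

    uint32_t bd = b * d;            /* muls r1,r2 */
    uint32_t ad = a * d;            /* muls r4,r0 */
    uint32_t ac = a * c;            /* muls r0,r3 */
    uint32_t bc = b * c;            /* muls r3,r2 */

    uint32_t low = bd, high = ac, carry;

    carry = (low + (ad << 16) < low);  /* adds r1,r2 sets C on wraparound */
    low  += ad << 16;
    high += (ad >> 16) + carry;        /* adcs r0,r4 */

    carry = (low + (bc << 16) < low);  /* adds r1,r2 sets C on wraparound */
    low  += bc << 16;
    high += (bc >> 16) + carry;        /* adcs r0,r3 */

    *hi = high;                        /* r0: top 32 bits    */
    *lo = low;                         /* r1: bottom 32 bits */
}
```

The model agrees bit-for-bit with a native 64-bit multiply, which is a useful cross-check before counting cycles on real hardware.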

In full disclosure, I finally found a reasonable case for using an AI, and I would be interested to know if others find this useful. If your final sample value is a 16-bit number and the mixing (of frequencies) uses 32-bit numbers, is this a practical solution for MP2/MP2.5/MP3 decoders written in optimized Thumb? Some trial-and-error (no pun intended) testing will be needed, but it could save a lot of CPU bandwidth: those decoders typically mix 8-12 frequency ranges together, and it is only at the last stage that the stream of 32-bit values is finally put into a double-buffered storage area for the DMA to pass to the DAC.

I just felt in my bones that it SHOULD be possible in 16 instructions, but I have wasted too much time and too many methodologies only to discover that 17 seemed to be the 'hard limit'. I still wonder if, on some now-discontinued processor somewhere, a lone programmer found a trick to get it into 16. This happens. I had to teach the next generation how using 128 bytes of the stack was the optimal way for a C64 sprite multiplexor to 'sort' the Y values; Wizball was the first place I saw it 'used in anger'. And while AI is fine for general high-level solutions, it would need to understand a great deal about the possible cases to accept that an error rate of the size involved was fine. (If after 8 checks you have no slot, the C64 couldn't display the sprite in any case.)

So it WOULD be lovely to hear from programmers in their dotage just noting 'oh, we found this one in 1986'. Too much amazing code is lost to us.