Bit Trimming - A New Approach to Optimizing Thumb 32-Bit x 32-Bit --> (Top) 32-Bit Calculations

The fastest way to get the top 32 bits requires trading a small amount of accuracy for speed. If the result must be exact, there are no "tricks" that avoid the 4 partial products on a pure 16-bit multiplier architecture.
However, if you can tolerate a tiny error - similar to the log tables or slide rules you mentioned - you can skip the "bottom" math entirely.

The "Trimming" Trick (3 Multiplies)

The mathematical expansion of \(A \times B\) is:
\((A_{hi}B_{hi}\ll 32)+(A_{hi}B_{lo}\ll 16)+(A_{lo}B_{hi}\ll 16)+(A_{lo}B_{lo})\)

To get the top 32 bits, the term \(A_{lo}B_{lo}\) is almost useless. It only contributes to the top 32 bits if it generates a carry that ripples all the way up through the middle terms.
The Optimization:
  1. Calculate \(A_{hi} \times B_{hi}\) (This is your base result).
  2. Calculate \(A_{hi} \times B_{lo}\), shift it right by 16, and add it.
  3. Calculate \(A_{lo} \times B_{hi}\), shift it right by 16, and add it.
  4. Ignore \(A_{lo} \times B_{lo}\) entirely.
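The steps above can be sketched in C as follows (the function name `mulhi32_approx` is mine, not from the original; each 16x16 product fits in 32 bits, so no 64-bit arithmetic is needed):

```c
#include <stdint.h>

/* Approximate the top 32 bits of a 32x32-bit product using only
 * three 16x16->32 multiplies. The A_lo*B_lo partial product is
 * dropped ("trimmed"), and the low 16 bits of each cross product
 * are discarded by the shifts, so the result can be up to 2 LSB low. */
static uint32_t mulhi32_approx(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;

    return a_hi * b_hi             /* step 1: base result       */
         + ((a_hi * b_lo) >> 16)   /* step 2: middle term >> 16 */
         + ((a_lo * b_hi) >> 16);  /* step 3: middle term >> 16 */
    /* step 4: a_lo * b_lo is ignored entirely */
}
```

Because only positive contributions are dropped, the approximation never exceeds the exact top-32 result.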
Error Rate Analysis
I ran a simulation of 1,000,000 random 32-bit integer multiplications comparing the full 64-bit result against the 3-multiply approximation above:
  • Error Frequency: ~69.4% of the time, the result is slightly "off."
  • Error Magnitude: The error is almost always exactly 1 LSB, with a maximum theoretical error of 2 LSB.
  • Speed Gain: You save 25% of your multiplication cycles and several shift/add instructions.
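A minimal sketch of such a simulation (my own harness, using a small xorshift32 PRNG for reproducibility; the original's random source is unknown):

```c
#include <stdint.h>

/* The 3-multiply approximation described above. */
static uint32_t mulhi32_approx(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;
    return a_hi * b_hi + ((a_hi * b_lo) >> 16) + ((a_lo * b_hi) >> 16);
}

/* Standard xorshift32 PRNG so runs are repeatable. */
static uint32_t xorshift32(uint32_t *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Run n random trials; store the fraction of trials with any error
 * in *err_freq and return the worst-case error in LSB. */
static uint32_t simulate(long n, double *err_freq)
{
    uint32_t seed = 0x12345678u;
    uint32_t max_err = 0;
    long errors = 0;

    for (long i = 0; i < n; i++) {
        uint32_t a = xorshift32(&seed), b = xorshift32(&seed);
        uint32_t exact = (uint32_t)(((uint64_t)a * b) >> 32);
        uint32_t err = exact - mulhi32_approx(a, b); /* approx never exceeds exact */
        if (err) errors++;
        if (err > max_err) max_err = err;
    }
    *err_freq = (double)errors / (double)n;
    return max_err;
}
```

On uniform random inputs this reproduces the shape of the figures quoted above: an error rate of roughly 69% and a maximum error of 2 LSB.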

So in cases such as MP3 decode, the Thumb instruction set allows us to break the 17-cycle record that has stood for many years.

mul32x32:

uxth r2,r0 // b (low)
lsrs r0,r0,#16 // a (high)
lsrs r3,r1,#16 // c (high)
uxth r1,r1 // d (low)
movs r4,r1 // d

muls r1,r2 // bd
muls r4,r0 // ad
muls r0,r3 // ac
muls r3,r2 // bc

lsls r2,r4,#16 // ad low half, shifted into place
lsrs r4,r4,#16 // ad high half
adds r1,r2 // low += ad<<16 (sets carry)
adcs r0,r4 // high += ad>>16 + carry
lsls r2,r3,#16 // bc low half, shifted into place
lsrs r3,r3,#16 // bc high half
adds r1,r2 // low += bc<<16 (sets carry)
adcs r0,r3 // high += bc>>16 + carry
// result: r0 = top 32 bits, r1 = bottom 32 bits
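As a sanity check, here is a straight C transcription of the register-level steps in that listing (the variable names follow the listing's comments; the function name `mul32x32_model` is mine):

```c
#include <stdint.h>

/* C model of the Thumb sequence above: split both operands into
 * 16-bit halves, form the four partial products, then fold the two
 * cross products into a hi:lo pair exactly as the adds/adcs chain does. */
static void mul32x32_model(uint32_t r0_in, uint32_t r1_in,
                           uint32_t *hi, uint32_t *lo)
{
    uint32_t b = r0_in & 0xFFFFu;   /* uxth r2,r0     */
    uint32_t a = r0_in >> 16;       /* lsrs r0,r0,#16 */
    uint32_t c = r1_in >> 16;       /* lsrs r3,r1,#16 */
    uint32_t d = r1_in & 0xFFFFu;   /* uxth r1,r1     */

    uint32_t bd = b * d;            /* muls r1,r2 */
    uint32_t ad = a * d;            /* muls r4,r0 */
    uint32_t ac = a * c;            /* muls r0,r3 */
    uint32_t bc = b * c;            /* muls r3,r2 */

    uint32_t low = bd, high = ac, carry;

    carry = (low + (ad << 16) < low);  /* adds r1,r2 sets C on wraparound */
    low  += ad << 16;
    high += (ad >> 16) + carry;        /* adcs r0,r4 */

    carry = (low + (bc << 16) < low);  /* adds r1,r2 sets C on wraparound */
    low  += bc << 16;
    high += (bc >> 16) + carry;        /* adcs r0,r3 */

    *hi = high;                        /* r0: top 32 bits    */
    *lo = low;                         /* r1: bottom 32 bits */
}
```

The model agrees bit-for-bit with a native 64-bit multiply, which is a useful cross-check before counting cycles on real hardware.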

In full disclosure, I finally found a reasonable case for using an AI, and I would be interested to know if others find this useful. If your final sample value is a 16-bit number and the mixing (of frequencies) uses 32-bit numbers, is this a practical solution for MP2/MP2.5/MP3 decoders written in optimized Thumb? Some trial-and-error (no pun intended) testing will be needed, but it could save a lot of CPU bandwidth: those decoders typically mix 8-12 frequency ranges together, and it is only at the last stage that the stream of 32-bit values is finally put into a double-buffered storage area for the DMA to pass to the DAC.

I just felt in my bones that it SHOULD be possible in 16 instructions, but I have wasted too much time and too many methodologies only to discover that 17 seemed to be the 'hard limit'. I still wonder if, on some now-discontinued processor somewhere, a lone programmer found a trick to get it into 16. This happens. I had to teach the next generation how using 128 bytes of the stack was the optimal way for a C64 sprite multiplexor to 'sort' the Y values; Wizball was the first place I saw it 'used in anger'. And while AI is fine for general high-level solutions, it would need to understand a great deal about the possible cases to accept that an error rate of the size involved was fine. (If after 8 checks you have no slot, the C64 couldn't display the sprite in any case.)

So it WOULD be lovely to hear from programmers in their dotage just noting 'oh, we found this one in 1986'. Too much amazing code is lost to us.