I have spent a long time trying to find the fastest ARM M0+ macros for C code, but I am trying to write for an M0+ processor and I have discovered that, for an MP3 decoder, the processor's ability comes down to its speed in the polyphase section. Just 9 lines of code, in fact.

    #define MC2M(x) { \
        c1 = *coef; coef++; c2 = *coef; coef++; \
        vLo = *(vb1+(x)); vHi = *(vb1+(23-(x))); \
        sum1L = MADD64(sum1L, vLo, c1); sum2L = MADD64(sum2L, vLo, c2); \
        sum1L = MADD64(sum1L, vHi, -c2); sum2L = MADD64(sum2L, vHi, c1); \
    }

Above is the original C.

    LDR   r12,[r2],#4      ;c1 = *coef++
    LDR   r14,[r2],#4      ;c2 = *coef++
    LDR   r0,[r1,#0]       ;vLo = *(vb1+(x))
    LDR   r3,[r1,#0x5c]    ;vHi = *(vb1+(23-(x)))
    SMLAL r4,r5,r0,r12     ;sum1L = MADD64(sum1L, vLo, c1)
    SMLAL r6,r7,r0,r14     ;sum2L = MADD64(sum2L, vLo, c2)
    RSB   r14,r14,#0       ;-c2
    SMLAL r4,r5,r3,r14     ;sum1L = MADD64(sum1L, vHi, -c2)
    SMLAL r6,r7,r3,r12     ;sum2L = MADD64(sum2L, vHi, c1)

Above is the hand-written Thumb-2 code.
Now as you can see, it's only a tiny section of code, but the M0+ (Thumb) code has to manage the same feat in a situation where only r0-r7 can be used for most instructions. The only exceptions are MOV, CMP and ADD. The ADD does not set the flags, so a CMP would be needed to test for overflow. Order isn't vital, so it would be possible to work on c1 and then c2.
I have decided that the code will run within an NMI, allowing SP to act as a register with its own powerful addressing mode, BUT it's also the most convenient and fastest way to store values that will not fit into even the hi registers.
Jens provided us all with a 17-cycle 32-bit x 32-bit --> 64-bit signed multiply which corrupts r0-r5 while giving the results in r0 & r1. So, a second pointer is needed, thus only 2 lo registers are free.
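As a reference point for checking any hand-written M0+ version, one MC2M step can be modelled in portable C. This is only a sketch: the thread never shows MADD64, so `madd64` below is an assumed definition (64-bit accumulate of a signed 32 x 32 product), and `mc2m_step` is a hypothetical name for the macro body turned into a function.

```c
#include <stdint.h>

/* Assumed meaning of MADD64: sum + x*y with a full 64-bit product.
   The original macro's definition is not shown in the thread. */
static int64_t madd64(int64_t sum, int32_t x, int32_t y)
{
    return sum + (int64_t)x * (int64_t)y;
}

/* One MC2M(x) step as a function (hypothetical name).
   coef points at the pair (c1, c2); vb1 must hold at least 24 samples.
   Assumes c2 != INT32_MIN so that -c2 does not overflow.
   Returns the advanced coefficient pointer. */
static const int32_t *mc2m_step(const int32_t *coef, const int32_t *vb1, int x,
                                int64_t *sum1L, int64_t *sum2L)
{
    int32_t c1  = *coef++;
    int32_t c2  = *coef++;
    int32_t vLo = vb1[x];
    int32_t vHi = vb1[23 - x];

    *sum1L = madd64(*sum1L, vLo, c1);
    *sum2L = madd64(*sum2L, vLo, c2);
    *sum1L = madd64(*sum1L, vHi, -c2);
    *sum2L = madd64(*sum2L, vHi, c1);
    return coef;
}
```

Running the same inputs through this model and through an assembly routine on the target gives a bit-exact comparison for the two accumulators.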
I'm afraid I don't even have a C compiler, working only in assembly language, and although I have found several solutions, none are efficient and, more importantly, they are ugly.
The fact that only lo-lo ADD instructions modify C means that at least the high part of the results would have to be in lo registers, so that an ADD, CMP, ADCS sequence would solve the overflow problem.
It IS possible, it's just a nightmare to work out whether there is a specific optimisation. I promise I have spent weeks looking before asking others. I now know that this fragment is basically 50% of MP3 decode time.
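The ADD, CMP, ADCS idea can be written out in C to make the carry rule explicit. This is a minimal sketch, not the M0+ code itself; `acc64` and `acc64_add` are made-up names, and the 64-bit accumulator is deliberately split into two 32-bit words the way it would sit in registers.

```c
#include <stdint.h>

/* A 64-bit accumulator held as two 32-bit words (hypothetical type). */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} acc64;

/* Add the 64-bit value (add_hi:add_lo) using only 32-bit operations. */
static void acc64_add(acc64 *acc, uint32_t add_lo, uint32_t add_hi)
{
    acc->lo += add_lo;                    /* like a flag-less ADD: wraps mod 2^32 */
    uint32_t carry = (acc->lo < add_lo);  /* the sum wrapped iff it is below the addend */
    acc->hi += add_hi + carry;            /* like ADCS: high words plus carry */
}
```

One detail worth double-checking in any assembly version: the test has to be a strict "sum below addend". When the old low word is exactly zero, the sum equals the addend and no carry occurred, so a compare that also fires on equality would add a spurious 1 to the high word.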
After some brain racking, I think you can save 3 cycles if you unroll:

    ;r0-r4 used by MULSHIFT32
    ;r12 lo & r5 hi of sum1L
    ;r14 lo & r6 hi of sum2L
    ;r7  base-address of vb1+23
    ;r8  c1
    ;r9  c2
    ;r10 vLo/vHi
    ;r11 address of vb1
    MC2M:
        mov   r11,r7       ; vb1+x
        add   r7,#$5c      ; vb1+(23-x)
        REPEAT 23
    .inner_loop:
        pop   r0-r1        ;get c1 & c2
        mov   r9,r1        ;store c2
        mov   r8,r0        ;store c1
        mov   r2,r11
        ldmia r2!,{r0}
        mov   r11,r2
        mov   r10,r0
        mulshift32
        add   r12,r0       ;
        cmp   r0,r12       ;sum1L += (vLo x c1)
        adcs  r5,r1        ;
        mov   r0,r9        ;c2
        mov   r1,r10       ;vLo
        mulshift32
        add   r14,r0       ;
        cmp   r0,r14       ;sum2L += (vLo x c2)
        adcs  r6,r1        ;
        ldr   r0,[r7]      ;vHi
        subs  r7,r7,#4
        mov   r10,r0
        mov   r1,r9        ;-c2
        neg   r1,r1        ;
        mulshift32
        add   r12,r0       ;
        cmp   r0,r12       ;sum1L += (vHi x -c2)
        adcs  r5,r1        ;
        mov   r0,r8        ;c1
        mov   r1,r10       ;vHi
        mulshift32
        add   r14,r0       ;
        cmp   r0,r14       ;sum2L += (vHi x c1)
        adcs  r6,r1        ;
        ENDR
    ---- 29

After the unrolling, r7 is again vb1.
Another stunning piece of code, Bastian. The mulshift32 is your 32-bit x 32-bit signed multiply. Only the top 32 bits are used, which is why I keep looking and looking. Calculating each 16-bit PCM output takes 32 mulshift32s, so I'm sure you can see how much just 1 cycle less would save. I have looked at 'Hacker's Delight', 'Bit Twiddling' and all of those places. I even looked at Karatsuba multiplication to see if the error (due to overflow when adding ints) can be pushed into the bottom 2 bits...
There IS a form of simplified MP3 decoder which uses signed 16-bit x 16-bit multiplies, but given that the SMULS takes 3-5 cycles, your software solution using 17 cycles is a lot better than people might think...
I do suspect that it will have to be 100% assembly language.
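For anyone following along without the 17-cycle routine to hand (it is not reproduced in this thread), here is a rough C model of what a "top 32 bits of a signed 32 x 32 product" routine has to compute when the only hardware multiply gives a 32-bit result. It follows the 16-bit partial-product scheme from Hacker's Delight, so it is a stand-in for illustration, not the routine discussed above. The names `mulshift32_ref` and `mulshift32_16` are made up, and the code assumes that right-shifting a negative signed value is an arithmetic shift (true for GCC and Clang, and what ASR does on ARM).

```c
#include <stdint.h>

/* Reference: top 32 bits of the 64-bit signed product. */
static int32_t mulshift32_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Same result using only 32-bit multiplies and adds. */
static int32_t mulshift32_16(int32_t u, int32_t v)
{
    uint32_t u0 = (uint32_t)u & 0xFFFFu;   /* low halves, unsigned */
    uint32_t v0 = (uint32_t)v & 0xFFFFu;
    int32_t  u1 = u >> 16;                 /* high halves, signed */
    int32_t  v1 = v >> 16;

    uint32_t w0 = u0 * v0;                                 /* lo x lo */
    int32_t  t  = u1 * (int32_t)v0 + (int32_t)(w0 >> 16);  /* hi x lo plus carry-in */
    int32_t  w1 = (int32_t)u0 * v1 + (t & 0xFFFF);         /* lo x hi plus low half of t */
    int32_t  w2 = t >> 16;                                 /* high half of t */

    return u1 * v1 + w2 + (w1 >> 16);                      /* hi x hi plus carries */
}
```

The four 16-bit partial products and the carry handling between them are where the cycles go, which is why shaving even one instruction from this pattern matters when it runs 32 times per output sample.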
I understand, you have 32-bit fixed-point values. Do you really need the accuracy of summing up all 64 bits of the multiplication?
When writing games or demos, we make assumptions like a resolution of 160x102, so we know of boundaries which cannot be crossed when doing calculations. Are there any for the decoder?
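To put a number on the accuracy question, here is a small C sketch comparing full 64-bit accumulation against summing only the top word of each product. The function names are hypothetical, and whether the resulting error is acceptable for an MP3 decoder is not something this settles; it only shows the arithmetic bound. Since each truncation drops less than one unit of the top word, the two results can differ by at most n-1 for n terms, with the truncated sum never larger.

```c
#include <stdint.h>

/* floor(p / 2^32), written without right-shifting a negative value. */
static int64_t floor_hi32(int64_t p)
{
    int64_t q = p / 4294967296LL;              /* C division truncates toward zero */
    if (p % 4294967296LL < 0) q -= 1;          /* step down when the remainder is negative */
    return q;
}

/* Accumulate all 64 bits, then take the top word.
   The caller must keep n and the inputs small enough that sum cannot overflow. */
static int64_t dot_full(const int32_t *a, const int32_t *b, int n)
{
    int64_t sum = 0;
    for (int i = 0; i < n; i++) sum += (int64_t)a[i] * (int64_t)b[i];
    return floor_hi32(sum);
}

/* Accumulate only the top word of each product. */
static int64_t dot_hi_only(const int32_t *a, const int32_t *b, int n)
{
    int64_t sum = 0;
    for (int i = 0; i < n; i++) sum += floor_hi32((int64_t)a[i] * (int64_t)b[i]);
    return sum;
}
```

With the four-term macro step discussed earlier, that bound is 3 units of the top word per step, and it grows with the number of steps summed into the same accumulator.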