
M4 (Thumb2) to M0+ (Thumb) assembly language

I have spent a long time trying to find the fastest ARM M0+ macros for C code. I am writing for an M0+ processor, and I have discovered that for an MP3 decoder the processor's performance comes down to its speed in the polyphase section. Just 9 lines of code, in fact.

#define MC2M(x) \
{ \
c1 = *coef; \
coef++; \
c2 = *coef; \
coef++; \
vLo = *(vb1+(x)); \
vHi = *(vb1+(23-(x))); \
sum1L = MADD64(sum1L, vLo, c1); \
sum2L = MADD64(sum2L, vLo, c2); \
sum1L = MADD64(sum1L, vHi, -c2); \
sum2L = MADD64(sum2L, vHi, c1); \
}

Above is the original C.
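
For readers without the decoder source to hand, here is a minimal C sketch of what one MC2M step computes. It assumes MADD64 is a plain 64-bit multiply-accumulate (sum + x * y); that definition is not shown in this thread, so treat it as an assumption. The function name mc2m_step and its pointer parameters are hypothetical, introduced only so the step can be called and checked on its own.

```c
#include <stdint.h>

/* Assumption: MADD64 is a 64-bit multiply-accumulate, sum + (int64)x * y.
   The thread does not show its definition; this one is illustrative. */
static inline int64_t MADD64(int64_t sum, int32_t x, int32_t y)
{
    return sum + (int64_t)x * y;
}

/* One polyphase step written as a function instead of a macro.
   mc2m_step is a hypothetical name; the arithmetic follows the MC2M lines.
   Note: -c2 is undefined for c2 == INT32_MIN, exactly as in the macro. */
static void mc2m_step(const int32_t **coef, const int32_t *vb1, int x,
                      int64_t *sum1L, int64_t *sum2L)
{
    int32_t c1 = *(*coef)++;      /* c1 = *coef; coef++; */
    int32_t c2 = *(*coef)++;      /* c2 = *coef; coef++; */
    int32_t vLo = vb1[x];         /* vLo = *(vb1+(x)) */
    int32_t vHi = vb1[23 - x];    /* vHi = *(vb1+(23-(x))) */

    *sum1L = MADD64(*sum1L, vLo, c1);
    *sum2L = MADD64(*sum2L, vLo, c2);
    *sum1L = MADD64(*sum1L, vHi, -c2);
    *sum2L = MADD64(*sum2L, vHi, c1);
}
```

Each step therefore costs two loads of coefficients, two loads of samples and four 32 x 32 --> 64 multiply-accumulates, which is why the cost of the multiply dominates on a core without SMLAL.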

LDR r12,[r2],#4 ;c1 = *coef++
LDR r14,[r2],#4 ;c2 = *coef++
LDR r0,[r1,#0] ;vLo = *(vb1+(x))
LDR r3,[r1,#0x5c] ;vHi = *(vb1+(23-(x)))
SMLAL r4,r5,r0,r12 ;sum1L = MADD64(sum1L, vLo, c1)
SMLAL r6,r7,r0,r14 ;sum2L = MADD64(sum2L, vLo, c2)
RSB r14,r14,#0 ;-c2
SMLAL r4,r5,r3,r14 ;sum1L = MADD64(sum1L, vHi, -c2)
SMLAL r6,r7,r3,r12 ;sum2L = MADD64(sum2L, vHi, c1)

Above is the hand-written Thumb-2 code.

Now as you can see, it's only a tiny section of code, but the M0+ (Thumb) code has to manage the same feat in a situation where only r0-r7 can be used for most instructions. The only exceptions are MOV, CMP and ADD. That hi-register ADD does not set the flags, so a CMP would be needed to test for overflow. Order isn't vital, so it would be possible to work on c1 and then c2.

I have decided that the code will run within an NMI, allowing SP to act as a register with its own powerful addressing mode, BUT it's also the most convenient and fastest way to store values that will not fit into even the hi registers.

Jens provided us all with a 17-cycle 32-bit x 32-bit --> 64-bit signed multiply which corrupts r0-r5 while giving the results in r0 & r1. So a second pointer is needed, and thus only 2 lo registers are free.

I'm afraid I don't even have a C compiler, working only in assembly language, and although I have found several solutions, none are efficient and, more importantly, they are ugly.

The fact that only lo-lo ADD instructions modify C means that at least the high part of the results would have to be in lo registers, so that an ADD, CMP, ADCS sequence would solve the overflow problem.
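
The carry idea behind that ADD, CMP, ADCS sequence can be modelled in C. This is only a sketch of the arithmetic, not M0+ code: the acc64 type and acc64_add name are hypothetical, and the unsigned compare stands in for the CMP that recovers the carry when the low-word add itself did not set the flags.

```c
#include <stdint.h>

/* Hypothetical 64-bit accumulator held as two 32-bit words,
   mirroring a register pair on a core without SMLAL. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} acc64;

/* Add a 64-bit product (p_hi:p_lo) into the accumulator.
   The carry out of the low word is recovered with an unsigned compare:
   if the wrapped sum is smaller than the addend, the add overflowed. */
static void acc64_add(acc64 *a, uint32_t p_lo, uint32_t p_hi)
{
    uint32_t lo = a->lo + p_lo;        /* low-word add, wraps modulo 2^32 */
    uint32_t carry = (lo < p_lo);      /* 1 if the low-word add wrapped */

    a->lo = lo;
    a->hi = a->hi + p_hi + carry;      /* high-word add with carry in */
}
```

Because the arithmetic is modulo 2^64, the same routine works whether the accumulator and product are read as signed or unsigned.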

It IS possible; it's just a nightmare to work out whether there is a specific optimisation. I promise I have spent weeks looking before asking others. I now know that this fragment is basically 50% of MP3 decode time.



  • Jens provided us all with a 17-cycle 32-bit x 32-bit --> 64-bit signed multiply which corrupts r0-r5 while giving the results in r0 & r1

    My version uses r0-r4 only:

    	mov	r12,r4			// save r4 in a hi register (restored at the end)
    	//  17 cycles (if muls takes 1 cycle)
    	// 141 cycles (if muls takes 32 cycles)
    	//
    	// ab*cd
    	// ac
    	//  ad
    	//  bc
    	//   bd
    	// ------
    
    	uxth	r2,r0			// b
    	lsrs	r0,r0,#16		// a
    	lsrs	r3,r1,#16		// c
    	uxth	r1,r1			// d
    	movs	r4,r1			// d
    
    	muls	r1,r2			// bd
    	muls	r4,r0			// ad
    	muls	r0,r3			// ac
    	muls	r3,r2			// bc
    
    	lsls	r2,r4,#16		// ad => d0
    	lsrs	r4,r4,#16		// ad => 0a
    	adds	r1,r1,r2		// bd + d0
    	adcs	r0,r4			// ac + 0a + C
    	lsls	r2,r3,#16		// bc => c0
    	lsrs	r3,r3,#16		// bc => 0b
    	adds	r1,r1,r2		// bd + c0
    	adcs	r0,r3			// ac + 0b + C
    
    	mov	r4,r12
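
The same partial-product scheme can be checked on a desktop with a small C model. This is a sketch under stated assumptions, not part of the posted routine: the function names are hypothetical, and because the listing splits every half-word with uxth and lsrs, the model treats both operands as unsigned. The signed wrapper applies the standard high-word correction (subtract y if x is negative, subtract x if y is negative), which is an addition for illustration and does not appear in the listing.

```c
#include <stdint.h>

/* Model of the listing: 32 x 32 -> 64 built from four 16 x 16 products.
   The listing leaves the high word in r0 and the low word in r1;
   here both are packed into one 64-bit return value. */
static uint64_t umull_16x16(uint32_t x, uint32_t y)
{
    uint32_t b = x & 0xFFFFu, a = x >> 16;    /* x = a:b */
    uint32_t d = y & 0xFFFFu, c = y >> 16;    /* y = c:d */

    uint32_t bd = b * d;
    uint32_t ad = a * d;
    uint32_t ac = a * c;
    uint32_t bc = b * c;

    uint32_t lo = bd, hi = ac, t;

    t = ad << 16;                    /* ad => d0 */
    lo += t;
    hi += (ad >> 16) + (lo < t);     /* ac + 0a + C */

    t = bc << 16;                    /* bc => c0 */
    lo += t;
    hi += (bc >> 16) + (lo < t);     /* ac + 0b + C */

    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical signed wrapper: corrects the high word of the unsigned
   product. Assumes the usual two's complement conversion to int64_t. */
static int64_t smull_from_umull(int32_t x, int32_t y)
{
    uint64_t u = umull_16x16((uint32_t)x, (uint32_t)y);
    uint32_t hi = (uint32_t)(u >> 32);

    if (x < 0) hi -= (uint32_t)y;
    if (y < 0) hi -= (uint32_t)x;

    return (int64_t)(((uint64_t)hi << 32) | (uint32_t)u);
}
```

Comparing both functions against the compiler's native 64-bit multiply over a range of inputs is a cheap way to validate any register-level rewrite before counting cycles.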
