I am a big fan of the M0+, but I keep running into the problem that only the ADD, CMP and MOV instructions can operate on the high registers, with the exception of SP when forming a stack frame. While access to that frame takes 2 cycles instead of 1, if a reasonable number of variables is needed, SP is the only way to go. I have spent ages trying to find an efficient assembly version of this:

void FDCT32(int *buf, int *dest, int offset, int oddBlock, int gb) {
    a0 = buf[i]; a3 = buf[31-i]; b0 = a0 + a3; b3 = MULSHIFT32(*cptr++, a0 - a3) << (s0);
    a1 = buf[15-i]; a2 = buf[16+i]; b1 = a1 + a2; b2 = MULSHIFT32(*cptr++, a1 - a2) << (s1);
    buf[i] = b0 + b1; buf[15-i] = MULSHIFT32(*cptr, b0 - b1) << (s2);
    buf[16+i] = b2 + b3; buf[31-i] = MULSHIFT32(*cptr++, b3 - b2) << (s2);
}
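
For anyone not familiar with it, this is roughly what MULSHIFT32 has to compute, written as a C sketch. The names mulshift32_ref and mulshift32_16bit are made up for illustration. The second function uses the well-known 16-bit partial-product scheme (as described in Hacker's Delight) to mimic a core whose multiply only returns the low 32 bits, as MULS does on the M0+. It is not the routine I am actually using, and it assumes that >> on a negative value is an arithmetic shift.

```c
#include <stdint.h>

/* Reference: top 32 bits of the signed 64-bit product. */
static int32_t mulshift32_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Same result using only 32-bit multiplies of 16-bit halves.
   Every intermediate value fits in 32 bits. */
static int32_t mulshift32_16bit(int32_t x, int32_t y)
{
    uint32_t x0 = (uint32_t)x & 0xFFFFu;   /* low halves, unsigned */
    uint32_t y0 = (uint32_t)y & 0xFFFFu;
    int32_t  x1 = x >> 16;                 /* high halves, signed  */
    int32_t  y1 = y >> 16;

    uint32_t w0 = x0 * y0;                               /* lo * lo            */
    int32_t  t  = x1 * (int32_t)y0 + (int32_t)(w0 >> 16); /* hi * lo + carry   */
    int32_t  w1 = t & 0xFFFF;
    int32_t  w2 = t >> 16;

    w1 = (int32_t)x0 * y1 + w1;                          /* lo * hi + low part */
    return x1 * y1 + w2 + (w1 >> 16);                    /* hi * hi + carries  */
}
```

That is four multiplies plus the shifting and masking around them, which is where the low registers get eaten up.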

Now, MULSHIFT32 performs a 32-bit x 32-bit multiply, but only the top 32 bits of the result are needed. Sadly, it looks like an algorithm that computes only the top half is no faster than the full multiply (unless someone knows one).

In the above you will note that I need two pointer registers, i.e. buf and cptr. I DID consider using the SP as the base address for cptr++, but even going to those lengths I find myself running out of registers. I might add that R0-R4 are used by MULSHIFT32, so while I can use them between MULSHIFT32 operations, I'm only left with R5, R6 and R7, and at least one of them needs to be a pointer.

Do others simply define a stack frame, given that shuffling values lo-to-hi and hi-to-lo means the high registers end up no faster?

I would LIKE to store the values 31 and 15 plus the 3 shift values (5-bit) in a single register but, once again, that uses a low register.

I HAVE spent a lot of time on this, as I believe a technique for dealing with this snippet will answer questions/problems seen all through the code.

Many thanks.
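
P.S. To make the packing idea concrete, this is the sort of layout I have in mind, sketched in C. The field order, the names pack_consts and unpack_field, and the shift values used in the example are assumptions for illustration only, not values from the real code.

```c
#include <stdint.h>

#define FIELD_BITS 5u      /* each field is 5 bits wide */
#define FIELD_MASK 0x1Fu

/* Pack two index constants and three shift counts into one 32-bit word.
   Field 0 = first index, field 1 = second index, fields 2..4 = s0, s1, s2. */
static uint32_t pack_consts(uint32_t idx_a, uint32_t idx_b,
                            uint32_t s0, uint32_t s1, uint32_t s2)
{
    return  (idx_a & FIELD_MASK)
         | ((idx_b & FIELD_MASK) << (1u * FIELD_BITS))
         | ((s0    & FIELD_MASK) << (2u * FIELD_BITS))
         | ((s1    & FIELD_MASK) << (3u * FIELD_BITS))
         | ((s2    & FIELD_MASK) << (4u * FIELD_BITS));
}

/* Extract field n (0..4): one shift and one mask per access. */
static uint32_t unpack_field(uint32_t packed, unsigned n)
{
    return (packed >> (n * FIELD_BITS)) & FIELD_MASK;
}
```

For example, pack_consts(31, 15, 1, 2, 3) uses 25 of the 32 bits. The catch is the one I mentioned: the shift and mask in unpack_field need low registers on the M0+, so the packed word has to be copied down from a high register (or loaded from the stack) before each use.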