Converting C into M0+

I am a big fan of the M0+, but I keep running into the problem that only the ADD, CMP and MOV instructions can operate on the high registers, the one exception being SP when it is used to form a stack frame. Although each access to that frame takes 2 cycles instead of 1, SP is the only way to go once a reasonable number of variables are needed. I have spent ages trying to find an efficient assembly version of this:
void FDCT32(int *buf, int *dest, int offset, int oddBlock, int gb)
a0 = buf[i];
a3 = buf[31-i];
b0 = a0 + a3;
b3 = MULSHIFT32(*cptr++, a0 - a3) << (s0);

a1 = buf[15-i];
a2 = buf[16+i];
b1 = a1 + a2;
b2 = MULSHIFT32(*cptr++, a1 - a2) << (s1);

buf[i] = b0 + b1;
buf[15-i] = MULSHIFT32(*cptr, b0 - b1) << (s2);

buf[16+i] = b2 + b3;
buf[31-i] = MULSHIFT32(*cptr++, b3 - b2) << (s2);

Now MULSHIFT32 performs a 32-bit x 32-bit multiply, but only the top 32 bits of the result are needed. Sadly, it looks like an algorithm that computes only the top half is no faster than computing the full product (unless someone knows one).
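For anyone following along, here is a minimal sketch in portable C of what MULSHIFT32 computes. The names mulshift32_ref, mulshift32_16 and mulshift32_check are made up for illustration. The second version assumes a multiplier that only returns the low 32 bits of a product (as MULS does on the M0+), so the high word is assembled from four 16 x 16 partial products. It is not claimed to be the fastest decomposition, only a correct one to compare an assembly version against.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: top 32 bits of the signed 64-bit product.
 * Assumes >> on a negative int64_t is an arithmetic shift (true for GCC/Clang). */
static int32_t mulshift32_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Same result using only 32-bit operations, built from 16 x 16 partial products.
 * Assumes two's complement conversion from uint32_t to int32_t. */
static int32_t mulshift32_16(int32_t x, int32_t y)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t xl = ux & 0xFFFFu, xh = ux >> 16;
    uint32_t yl = uy & 0xFFFFu, yh = uy >> 16;

    uint32_t ll = xl * yl;   /* bits  0..31 of the product */
    uint32_t lh = xl * yh;   /* bits 16..47 */
    uint32_t hl = xh * yl;   /* bits 16..47 */
    uint32_t hh = xh * yh;   /* bits 32..63 */

    /* Carry out of the low word: at most 3 * 0xFFFF, so it cannot overflow. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);
    uint32_t hi  = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    /* Convert the unsigned high word to the signed one. */
    if (x < 0) hi -= uy;
    if (y < 0) hi -= ux;
    return (int32_t)hi;
}

/* Small self-check comparing both versions on a few edge cases. */
static void mulshift32_check(void)
{
    static const int32_t v[] = { 0, 1, -1, 0x12345678, -0x12345678,
                                 INT32_MAX, INT32_MIN, 0x0000FFFF, 0x00010000 };
    const unsigned n = sizeof v / sizeof v[0];
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < n; j++)
            assert(mulshift32_16(v[i], v[j]) == mulshift32_ref(v[i], v[j]));
}
```

Written this way, the cost is visible: four multiplies plus the shifts, masks and sign fix-ups, all of which want low registers.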

In the above you will note that I need two pointer registers, i.e. buf and cptr. I DID consider using SP as the base address for cptr, but even going to those lengths I find myself running out of registers. I might add that r0-r4 are used by MULSHIFT32, so while I can use them between MULSHIFT32 calls, I'm only left with r5, r6 and r7, and at least one of them needs to be a pointer.

Do others simply define a stack frame, given that shuffling values Lo-Hi and Hi-Lo means that using the high registers is no faster?

I would LIKE to store the values 31 and 15 plus the 3 shift values (5 bits each) in a single register but, once again, unpacking them uses a low register.
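To make the packing idea concrete, here is a rough sketch in C. The field layout and the names pack5, field5 and pack5_check are hypothetical, chosen only for illustration: five 5-bit fields occupy the low 25 bits of one 32-bit word, and a field is pulled out with a shift and a mask.

```c
#include <assert.h>
#include <stdint.h>

#define FIELD_BITS 5u      /* each value fits in 5 bits (0..31) */
#define FIELD_MASK 0x1Fu

/* Pack the two index constants and the three shift counts into one word.
 * Hypothetical layout: field 0 = idx_a, 1 = idx_b, 2 = s0, 3 = s1, 4 = s2. */
static uint32_t pack5(uint32_t idx_a, uint32_t idx_b,
                      uint32_t s0, uint32_t s1, uint32_t s2)
{
    return  (idx_a & FIELD_MASK)
         | ((idx_b & FIELD_MASK) << (1u * FIELD_BITS))
         | ((s0    & FIELD_MASK) << (2u * FIELD_BITS))
         | ((s1    & FIELD_MASK) << (3u * FIELD_BITS))
         | ((s2    & FIELD_MASK) << (4u * FIELD_BITS));
}

/* Extract field n (0..4) from the packed word. */
static uint32_t field5(uint32_t packed, unsigned n)
{
    return (packed >> (n * FIELD_BITS)) & FIELD_MASK;
}

/* Round-trip check using the constants 31 and 15 and three example shifts. */
static void pack5_check(void)
{
    uint32_t p = pack5(31u, 15u, 1u, 2u, 3u);
    assert(field5(p, 0) == 31u);
    assert(field5(p, 1) == 15u);
    assert(field5(p, 2) == 1u);
    assert(field5(p, 3) == 2u);
    assert(field5(p, 4) == 3u);
}
```

On the M0+ each extraction would still need a low destination register for the shift and mask, which is exactly the cost described above, so this only pays off if the packed word itself can live in a high register and be copied down when needed.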

I HAVE spent a lot of time on this as I believe a technique to deal with this snippet will answer questions/problems seen all through the code.

Many thanks.
