M0+ Stack Pointer (PSP/MSP) Clarification

Background

I'm working part-time on a Cortex-M0+ based SoC, converting a very processor-intensive section of C++ code to assembly. The inner loop executes tens of thousands of times a second and compiles to over 400 instructions with the GNU compiler at -O3. After almost 3 months of work I have realized that I will need either to use r13 (SP) as an additional address register or to use a lot of self-modifying code (SMC) to get the efficiency that is vital to the whole project. I have sketched out both methods. If I can use r13 (SP), the inner loop is 299 instructions including 12 memory accesses, which IS fast enough. The problem with SMC is that I have to place literal pools within the code and branch around them, which increases the number of memory accesses and makes the code more difficult to debug.

Problem

I have read through the 'Cortex-M0+ Devices Generic User Guide' and I'm still unclear whether the processor can be set up to use the MSP for the main code execution and the PSP for the interrupts/exceptions. No 'threading' as such is used; I'm a veteran 100% assembly language games programmer, so it's just a main process and interrupts. Maybe I'm just too out of date or too stupid, but the plan is to convert around 15000 lines of C++ into pure assembly language in order to use the fewest resources possible.

Diagnosis

Seventeen years of not coding is obviously a big hindrance but, happily, the sheer necessity of shaving off cycles does get one's mind back into gear.

Prognosis

Thumb is an unusual RISC instruction set. I think I've programmed over a dozen of them 'in anger', and tricks do pop up, but the learning curve is quite steep. I'm keen to use r13 if I can because then everything JUST fits, i.e. no ugly moving of values to and from the high registers (although that ADD which works Hi to Low, Low to High and High to High is surprisingly powerful). I need to store up to 8 variables at any one time, and register use is thus:

r0-r4 - corrupted by 32-bit x 32-bit --> 64-bit macro (17 cycles)
r5-r6 - storage
r7 - destination base (8 ints are calculated)
r8-r12 - storage
r13 - source base (6 ints are used)
r14 - storage
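
For readers unfamiliar with why the multiply ties up r0-r4: the Cortex-M0+ MULS instruction keeps only the low 32 bits of a product, so a 32-bit x 32-bit --> 64-bit result has to be assembled from 16-bit partial products. The macro itself is not shown here, so the following is only a C sketch of the arithmetic such a routine typically performs. The function name `umul32x32_64` is hypothetical, and treating the operands as unsigned is an assumption.

```c
#include <stdint.h>

/* Sketch only: 32 x 32 -> 64-bit unsigned multiply built from four
 * 16 x 16 -> 32-bit products, i.e. using nothing wider than the
 * 32-bit result that MULS provides. Name and unsignedness are
 * assumptions, not taken from the original macro. */
static uint64_t umul32x32_64(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Each partial product of two 16-bit values fits in 32 bits. */
    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    /* Bits 16..31 of the result: the upper half of ll plus the lower
     * halves of both cross terms. The sum is at most 3 * 0xFFFF, so it
     * cannot overflow 32 bits; its upper half is the carry into bit 32. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    return ((uint64_t)hi << 32) | lo;
}
```

Four multiplies plus the shifts, masks and carry handling make it plausible that a hand-written version needs five low registers as scratch, since most Thumb data-processing instructions on this core only accept r0-r7.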

I don't know if people still love the sight of elegant assembly language, but the solution does LOOK very tidy, with no need to waste instructions and cycles working around the fact that, for a RISC core, it isn't as orthogonal as most. I read that the designers looked at the SH2, and I've written tens of thousands of lines of that particular flavour for various Sega platforms, which might be of some help to me.