Background

I'm working part-time on a Cortex-M0+ based SoC, converting a very processor-intensive section of C++ code (the inner loop is executed tens of thousands of times a second and compiles to over 400 instructions with GNU -O3). After almost 3 months of work I have realized that either using r13 (SP) as an additional address register OR using a lot of self-modifying code (SMC) is going to be needed to yield the efficiency that is vital to the whole project. I have sketched out both methods. If I can use r13 (SP), the inner loop is 299 instructions including 12 memory accesses, which IS fast enough. The problem with SMC is that I have to place literal pools within the code and branch around them, which increases the number of memory accesses and makes it more difficult to debug.

Problem

I have read through the 'Cortex-M0+ Devices Generic User Guide' and I'm still unclear whether the processor can be set up to use the MSP for the main code execution and the PSP for the interrupts/exceptions. No 'threading' as such is used; I'm a veteran 100% assembly language games programmer, so it's process & interrupts. Maybe I'm just too out of date or too stupid, but the intention is to convert around 15000 lines of C++ into pure assembly language, since the aim is to use the fewest resources possible.

Diagnosis

17 years of not coding is obviously a big hindrance but, happily, the sheer necessity to shave off cycles does get one's mind back into gear.

Prognosis

Thumb is an unusual RISC instruction set, and I think that I've programmed over a dozen 'in anger'. Tricks do pop up, but the learning curve is quite steep. I'm keen to use r13 if I can because it JUST fits, i.e. no ugly moving stuff to & from hi registers (that ADD Hi to Low, Low to High & High to High is surprisingly powerful), because I need to store up to 8 variables at any one time. Register use is thus:

r0-r4 - corrupted by 32-bit x 32-bit --> 64-bit macro (17 cycles)
r5-r6 - storage
r7 - destination base (8 ints are calculated)
r8-r12 - storage
r13 - source base (6 ints are used)
r14 - storage

I don't know if people still love the sight of elegant assembly language, but the solution does LOOK very tidy, with no recourse to clumsy instruction/cycle wasting to get around the fact that, for a RISC core, it isn't as orthogonal as most. I read that the designers looked at the SH2, and I've written tens of thousands of lines of that particular flavour for various Sega platforms, which might be of some help to me.
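For anyone wondering why the 32-bit x 32-bit --> 64-bit macro in the register list eats five registers: the M0+ MULS instruction only returns the low 32 bits of a product, so the 64-bit result has to be assembled from 16-bit x 16-bit partial products plus carries. Below is a rough C++ sketch of that arithmetic, assuming an unsigned multiply. It is only an illustration, not my actual 17-cycle macro, and the name `mul32x32_64` is made up for this post.

```cpp
#include <cstdint>

// Hypothetical helper (name invented for illustration): unsigned
// 32x32 -> 64 multiply built only from 16x16 -> 32 partial products,
// mirroring what has to be done when the multiplier returns 32 bits.
static inline uint64_t mul32x32_64(uint32_t a, uint32_t b) {
    // Split both operands into 16-bit halves.
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    // Four partial products; each fits in 32 bits.
    uint32_t ll = a_lo * b_lo;   // weight 2^0
    uint32_t lh = a_lo * b_hi;   // weight 2^16
    uint32_t hl = a_hi * b_lo;   // weight 2^16
    uint32_t hh = a_hi * b_hi;   // weight 2^32

    // Sum of the two cross terms can overflow 32 bits; keep the carry.
    uint32_t mid = lh + hl;
    uint32_t mid_carry = (mid < lh) ? 1u : 0u;

    // Low word: ll plus the low half of the cross sum, with carry out.
    uint32_t lo = ll + (mid << 16);
    uint32_t lo_carry = (lo < ll) ? 1u : 0u;

    // High word: hh plus the high half of the cross sum plus both carries.
    uint32_t hi = hh + (mid >> 16) + (mid_carry << 16) + lo_carry;

    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

In assembly the four partial products and the two carries are what occupy r0-r4; a signed version would need extra correction steps that are not shown here.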
PS Interesting to see that things like the Huffman encoding have been designed with custom silicon in mind. Why not simply use a hardware solution? Because I want to support ACELP and potentially allow upgrades to add things like RCELP & other codecs with highly asymmetrical encode/decode complexity when their patents run out. At all times, cost & power consumption are key. Yes, a Fitbit retailing at £100 can simply use an M4, but my target is <$5.
BTW DOI: 10.1109/CCECE.2008.4564625 'REAL TIME IMPLEMENTATION AND OPTIMIZATION OF MP3 DECODER ON DSP' by Benix Samuel, Ashok Jhunjhunwala (Indian Institute of Technology Madras, India) is a very useful document. I was relieved to discover that their profile of CPU bandwidth matched my own. Being based on the Blackfin DSP, it only supports 16-bit x 16-bit MUL & MAC and thus yields similar performance to an M0+. a complexity of 24.5 MIPS, so JUST possible. 23.85 kbit/s ACELP is listed as <40 MIPS, but that uses an ungainly method to add the higher frequencies. In that case, the speed at which the tables are generated is the limiting factor and, AFAIK, there is no DSP support for the maths involved, thus a processor like the M0+ will do well, especially given that encode/decode complexity is highly asymmetrical.