Background

I'm working part-time on a Cortex-M0+ based SoC, converting a very processor-intensive section of C++ code (the inner loop is executed tens of thousands of times a second and compiles to over 400 instructions with GNU -O3). After almost 3 months of work I have realized that either using r13 (SP) as an additional address register OR using a lot of self-modifying code (SMC) is going to be needed to yield the efficiency that is vital to the whole project. I have sketched out both methods. If I can use r13 (SP), the inner loop is 299 instructions including 12 memory accesses, which IS fast enough. The problem with SMC is that I have to place literal pools within the code and branch around them, it increases the number of memory accesses, and it is more difficult to debug.

Problem

I have read through the 'Cortex-M0+ Devices Generic User Guide' and I'm still unclear whether the processor can be set up to use the MSP for the main code execution and the PSP for the interrupts/exceptions. No 'threading' as such is used. I'm a veteran 100% assembly language games programmer, so it's process & interrupts. Maybe I'm just too out of date or too stupid, but the intention is to convert around 15000 lines of C++ into pure assembly language, since the aim is to use the fewest resources possible.

Diagnosis

17 years of not coding is obviously a big hindrance but, happily, the sheer necessity to shave off cycles does get one's mind back into gear.

Prognosis

Thumb is an unusual RISC instruction set. I think I've programmed over a dozen instruction sets 'in anger' and tricks do pop up, but the learning curve is quite steep. I'm keen to use r13 if I can because it JUST fits, i.e. no ugly moving stuff to & from hi registers (that ADD Hi to Low, Low to High & High to High is surprisingly powerful). I need to store up to 8 variables at any one time and register use is thus:

r0-r4 - corrupted by 32-bit x 32-bit --> 64-bit macro (17 cycles)
r5-r6 - storage
r7 - destination base (8 ints are calculated)
r8-r12 - storage
r13 - source base (6 ints are used)
r14 - storage

I don't know if people still love the sight of elegant assembly language, but the solution does LOOK very tidy, with no recourse to clumsy instruction/cycle wasting to get around the fact that, for a RISC core, it isn't as orthogonal as most. I read that the designers looked at the SH2, and I've written tens of thousands of lines of that particular flavour for various Sega platforms, which might be of some help to me.
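For anyone wondering why the 32-bit x 32-bit --> 64-bit macro ties up r0-r4: the M0+ MULS instruction only returns the low 32 bits of a product, so the 64-bit result has to be assembled from partial products of 16-bit halves. The C below is only a reference model of that decomposition, not the 17-cycle macro itself (which isn't shown here). The names `u64_pair`, `umul32x32` and `umul32x32_check` are made up for illustration, and it assumes an unsigned multiply; a signed version would need a correction step on the high word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: a 64-bit value held as two 32-bit words,
   the way it would sit in a register pair. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Unsigned 32x32 -> 64 multiply using only multiplies whose operands
   fit in 16 bits, so every partial product fits in 32 bits. */
static u64_pair umul32x32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll = al * bl;   /* weight 2^0  */
    uint32_t lh = al * bh;   /* weight 2^16 */
    uint32_t hl = ah * bl;   /* weight 2^16 */
    uint32_t hh = ah * bh;   /* weight 2^32 */

    /* Add the two middle terms; a carry out of bit 31 is worth 2^48. */
    uint32_t mid = lh + hl;
    uint32_t mid_carry = (mid < lh) ? 1u : 0u;

    u64_pair r;
    r.lo = ll + (mid << 16);
    uint32_t lo_carry = (r.lo < ll) ? 1u : 0u;
    r.hi = hh + (mid >> 16) + (mid_carry << 16) + lo_carry;
    return r;
}

/* Returns 1 when the model agrees with the compiler's 64-bit multiply. */
static int umul32x32_check(uint32_t a, uint32_t b)
{
    uint64_t ref = (uint64_t)a * (uint64_t)b;
    u64_pair r = umul32x32(a, b);
    return r.lo == (uint32_t)ref && r.hi == (uint32_t)(ref >> 32);
}
```

Four multiplies plus the carry handling is roughly where the temporaries go. A hand-written Thumb version would normally fold the two carry tests into ADDS/ADCS rather than comparing, as the C has to.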
I appreciate everyone's interest. It's certainly proving to be an interesting project because it really does allow the M0+ to operate at virtually 1 MIP/MHz. I presume values >1 seen in some documents refer to compilers removing dead instructions, since the M0+ CPU doesn't appear to have any asynchronous instructions. IF I include low bandwidth encoding, it won't be in real time. I'm mulling over the use of ADPCM for real time encoding, and the device can then decode blocks to be encoded in MP3 or ACELP format i.e. . Mainly I'm thinking of teachers. The key thought is to find a model that does not require an established infrastructure: a simple, low-cost, low-energy device that brings audiobooks to everyone on earth. It's a small thing, but it empowers people.

I should also add that I think the IoT revolution needs its 'killer app' and, considering the cost & power consumption of the M0+, MP3 & ACELP (i.e. low cost audio) is potentially that app. I want something that makes the M0+ THE baseline CPU for the (powered) IoT, because then we can proceed without recourse to dirty solutions like the hideous Javacard. I'm hoping that PragmatIC can produce very cheap ROMs that can upload using Bluetooth energy harvesting, i.e. back-scatter. If a library of audiobooks doesn't even need power, it's available where most everything else is not.

Sean
PS Interesting to see that things like the Huffman encoding have been designed with custom silicon in mind. Why not simply use a hardware solution? Because I want to support ACELP and potentially allow upgrades to add things like RCELP & other codecs with highly asymmetrical encode/decode complexity when their patents run out. At all times cost & power consumption are key. Yes, a Fitbit retailing at £100 can simply use an M4, but my target is <$5.
BTW DOI: 10.1109/CCECE.2008.4564625 'REAL TIME IMPLEMENTATION AND OPTMIZATION OF MP3 DECODER ON DSP' by Benix Samuel, Ashok Jhunjhunwala (Indian Institute of Technology Madras, India) is a very useful document. I was relieved to discover that their profile of CPU bandwidth matched my own. Being based on the Blackfin DSP, it only supports 16-bit x 16-bit MUL & MAC and thus yields similar performance to an M0+. A complexity of 24.5 MIPS, so JUST possible. 23.85 kbit/s ACELP is listed as <40 MIPS, but that uses an ungainly method to add the higher frequencies. In that case, the speed at which the tables are generated is the limiting factor, and AFAIK there is no DSP support for the maths involved, so a processor like the M0+ will do well, especially given that encode/decode complexity is highly asymmetrical.