Background

I'm working part-time on a Cortex-M0+ based SoC, converting a very processor-intensive section of C++ code (the inner loop executes tens of thousands of times a second and compiles to over 400 instructions using GNU -O3). After almost 3 months of work I have realized that using r13 (SP) as an additional address register OR using a lot of self-modifying code is going to be needed to yield the efficiency that is vital to the whole project. I have sketched out both methods: if I can use r13 (SP) then the inner loop is 299 instructions including 12 memory accesses, which IS fast enough. The problem with SMC is that I have to place literal pools within the code and branch around them, which of course increases the number of memory accesses and of course makes it more difficult to debug.

Problem

I have read through the 'Cortex-M0+ Devices Generic User Guide' and I'm still unclear whether the processor can be set up to use the MSP for the main code execution and the PSP for the interrupts/exceptions. No 'threading' as such is used; I'm a veteran 100% assembly language games programmer, so it's process & interrupts. Maybe I'm just too out of date or too stupid, but the intention is to convert around 15,000 lines of C++ into pure assembly language, since the intention is to use the fewest resources possible.

Diagnosis

17 years of not coding is obviously a big hindrance but happily, the sheer necessity to shave off cycles does get one's mind back into gear.

Prognosis

Thumb is an unusual RISC instruction set and I think that I've programmed over a dozen 'in anger'; tricks do pop up, but the learning curve is quite steep. I'm keen to use r13 if I can because it JUST fits, i.e. no ugly moving of stuff to & from hi registers (that ADD High to Low, Low to High & High to High is surprisingly powerful), because I need to store up to 8 variables at any one time. Register use is thus:

r0-r4 - corrupted by 32-bit x 32-bit --> 64-bit macro (17 cycles)
r5-r6 - storage
r7 - destination base (8 ints are calculated)
r8-r12 - storage
r13 - source base (6 ints are used)
r14 - storage

I don't know if people still love the sight of elegant assembly language, but the solution does LOOK very tidy, with no recourse to clumsy instruction/cycle wasting to get around the fact that, for a RISC core, it isn't as orthogonal as most. I read that the designers looked at the SH2, and I've written tens of thousands of lines of that particular flavour for various Sega platforms, which might be of some help to me.
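For anyone following along: the 17-cycle macro mentioned above has to build a 64-bit product out of the M0+'s MULS, which returns only the low 32 bits of a 32x32 multiply. A minimal C sketch of the partial-product scheme such a macro implements (the function name is mine; this illustrates the technique, not the actual macro):

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32x32 -> 64-bit product from multiplies that, like MULS on
 * the M0+, only return the low 32 bits: four 16x16 partial products. */
static uint64_t mul32x32_64(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll = al * bl;   /* contributes to bits  0..31 */
    uint32_t lh = al * bh;   /* contributes to bits 16..47 */
    uint32_t hl = ah * bl;   /* contributes to bits 16..47 */
    uint32_t hh = ah * bh;   /* contributes to bits 32..63 */

    uint64_t mid = (uint64_t)lh + hl;   /* sum may carry past bit 31 */
    return ((uint64_t)hh << 32) + (mid << 16) + ll;
}
```

In assembly the same dataflow explains the register pressure: the four partial products plus the carry handling is exactly why a block of low registers gets corrupted by the macro.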
Joseph, he writes pure 100% native assembly. So I guess it's all in his hands ;-)
Very true :-)
Thank you all for all of your input. It is my own fault - PRESUMPTION! Since this project is going to have very little head-room, I'm going to use the stack-frame in the manner I mentioned. I have worked out a better way to use self-modifying code (it changes the #<imm> offsets rather than the literal pool) BUT it then limits the code to sitting in RAM and obviously, I don't want to presume any more.

I have developed a plan of attack for the larger problem (the entire decoder in Thumb). I'm using the C++ code on an M3-based SoC and, routine by routine, I will convert the code into Thumb. That means I can debug smaller chunks of code more easily. There is clearly sufficient head-room for the SDRAM DMAs and so forth, so it makes things a lot easier.

Bastian, glad to hear from you, you're a diamond. J - thanks as always and I will press on and see if we can get this working.

Sean
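To illustrate the offset-patching idea (a sketch of the general technique, with a made-up function name, not the project's code): a Thumb LDR Rt,[Rn,#imm] is encoded as 0110 1 iiiii nnn ttt, so the 5-bit, word-scaled offset lives in bits 10:6 of the halfword, and rewriting it in place is what forces the code into RAM:

```c
#include <assert.h>
#include <stdint.h>

/* Patch the imm5 field (bits 10:6) of a Thumb LDR Rt,[Rn,#imm5<<2].
 * byte_offset must be word-aligned and in the 0-124 range the
 * encoding allows. */
static uint16_t patch_ldr_offset(uint16_t insn, uint32_t byte_offset)
{
    uint16_t imm5 = (uint16_t)((byte_offset >> 2) & 0x1Fu);
    return (uint16_t)((insn & ~(0x1Fu << 6)) | (imm5 << 6));
}
```

On real hardware the store has to be synchronised before the patched instruction is fetched (an ISB on Armv6-M), and, as noted above, this only works with the code sitting in RAM.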
You're welcome.
By the way, one thing about Armv6-M and Armv8-M Baseline
>LDR R0,[R13,#1024] <- offset too large ;-)
>LDR R1,[R13,#1020]
>LDR R2,[R13,#1016]
I think the maximum offset value is 1020 bytes. For Armv7-M and Armv8-M Mainline the offset can be much larger.
Sorry J, as you say, the Thumb v6 Quick Reference gives an 'imm range of 0-124 in multiples of 4'. Not a problem. I'm going to simply use a 32-int stack-frame and copy the data in & out. Considering that the 2 passes & filtering of each 32-sample block currently take 3100 cycles, I can manage this.

I've realized I need to produce the most general solution possible, thus no SMC, and the smallest RAM footprint possible. Bastian suggested I unroll the first pass ages ago, thus 8 lines of code that use an 18-instruction macro unroll to 762 lines of assembly language.

Do you think I should retain MP1 & MP2 functionality? I'm going to retain the stereo options, even though a stock M0+ isn't likely to produce high-quality music, as well as the 32/44.1/48 kHz playback options, since they don't take much more space, BUT the MP1 & MP2 is a significant chunk...
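For reference, the two LDR immediate forms being discussed can be captured in a couple of lines (a hedged sketch; the predicate names are mine). Armv6-M Thumb encodes LDR Rt,[Rn,#imm5<<2] with a 5-bit word-scaled immediate (byte offsets 0-124) and LDR Rt,[SP,#imm8<<2] with an 8-bit one (0-1020):

```c
#include <assert.h>
#include <stdint.h>

/* LDR (immediate), encoding T1: 5-bit immediate, word-scaled. */
static int ldr_rn_offset_ok(uint32_t off) { return (off % 4 == 0) && (off <= 124); }

/* LDR (SP-relative), encoding T1: 8-bit immediate, word-scaled. */
static int ldr_sp_offset_ok(uint32_t off) { return (off % 4 == 0) && (off <= 1020); }
```

So the 1024-byte offset in the earlier example fails the SP-relative form by one slot, which is exactly the ;-) above.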
Hi Sean,
Given that MP3 encoder software is now widely available, I don't think there is a strong need for MP1 and MP2. If there is an audio file in MP1/MP2, converting it to MP3 on a PC before upload shouldn't be a big issue. And as you know, the flash memory space might not be big enough to support multiple formats anyway.
regards,
Joseph
Joseph Yiu said:
> Given that now MP3 encoder software is widely available I don't think there is a strong need for MP1
The question is not encoding, but decoding. But I do not know the technique behind MP1/MP2 coding. I also think that if the encoder knows about the drawbacks of the decoder, it can produce a file which suits the decoder best.
Hi Bastian,
Sorry for not being clear. Let me explain.
MP3 decoding on microcontrollers has been around for many years; since the Cortex-M3 was released, people have been using it for MP3 decoding in low-end audio devices. MP3 encoding, however, is more challenging for older microcontrollers, so audio recording could only be performed with MP2; as a result, those devices support both audio formats, so that they can play back recorded audio.
Some ancient PC-based audio software also only supports MP2 encoding. (I am referring to the really old ones... yes, I am that old.) So audio devices that use the recorded audio need to support MP2 playback. Now you can get free audio editing software like Audacity which supports all major formats including MP3.
In Sean's project (from my understanding) it is playback only, with no recording. Given that you can use free software to convert audio into MP3 before downloading it to the playback device, I don't see the need to support MP1/MP2 decoding.
Hope this explained my view.
I appreciate everyone's interest. It's certainly proving to be an interesting project because it really does allow the M0+ to operate at virtually 1 MIP/MHz. I presume values >1 seen in some documents refer to compilers removing dead instructions, since the M0+ CPU doesn't appear to have any asynchronous instructions. IF I include low-bandwidth encoding, it won't be in real time. I'm mulling over the use of ADPCM for real-time encoding; the device can then decode blocks to be encoded in MP3 or ACELP format.

Mainly I'm thinking of teachers. The key thought is to find a model that does not require an established infrastructure: a simple, low-cost, low-energy device that brings audiobooks to everyone on earth. It's a small thing, but it empowers people.

I should also add that I think the IoT revolution needs its 'killer app' and, considering the cost & power consumption of the M0+, MP3 & ACELP, i.e. low-cost audio, is potentially that app. I want something that makes the M0+ THE baseline CPU for the (powered) IoT, because then we can proceed without recourse to dirty solutions like the hideous Javacard. I'm hoping that PragmatIC can produce very cheap ROMs that can upload using Bluetooth energy harvesting, i.e. back-scatter. If a library of audiobooks doesn't even need power, it's available where most everything else is not.

Sean
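For anyone curious about the ADPCM route: a minimal IMA-style 4-bit ADPCM codec looks like the sketch below (standard IMA step/index tables, but simplified control flow; the names are mine and this illustrates the technique, not project code). Encoder and decoder share one reconstruction routine, which is what keeps their predictors in lock-step:

```c
#include <assert.h>
#include <stdint.h>

/* Standard IMA ADPCM quantizer step sizes and index adjustments. */
static const int16_t step_table[89] = {
        7,     8,     9,    10,    11,    12,    13,    14,    16,    17,
       19,    21,    23,    25,    28,    31,    34,    37,    41,    45,
       50,    55,    60,    66,    73,    80,    88,    97,   107,   118,
      130,   143,   157,   173,   190,   209,   230,   253,   279,   307,
      337,   371,   408,   449,   494,   544,   598,   658,   724,   796,
      876,   963,  1060,  1166,  1282,  1411,  1552,  1707,  1878,  2066,
     2272,  2499,  2749,  3024,  3327,  3660,  4026,  4428,  4871,  5358,
     5894,  6484,  7132,  7845,  8630,  9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};
static const int8_t index_adjust[8] = { -1, -1, -1, -1, 2, 4, 6, 8 };

typedef struct { int32_t predictor; int index; } adpcm_state;

/* Apply one 4-bit code to the state; used by BOTH encoder and decoder,
 * so the decoder reconstructs exactly the encoder's predictor. */
static int16_t adpcm_step(adpcm_state *s, uint8_t code)
{
    int32_t step = step_table[s->index];
    int32_t diff = step >> 3;
    if (code & 1) diff += step >> 2;
    if (code & 2) diff += step >> 1;
    if (code & 4) diff += step;
    if (code & 8) diff = -diff;

    s->predictor += diff;
    if (s->predictor >  32767) s->predictor =  32767;
    if (s->predictor < -32768) s->predictor = -32768;

    s->index += index_adjust[code & 7];
    if (s->index < 0)  s->index = 0;
    if (s->index > 88) s->index = 88;
    return (int16_t)s->predictor;
}

/* Quantize one sample against the current prediction into a 4-bit code. */
static uint8_t adpcm_encode_sample(adpcm_state *s, int16_t sample)
{
    int32_t step = step_table[s->index];
    int32_t diff = sample - s->predictor;
    uint8_t code = 0;
    if (diff < 0) { code = 8; diff = -diff; }
    if (diff >= step)      { code |= 4; diff -= step; }
    if (diff >= step >> 1) { code |= 2; diff -= step >> 1; }
    if (diff >= step >> 2) { code |= 1; }
    adpcm_step(s, code);   /* track exactly what the decoder will see */
    return code;
}
```

At 4 bits per sample this is trivially cheap on an M0+ (no multiplies at all in the inner loop), which is why it is attractive for the real-time capture side even when MP3/ACELP encoding of the stored blocks happens later, offline.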
PS Interesting to see that things like the Huffman encoding have been designed with custom silicon in mind. Why not simply use a hardware solution? Because I want to support ACELP and potentially allow upgrades to add things like RCELP & other highly asymmetrical encode/decode-complexity codecs when their patents run out. At all times, cost & power consumption are key. Yes, a Fitbit retailing at £100 can simply use an M4, but my target is <$5.
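On the Huffman point: the usual software counter to hardware-oriented bitstream decoding is a direct lookup table indexed by the next few bits, so each symbol costs one table load instead of a bit-by-bit tree walk. A toy sketch (the code table here is invented for illustration; MP3's real tables are larger and use escape codes):

```c
#include <assert.h>
#include <stdint.h>

#define MAXLEN 3   /* longest code in this toy table */

typedef struct { uint8_t symbol; uint8_t length; } hentry;

/* Example prefix code: A=0, B=10, C=110, D=111.  Every MAXLEN-bit
 * pattern maps to the unique code that prefixes it. */
static hentry lut[1 << MAXLEN];

static void build_lut(void)
{
    for (int i = 0; i < (1 << MAXLEN); i++) {
        if      ((i >> 2) == 0) lut[i] = (hentry){ 'A', 1 }; /* 0xx */
        else if ((i >> 1) == 2) lut[i] = (hentry){ 'B', 2 }; /* 10x */
        else if (i == 6)        lut[i] = (hentry){ 'C', 3 }; /* 110 */
        else                    lut[i] = (hentry){ 'D', 3 }; /* 111 */
    }
}

/* Decode symbols from 'nbits' bits held MSB-first in 'bits': peek
 * MAXLEN bits, one table load, consume only the matched length. */
static int huff_decode(uint32_t bits, int nbits, uint8_t *out, int nmax)
{
    int n = 0, pos = nbits;
    while (n < nmax && pos >= MAXLEN) {
        uint32_t peek = (bits >> (pos - MAXLEN)) & ((1u << MAXLEN) - 1u);
        hentry e = lut[peek];
        out[n++] = e.symbol;
        pos -= e.length;
    }
    return n;
}
```

The sketch stops when fewer than MAXLEN bits of lookahead remain, so a real decoder pads the tail of the stream; the trade is table RAM (2^MAXLEN entries) for a constant per-symbol cost, which suits the M0+ well.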
BTW, DOI: 10.1109/CCECE.2008.4564625, 'REAL TIME IMPLEMENTATION AND OPTIMIZATION OF MP3 DECODER ON DSP' by Benix Samuel and Ashok Jhunjhunwala (Indian Institute of Technology Madras, India), is a very useful document. I was relieved to discover that their profile of CPU bandwidth matched my own. Being based on the Blackfin DSP, which only supports 16-bit x 16-bit MUL & MAC, it yields similar performance to an M0+: a complexity of 24.5 MIPS, so JUST possible. 23.85 kbit/s ACELP is listed as <40 MIPS, but that uses an ungainly method to add the higher frequencies. In that case, the speed at which the tables are generated is the limiting factor, and AFAIK there is no DSP support for the maths involved, thus a processor like the M0+ will do well, especially given that encode/decode complexity is highly asymmetrical.