Background

I'm working part-time on a Cortex-M0+ based SoC, converting a very processor-intensive section of C++ code (an inner loop executed tens of thousands of times a second, which compiles to over 400 instructions under GNU -O3). After almost three months of work I have realized that using r13 (SP) as an additional address register, OR using a lot of self-modifying code, is going to be needed to yield the efficiency that is vital to the whole project. I have sketched out both methods: if I can use r13 (SP), the inner loop is 299 instructions including 12 memory accesses, which IS fast enough. The problem with SMC is that I have to place literal pools within the code and branch around them, which of course increases the number of memory accesses and makes the code more difficult to debug.

Problem

I have read through the 'Cortex-M0+ Devices Generic User Guide' and I'm still unclear whether the processor can be set up to use the MSP for the main code execution and the PSP for interrupts/exceptions. No 'threading' as such is used; I'm a veteran 100% assembly-language games programmer, so it's one process plus interrupts. Maybe I'm just too out of date or too stupid, but the intention is to convert around 15,000 lines of C++ into pure assembly language, since the aim is to use the fewest resources possible.

Diagnosis

17 years of not coding is obviously a big hindrance but, happily, the sheer necessity of shaving off cycles does get one's mind back into gear.

Prognosis

Thumb is an unusual RISC instruction set. I think I've programmed over a dozen 'in anger' and tricks do pop up, but the learning curve is quite steep. I'm keen to use r13 if I can because it JUST fits, i.e. no ugly shuffling of values to and from the high registers (that ADD high-to-low, low-to-high and high-to-high is surprisingly powerful). I need to store up to 8 variables at any one time, and register use is thus:

r0-r4 - corrupted by the 32-bit x 32-bit --> 64-bit multiply macro (17 cycles)
r5-r6 - storage
r7 - destination base (8 ints are calculated)
r8-r12 - storage
r13 - source base (6 ints are used)
r14 - storage

I don't know if people still love the sight of elegant assembly language, but the solution does LOOK very tidy, with no recourse to clumsy instruction/cycle wasting to get around the fact that, for a RISC core, Thumb isn't as orthogonal as most. I read that the designers looked at the SH2; I've written tens of thousands of lines of that particular flavour for various Sega platforms, which might be of some help to me.
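For readers following along: Armv6-M only has MULS (32 x 32 -> low 32 bits), which is presumably why the 17-cycle macro above has to synthesize the full 64-bit product from 16-bit halves. A minimal C sketch of that decomposition (the function name is mine, not from the thread):

```c
#include <stdint.h>

/* 32x32 -> 64-bit unsigned multiply built from 16x16 partial
   products, mirroring what a Thumb-1 macro must do with MULS. */
uint64_t umull32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll  = al * bl;               /* bits  0..31          */
    uint64_t mid = (uint64_t)(al * bh)
                 + (uint64_t)(ah * bl);   /* bits 16..48 + carry  */
    uint64_t hh  = (uint64_t)(ah * bh);   /* bits 32..63          */

    return (hh << 32) + (mid << 16) + ll;
}
```

Each 16x16 partial product fits in 32 bits, so the only widening needed is when summing the two middle terms, which is where the assembly version spends most of its cycles managing carries.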
Unfortunately you cannot use SP for general data storage. When an interrupt takes place, the stacking of R0-R3, R12, LR, the return address and xPSR is done using the currently selected stack pointer as the stack location. So even if you switch over to using the PSP in thread mode, the stack pointer used for stacking (on the transition from thread to handler) is the PSP, and the MSP is used while the handlers are executing.
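The eight words the Armv6-M hardware pushes on exception entry can be pictured as a C struct (the layout is per the architecture; the type name is mine):

```c
#include <stdint.h>

/* Armv6-M exception entry pushes these eight words onto the
   currently selected stack, lowest address first: 32 bytes,
   so every interrupt clobbers at least this much memory below
   whatever SP happens to point at. */
typedef struct {
    uint32_t r0, r1, r2, r3;
    uint32_t r12;
    uint32_t lr;              /* LR value at the point of interrupt */
    uint32_t return_address;
    uint32_t xpsr;
} ExceptionFrame;
```

This is why SP cannot double as a data pointer: the hardware will write this frame relative to whatever value SP holds when the interrupt fires.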
However, you can use LR as temporary storage if that helps.
Heck. Well, I'm using SP as the source pointer (*cptr), but if I swap them then can I simply use, for example:

LDR R0,[R13,#1024]
LDR R1,[R13,#1020]
LDR R2,[R13,#1016]

and so on, and simply put the stack below the buffer? I appreciate that I could simply use a stack frame and copy the 32 ints (4 loops carried out on 8-int groups) using the fast LDM!/STM! instructions, but I'm obviously looking to use the fewest resources possible, and the interrupts won't use much more stack. Right now, cycles are the scarcest resource, and since an int-to-short conversion into an output buffer is required, I can use that space below. It may mean 2 copies of this routine, which is about 1K (an inner loop with 299 instructions!).

As I'm sure you can guess, the C++ version uses dest = *cptr++;, but Thumb has the powerful [<reg>,#<imm>] addressing mode. As it is, the workspace is on a 128-byte boundary, so instead of using a loop counter, I update the pointer and use an LSRS Rd,R13,#7 to check whether it has completed 4 passes... I will rewrite again and see if I can do this AND remove a few more cycles. Many thanks J!
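The pointer-alignment trick described above (workspace on a 128-byte boundary, so 4 passes of 8 ints bring the pointer back to alignment) can be sketched in C; the names and the summing payload are mine, purely for illustration:

```c
#include <stdint.h>

/* Process 32 ints in 4 passes of 8, terminating on pointer
   alignment instead of a loop counter: 4 * 8 * 4 = 128 bytes,
   so the pointer is 128-byte aligned again exactly at the end.
   This is the condition an LSRS #7 of the pointer can test in
   Thumb (low 7 bits shifted out, last one into the carry). */
int32_t sum_block(const int32_t *p)   /* p must be 128-byte aligned */
{
    int32_t acc = 0;
    do {
        for (int i = 0; i < 8; i++)   /* one unrolled 8-int pass */
            acc += *p++;
    } while (((uintptr_t)p & 127u) != 0);
    return acc;
}

/* demo workspace, 128-byte aligned (C11 _Alignas) */
_Alignas(128) int32_t demo[32];
```

Replacing the counter with an alignment test frees a register, which matters when all eight low registers are spoken for.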
Sean Dunlevy said:
Heck. Well, I'm using SP as the source pointer (*cptr), but if I swap them then can I simply use, for example:
LDR R0,[R13,#1024]
LDR R1,[R13,#1020]
LDR R2,[R13,#1016]
and so on, and simply put the stack below the buffer?
Sean, this idea is brilliant!
Unless you can disable interrupts during the subroutine. But then, locking/unlocking is also costly.
There might not be any need to disable interrupts, as the interrupt handlers will use the memory space below the current stack pointer.
This method can work, but trying to set up the buffer allocation inside the stack could be tricky, as this is normally handled by the toolchain. In the code fragment you used a fixed offset, but the offset is not known until the linker has placed the data into the stack allocation.
Joseph, he writes pure 100% native assembly. So I guess it's all in his hands ;-)
Very true :-)
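Since everything here is hand-written assembly, the layout Joseph mentions can indeed be pinned by hand rather than left to the toolchain. A C sketch of the idea, with hypothetical names and sizes: reserve one region, point the initial SP at its middle, and address the buffer with positive SP-relative offsets while interrupts push below.

```c
#include <stdint.h>

/* One hand-placed region: the low half is interrupt stack
   (grows downward from the initial SP), the high half is the
   data buffer reached with positive [SP, #imm] offsets.
   Sizes are illustrative, not from the thread. */
#define STACK_WORDS  64
#define BUFFER_WORDS 32

static uint32_t region[STACK_WORDS + BUFFER_WORDS];

/* Initial SP value: top of the stack half = bottom of the buffer,
   so buffer[i] is reachable as [SP, #4*i]. */
uint32_t *const initial_sp = &region[STACK_WORDS];
uint32_t *const buffer     = &region[STACK_WORDS];

/* byte offset to use in LDR Rt,[SP,#imm] for buffer[i] */
uint32_t sp_offset(uint32_t i) { return i * 4u; }
```

Because the addresses are chosen by hand, the fixed offsets are known at assembly time rather than at link time.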
Thank you all for your input. It is my own fault - PRESUMPTION! Since this project is going to have very little headroom, I'm going to use the stack frame in the manner I mentioned. I have worked out a better way to use self-modifying code (it changes the #<imm> offsets rather than the literal pool), BUT it then limits the code to sitting in RAM and, obviously, I don't want to presume any more.

I have developed a plan of attack for the larger problem (the entire decoder in Thumb). I'm using the C++ code on an M3-based SoC and, routine by routine, I will convert the code into Thumb. That means I can debug smaller chunks of code more easily. There is clearly sufficient headroom for the SDRAM DMAs and so forth, so it makes things a lot easier.

Bastian, glad to hear from you, you're a diamond. J - thanks as always; I will press on and see if we can get this working.

Sean
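For what it's worth, the offset-patching flavour of SMC mentioned above amounts to rewriting the imm5 field (bits 10-6) of the 16-bit Thumb LDR/STR (immediate) encoding, which is exactly why the patched code must live in RAM. A C sketch of the patch, operating on a copy of the halfword (function name mine):

```c
#include <stdint.h>

/* Thumb LDR Rt,[Rn,#imm5*4] (encoding T1): 01101 imm5 Rn Rt.
   Replace the word-offset field of an existing encoding; on
   real hardware the patched instruction must sit in RAM and
   byte_offset must be a multiple of 4 no greater than 124. */
uint16_t patch_ldr_offset(uint16_t insn, uint32_t byte_offset)
{
    uint16_t imm5 = (uint16_t)((byte_offset / 4u) & 0x1Fu);
    return (uint16_t)((insn & ~(0x1Fu << 6)) | (imm5 << 6));
}
```

For example, LDR R0,[R1,#0] encodes as 0x6808; patching the offset to 4 yields 0x6848.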
By the way, one thing about Armv6-M and Armv8-M Baseline:
>LDR R0,[R13,#1024] <- offset too large ;-)
>LDR R1,[R13,#1020]
>LDR R2,[R13,#1016]
I think the maximum offset value is 1020 bytes. For Armv7-M and Armv8-M Mainline the offset can be much larger.
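The 1020-byte limit comes from the encoding: the Armv6-M SP-relative load/store takes an 8-bit immediate scaled by 4 (imm8*4). A small validity check, assuming nothing beyond that encoding:

```c
/* Armv6-M LDR Rt,[SP,#imm]: imm = imm8 * 4 with imm8 in 0..255,
   so valid byte offsets are multiples of 4 from 0 to 1020. */
int valid_sp_offset(unsigned byte_offset)
{
    return (byte_offset % 4u == 0) && (byte_offset / 4u <= 255u);
}
```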
Sorry J - as you say, the Thumb v6 Quick Reference gives an imm range of '0-124 in multiples of 4' (that is for a low-register base; the SP-relative form reaches 1020). Not a problem. I'm going to simply use a 32-int stack frame and copy the data in and out. Considering that the 2 passes and filtering of each 32-sample block currently take 3100 cycles, I can manage this.

I've realized I need to produce the most general solution possible, thus no SMC, and the smallest RAM footprint possible. Bastian suggested ages ago that I unroll the first pass, so 8 lines of code using an 18-instruction macro unroll to 762 lines of assembly language.

Do you think I should retain MP1 & MP2 functionality? I'm going to retain the stereo options, even though a stock M0+ isn't likely to produce high-quality music, as well as the 32/44.1/48 kHz playback options, since they don't take much more space, BUT the MP1 & MP2 support is a significant chunk...
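As a sanity check on those numbers: 3100 cycles per 32-sample block implies, at the 44.1 kHz playback rate mentioned above, roughly 4.3 MHz of CPU for this stage alone. A worked version of that arithmetic (the sample rate and cycle count are from the thread; the function name is mine):

```c
/* MHz of CPU consumed = cycles_per_block * blocks_per_second / 1e6 */
double budget_mhz(double cycles_per_block, double sample_rate, double block_size)
{
    double blocks_per_sec = sample_rate / block_size;  /* 44100/32 = 1378.125 */
    return cycles_per_block * blocks_per_sec / 1e6;
}
```

At 3100 cycles per block this gives about 4.27 MHz, which leaves room within a typical M0+ clock budget for the rest of the decoder.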
Given that MP3 encoder software is now widely available, I don't think there is a strong need for MP1 and MP2. If an audio file is in MP1/MP2, converting it to MP3 on a PC before upload shouldn't be a big issue. And, as you know, the flash memory space might not be big enough to support multiple formats anyway.
Joseph Yiu said:
Given that now MP3 encoder software is widely available I don't think there is a strong need for MP1
The question is not encoding but decoding - though I do not know the techniques behind MP1/MP2 coding. I also think that if the encoder knows about the drawbacks of the decoder, it can produce a file which suits the decoder best.
Sorry for not being clear. Let me explain.
MP3 decoding on microcontrollers has been around for many years; since the Cortex-M3 was released, people have been using it for MP3 decoding in low-end audio devices. MP3 encoding, however, was more challenging for older microcontrollers, so on those audio devices recording could only be performed in MP2. As a result, such devices supported both formats so that they could play back their own recorded audio.
Some ancient PC-based audio software also only supported MP2 encoding (I am referring to the really old ones... yes, I am that old), so audio devices using that recorded audio needed to support MP2 playback. Nowadays you can get free audio editing software like Audacity, which supports all major formats, including MP3.
In Sean's project (from my understanding) there is only playback, no recording. Given that you can use free software to convert audio into MP3 before downloading it to the playback device, I don't see the need to support MP1/MP2 decoding.
Hope this explained my view.
I appreciate everyone's interest. It's certainly proving to be an interesting project, because it really does allow the M0+ to operate at virtually 1 MIPS/MHz. I presume that the values >1 seen in some documents refer to compilers removing dead instructions, since the M0+ CPU doesn't appear to execute any instructions asynchronously. IF I include low-bandwidth encoding, it won't be in real time. I'm mulling over the use of ADPCM for real-time encoding; the device can then decode blocks to be encoded in MP3 or ACELP format.

Mainly I'm thinking of teachers. The key thought is to find a model that does not require an established infrastructure: a simple, low-cost, low-energy device that brings audiobooks to everyone on earth. It's a small thing, but it empowers people.

I should also add that I think the IoT revolution needs its 'killer app' and, considering the cost and power consumption of the M0+, MP3 and ACELP - i.e. low-cost audio - are potentially that app. I want something that makes the M0+ THE baseline CPU for the (powered) IoT, because then we can proceed without recourse to dirty solutions like the hideous Javacard. I'm hoping that PragmatIC can produce very cheap ROMs that can upload using Bluetooth energy harvesting, i.e. back-scatter. If a library of audiobooks doesn't even need power, it's available where most everything else is not.

Sean
PS It's interesting to see that things like the Huffman coding have been designed with custom silicon in mind. Why not simply use a hardware solution? Because I want to support ACELP and potentially allow upgrades to add things like RCELP and other codecs with highly asymmetrical encode/decode complexity when their patents run out. At all times, cost and power consumption are key. Yes, a Fitbit retailing at £100 can simply use an M4, but my target is <$5.
BTW, 'Real Time Implementation and Optimization of MP3 Decoder on DSP' by Benix Samuel and Ashok Jhunjhunwala (Indian Institute of Technology Madras, India), DOI: 10.1109/CCECE.2008.4564625, is a very useful document. I was relieved to discover that their profile of CPU bandwidth matched my own. Being based on the Blackfin DSP, which only supports 16-bit x 16-bit MUL and MAC, it yields performance similar to an M0+: a complexity of 24.5 MIPS, so JUST possible. 23.85 kbit/s ACELP is listed as <40 MIPS, but that uses an ungainly method to add the higher frequencies. In that case, the speed at which the tables are generated is the limiting factor and, AFAIK, there is no DSP support for the maths involved, so a processor like the M0+ will do well, especially given that the encode/decode complexity is highly asymmetrical.