Hi - I have spent several months optimising my MP2/MP2.5/MP3 decoder on my Raspberry Pi Pico. Profiling showed that, of the 19 files containing code, 5 small files take most of the time. I mean, it's the polyphase filter and so forth, so no surprise. I'm running the MP3 decoder from an interrupt, so I stack LR and store SP to memory, since SP is a very powerful register (all of those addressing modes and so on), but I keep running into the same issue.

The MULSHIFT32 macro is used thousands of times throughout the code and, as the name suggests, it multiplies two 32-bit values together and returns the top 32 bits of the result. Just to give an example, the polyphase loop is 417 lines and, IF MULH were a valid instruction, it would take 409 to 421 cycles depending on branches, BUT if I use the macro, which some really excellent coder on here managed to get down to 17 cycles (I forget his name but he is awesome), it takes 627 cycles.

That means that, as it is, it can play a high quality mono stream with the clock speed set at 64MHz, but my calculations suggest that if I had MULH, it would manage with the clock at 48MHz, which would save power.

I know exactly nothing about microcode in processors. I am aware of the Cortex-M1, but it seems that it can only perform the slow multiply (33 cycles in this case), and many moons ago I did read about the ARM7EJ-S, which sounded fascinating.

Now, the aim is to produce a USB memory stick that uses the Cortex-M0+ processors found in 95% of these sticks to support not only MP2/2.5/3 (and possibly AAC, which looks more complex to encode but no more complex to decode) but also ACELP (the patents just ran out for MP3 & ACELP), as well as 1,2,3,4,5-bit ADPCM (I suppose 1 bit is technically delta compression?) and LPC10 (LPC10e is still under patent). A good friend showed me some encoding tricks for LPC and, considering that it was using 300, 600 & 1200 bits per second, the quality was great - certainly good enough not to be annoying and, I would hope, sufficiently good for its use in audiobooks aimed at education.

I do apologise if this is just a stupid question, but I am wondering how much extra silicon adding a MULH would take (yes, I know I'm dumb).

On the plus side, the reference fixed-point MP3 decoder I am using IS in C but, interestingly, the code has been written in a manner which presumes sixteen 32-bit registers, one of which is the SP. When I use these tricks to get back r13 & r14, I have been able to avoid needing a stack frame or a buffer in RAM (obviously, since I use SP, it's a dang good thing I DO NOT need a stack frame).

Many, many people have helped me and, while my memory is terrible, you know who you are, and although illness has slowed me down (seizures), I am getting it together. I am honestly wondering just how much adding a MULH to the instruction set would cost in silicon, because it really does knock off ⅓ of the processing time.

Of course, Naveed has been a constant source of inspiration, and I've just ordered some more audio stuff and an OLED screen, so I intend to get some great sound from a humble USB stick.
I might add that idct9.c and MidSideProc.c also use huge numbers of MULSHIFT32 macros. With the help of the people here I have been able to produce a macro that finds the top 32 bits of a 64-bit result, so the long and the short of it is that I have to run the processor very quickly (like 166 MHz), which is certainly not conducive to a long battery life (which is a cornerstone of the project).

Put simply, MULSHIFT32 performs a 32-bit x 32-bit ---> 64-bit multiply, but only the top 32 bits are needed.

I admit that I am not well up on this synthesizable area of development, as for most of my career I was writing games for consoles, but I FEEL so close. I do not know about CPU development, but does anyone with knowledge of hardware design have any input on this one? I've been staring at it for almost a year and frankly it is driving me insane.

I have a basic understanding of how a processor performs a multiply. The 1-cycle variant that I need to use forms something along the lines of a 'ripple multiplier' in which all 32 steps of the multiply are carried out in a single clock cycle. I have spent some time reading the appropriate patents covering the issue, and I note that the minimum configuration for the M0 uses around 12000 gates, but with all options, including the 1-cycle multiply, it's more like 25000.

Even so, it's such a tiny piece of silicon: even when produced using a 40nm process it is 0.008 mm², so if the maker is using a 200mm wafer (most common at the moment), then each wafer contains thousands of processors and, with such small chips, the % lost to manufacturing errors must be tiny.

While I do appreciate that cost is important, it's also worth noting that the M0 is actually a very powerful processor indeed. The quoted figure is 0.89 MIPS/MHz and the M0+ figure is 0.95 MIPS/MHz, but having written tens of thousands of lines of highly optimised Thumb, I find those figures are closer to 0.92 and 0.97. In fact, when it comes down to the math behind an audio driver that is adding 32 frequency ranges to the total output, a figure of 0.98 is not at all unrealistic.

But that is where I am. I would really appreciate any ideas that people might have. It's very depressing when you can see tens of thousands of lines of optimised assembly language but know that a 17-instruction piece of code used throughout is the limiting factor.

The processor contains a ripple multiplier, so it isn't the case that ARM would have to gut the processor and start again. Sorry to be doomy and gloomy, but I am hoping and praying that someone associated with this element will realise just how much power it unleashes.
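For anyone following along, here is a rough C sketch of why the top 32 bits are expensive when the multiplier only returns the low 32 bits of a product (as MULS does on the M0+). It splits each operand into 16-bit halves so that every partial product fits in 32 bits, then combines the carries. The name `mulshift32_parts` is made up for illustration; this is the textbook decomposition, not the 17-cycle routine mentioned above, and it assumes `>>` on a negative `int32_t` is an arithmetic shift (true for the usual ARM compilers, but implementation-defined in C).

```c
#include <stdint.h>

/* Hypothetical helper: high 32 bits of a signed 32x32 -> 64-bit product,
 * built only from 32-bit multiplies, adds and shifts.
 * Assumes two's complement and arithmetic right shift of signed values. */
static int32_t mulshift32_parts(int32_t x, int32_t y)
{
    uint32_t x0 = (uint32_t)x & 0xFFFFu;  /* low halves, treated as unsigned */
    uint32_t y0 = (uint32_t)y & 0xFFFFu;
    int32_t  x1 = x >> 16;                /* high halves, keep the sign */
    int32_t  y1 = y >> 16;

    /* low x low: only its upper 16 bits can carry into the result */
    uint32_t w0 = x0 * y0;

    /* first cross term plus the carry from w0 (fits in 32 bits) */
    int32_t t = x1 * (int32_t)y0 + (int32_t)(w0 >> 16);

    /* second cross term plus the low 16 bits of t */
    int32_t w1 = (t & 0xFFFF) + (int32_t)x0 * y1;

    /* high x high plus the upper parts of both cross sums */
    return x1 * y1 + (t >> 16) + (w1 >> 16);
}
```

That is four multiplies plus the masking, shifting and adding around them, which gives a feel for where the extra cycles go compared with a single hypothetical MULH.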
Sean,
instead of thinking about pimping an M0+, what about a low-power CM3? It would make your life so much easier with all the DSP instructions, for example "SMULL". I mean, even if it takes a bit more power, the net result might be positive, as you can run it at a 10th of the speed of the CM0+.
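As a rough illustration of the SMULL point: on a core with a long multiply, the macro can be written as a plain 64-bit multiply in C. The function name `mulshift32_wide` is hypothetical. A compiler targeting a core with SMULL would normally be expected to lower this to one long multiply and keep the high register, but that is an assumption worth checking in the disassembly for your toolchain.

```c
#include <stdint.h>

/* Hypothetical helper: high 32 bits of a signed 32x32 -> 64-bit product,
 * written so the compiler is free to use a long multiply if the core has one.
 * Assumes arithmetic right shift of a negative 64-bit value. */
static inline int32_t mulshift32_wide(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}
```

On an M0+ the same source still compiles, but the 64-bit multiply has to be done in software, which is the cost being discussed in this thread.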
You are right, Bastian, but ARM is a fabless CPU/DSP designer and whoever builds the final chip based on an ARM design pays a licence. As I understand it, an M0+ licence is by far the cheapest of all. As you know, it's a non-profit project for developing nations, so keeping that cost down is key.

I wish I could find the ARM patent on their ripple multiply (1 cycle), because I would love to know IF they produced a version with a 64-bit result (would need 24000+ gates) or if they were smart enough to develop one that acts as a MULH. After all, it manages a 2-bit x 30-bit multiply with aplomb, so it would merely have to keep a 48-bit intermediate which is added at the end. The problem is that ARM patents under 5 names, so finding the patent is a real challenge.

After all, B, you managed to beat ARM's own compiler by 5 cycles, which is a HUGE improvement! I just want to keep that gate count down. If a MULH means 17000 rather than 16000, it's still tiny and uses a tiny amount of power.
Ah, forgot you want to build an all-in-one chip. Would be cool if there were ASICs with CM3 hard-macros where you would not have to pay the full license (as it is paid by the Fab) but only per chip.

Or: ARM would provide the RTL for the SMULL so it could be "added" to a CM0+ :-)

Anyway, despite the time you already spent, would RISC-V be an option? (Honestly I did not bother to look into the ISA yet, as I do not see any projects with it.)