Hi, I am developing a fixed-point MP3 & ACELP decoder on an Arduino Due. I realize that bit-banding makes an RMW sequence atomic, but I notice that fields in the Due hardware are multi-bit fields. Is there an atomic way to alter multiple bits? I realize that the atomic RMW can probably allow the application/IRQ/NMI to resolve which gets access to the bus, but if that is the only real use, wouldn't it have been simpler to use an atomic test-and-set as the SuperH does?

Examples like the PERIPH_PTSR (bits 0 & 8 used) are classic. Do you write to separate bytes, perform 2 bit-band operations or write to 16 or 32 bits at once?

I do appreciate that this is very low level. I also note that TCM (tightly coupled memory) is mentioned just once and I'm not quite sure if this refers to part of the cache or a PSX-like scratch-pad. The PSX was another machine that needed hand-written assembly language to get the full performance out of it.

Many thanks,
Sean
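For reference, a minimal C sketch of how a bit-band alias address is formed for the peripheral region on a Cortex-M3 such as the one in the Due. The function name is hypothetical; the two base addresses are the standard Cortex-M3 peripheral bit-band region and its alias. Each alias word maps to exactly one bit, which is why a multi-bit field cannot be changed with a single bit-band write.

```c
#include <stdint.h>

#define PERIPH_BASE     0x40000000u  /* start of the peripheral bit-band region */
#define PERIPH_BB_BASE  0x42000000u  /* start of the matching alias region      */

/* Hypothetical helper: alias word address for one bit of a peripheral
   register. Every bit of the 1 MB region gets its own 32-bit alias word. */
static uint32_t bitband_alias(uint32_t reg_addr, unsigned bit)
{
    return PERIPH_BB_BASE + ((reg_addr - PERIPH_BASE) * 32u) + (bit * 4u);
}
```

Writing 0 or 1 to the word at the returned address changes only that one bit, so a field such as bits 0 and 8 would take two separate alias writes, and the pair as a whole would not be atomic.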
Dear Bastian. My application requires 3 channels of DMA to run concurrently and, given the number available even on the M0+ SoC, I double-buffered them. Accurate timing is very important, so I hoped to set off the next DMA as quickly as possible, i.e. setting off the new interrupt before calculating and setting up the next one (set off in the next interrupt). The FIFO size is 8 elements so I have a little time, and RESx is only 1 bit, but many of the others are fields.

It's my own fault, I had presumed that ARM's Cortex designs would be as groundbreaking as the ARM7TDMI (for example). I had mistakenly thought they had done something like John E. Zolnowsky's (Motorola/SUN) work in the 1980s-1990s for MMUs that used memory block descriptor templates. It just put each field at the bottom of a 32-bit space.... but I am wrong.

Many, many thanks.
Sean,
how many DMA channels do you have? Can't you set up one in the background for the next transfer and just activate it when the previous one has finished?

BTW: IIRC some Atmel AT91SAMxxx could do what you want. But only for GPIOs.

Maybe you chose the wrong SoC? ;-)
Hi Bastian, I am working on a very cost-sensitive project so I am limited to the specification of the SoC. In simple terms, it's a Hitachi Bluetooth processor that has an M4 with FP that is controlling everything. An odd idea at first glance but it's the cheapest solution. I come from a 100% assembly language games-coding background so I look at the theoretical bus bandwidth and allow myself 7.5% extra. Sometimes the most important skill is making a SoC do something nobody else can.

BTW ARM7 is a beautiful language but Thumb is just awful. The C flag being inverted for some reason... but a reason that makes SBCS useless. About the most useful trick is the read-multiple instructions that include IP, so a jump table can set up R0-R7 in a single instruction... but so few optimizations.... but I will find them in the end.

Can anyone explain why SBC(S) Rd := Rn - Operand2 - NOT(Carry) <---- why NOT carry?

Has anyone played with ROR{S} Rd, Rm, Rs? The manuals state that the bottom 8 bits of Rs are used. Is this so C can be set if bits 5-7 are non-zero? It has been bugging me for ages.

I appreciate your interest and thank you!
Sean Dunlevy said: Can anyone explain why SBC(S) Rd := Rn - Operand2 - NOT(Carry) <---- why NOT carry?
Maybe because it reverses ADCS.
Rd = Rn + Rm
Rx = Rd - Rm
=> Rx == Rn
Like the 6502 ;-)
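One way to see where the NOT(Carry) comes from is to model the subtract as an add of the inverted operand, which is how a two's-complement adder is commonly reused. The sketch below is only an illustration in C, not a description of the actual M0 hardware; all names in it are hypothetical.

```c
#include <stdint.h>

typedef struct { uint32_t value; unsigned carry; } alu_result;

/* ADCS model: Rn + Op2 + C, with the carry out of bit 31 returned. */
static alu_result adc32(uint32_t rn, uint32_t op2, unsigned carry_in)
{
    uint64_t wide = (uint64_t)rn + op2 + (carry_in & 1u);
    alu_result r = { (uint32_t)wide, (unsigned)(wide >> 32) };
    return r;
}

/* SBCS model: Rn - Op2 - NOT(C) is the same sum as Rn + ~Op2 + C,
   so C = 1 means "no borrow" and C = 0 means "borrow one more". */
static alu_result sbc32(uint32_t rn, uint32_t op2, unsigned carry_in)
{
    return adc32(rn, ~op2, carry_in);
}

/* 64-bit subtract as a SUBS/SBCS pair: SUBS behaves like SBCS with C = 1,
   and its carry out feeds the high word directly. */
static uint64_t sub64(uint64_t a, uint64_t b)
{
    alu_result lo = sbc32((uint32_t)a, (uint32_t)b, 1u);
    alu_result hi = sbc32((uint32_t)(a >> 32), (uint32_t)(b >> 32), lo.carry);
    return ((uint64_t)hi.value << 32) | lo.value;
}
```

With this convention the carry left by SUBS can be chained into SBCS without any inversion, in the same way that ADDS chains into ADCS.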
Dear Bastian, the ADCS works correctly. If C=1 then an extra 1 is added. The SBCS doesn't do that. If C=0 then an extra 1 is subtracted.

Now I have been led to believe that Thumb was designed to produce the smallest footprint for code compiled from C/C++ and now even that is being questioned. The instruction set is almost the bare minimum needed to do the job. Is there a reason why it would compile smaller C?

The RORS is another puzzle. Can you think of a reason why the bottom 8 bits of a register value are used to rotate a second register when anything >32 appears to be pointless? As I surmised, maybe it's to set the flags, but I haven't found any code that uses the RORS Rd,Rs instruction.

I spent a lot of time optimizing certain M0 functionality such as the fastest 32-bit x 32-bit --->64-bit (after 5 different attempts I found that 17 cycles is the minimum) but I never worked out how to use the bottom 32 bits of the result you get from a MULS and had to use 4 MULS. In ARM v6 the MULS doesn't change the C & V flags (previous versions did) and I'm baffled as to why they would bother to change that. I keep thinking that somewhere in Texas, someone who developed the microcode has a fantastic plan but it either failed to happen or the functionality wasn't shared.
17cy for the multiply isn't bad
Sean Dunlevy said: I keep thinking that somewhere in Texas, someone who developed the microcode has a fantastic plan but it either failed to happen or the functionality wasn't shared.
Yes, I wonder what the design goals for the M0 were. Maybe they had a maximum gate count. I don't think they use microcode at all, do you? But actually, I have had these thoughts about many architectures in the past (RX, V850, PowerPC VLE). I sometimes wonder if they ever ask assembly programmers about the instructions they want to implement. Concerning RORS, I think I haven't used it at all in the past 19 years. Oh, twice in our kernel: "// no eRORS" ;-)
Dear Bastian, I was under the impression that Thumb was arrived at by analysing compiled C/C++. Cortex marked the beginning of it being the only instruction set. A few things seem odd. That only the MOVS & ADDS can operate on hi-registers seems pretty brutal. Especially since you can free up the LR & even the SP, allowing 15 registers to be used.

Power and gate-count were the critical things but, as I have mentioned, why clean up the MULS? I actually intended to write an MP3 (64 kb/s mono) / ACELP (32 kb/s mono) decoder, i.e. an Audible player using an M0+ (if a dedicated player is cheap enough, it would be of great benefit to schools, for example). There aren't a huge number of CLZs needed so in fact the only thing that stopped me was that I couldn't get a 32-bit x 32-bit --->64-bit multiply in under 17 cycles. I looked at many compiler outputs and started with a clean sheet several times but it always ended up at 17 cycles.

I wish I knew who it was who updated that MULS because it FEELS like the bottom 32 bits should be usable as is, but in every approach 16-bit partials had to be used. What advantage does preserving the C & V flags while updating the N & Z flags have? Do you know what I mean? It feels like there is a very good but unstated reason.

From Tetris in 1985 to the Gameboy versions of Tomb Raider, I have worked exclusively in assembly language and in every other case I could understand (or at least make sense of) the designers' thought processes, and it's driving me mad. As I mentioned, there are actually quite a lot of tricks and the SP has unique addressing modes so I can get close to using 100% of the bus bandwidth, but that MULS thing is driving me crazy. I see you have worked on many of the same CPUs as I have. The Virtual Boy (V850) and Sega 32X/Saturn (SH2) have pretty good instruction sets (and use the branch delay slots!) so I can only presume gate-count is the number 1 consideration with Cortex.

I looked into the fabrication process size of Cortex and they are using 10-year-old sizes, i.e. they are using production facilities that are out of date for most applications. I mean it's a really clever strategy. But that MULS. I feel like Sisyphus. I keep pushing the silicon up the hill only for it to roll back down again :-)

Nice to talk to a real expert. I really value it.

Sean
Sean Dunlevy said: that I couldn't get a 32-bit x 32-bit --->64-bit multiply in under 17 cycles. I looked at many compiler outputs and started with a clean-sheet several times but it always ended up 17 cycles.
Sean, I have been trying all week (OK, whenever I get some spare time), but I can't get it down to 17 cycles. Only 18.
Are you sure your version works for _all_ inputs, for example 0x120000*0xffffffff?
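For checking candidate routines against awkward inputs like this one, a C model of the four-partial-product approach can help. The sketch below is an unsigned reference built only from 32-bit multiplies (as MULS provides); it is not the 17-cycle routine from this thread, and the names are hypothetical.

```c
#include <stdint.h>

typedef struct { uint32_t lo, hi; } u64_parts;

/* Unsigned 32 x 32 -> 64 using four 16 x 16 -> 32 partial products. */
static u64_parts mul32x32_64(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll = al * bl;              /* weight 2^0  */
    uint32_t lh = al * bh;              /* weight 2^16 */
    uint32_t hl = ah * bl;              /* weight 2^16 */
    uint32_t hh = ah * bh;              /* weight 2^32 */

    /* The two middle products can overflow 32 bits when summed. That
       carry has weight 2^48, i.e. bit 16 of the high result word. */
    uint32_t mid = lh + hl;
    uint32_t mid_carry = (mid < lh);

    /* Adding the low half of the middle sum can carry into the high word. */
    uint32_t lo = ll + (mid << 16);
    uint32_t lo_carry = (lo < ll);

    u64_parts r;
    r.lo = lo;
    r.hi = hh + (mid >> 16) + (mid_carry << 16) + lo_carry;
    return r;
}
```

The two carries are the awkward part: one lands in bit 16 of the high word and the other in bit 0, so inputs that trigger either of them are the ones worth testing.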
------------------------
OK, got the 17cy version. :-) I also thought of a 12cy version, but it does not work for all inputs :(
Hi Bastian, that figure doesn't include PUSH/POP R4 or the BX LR. I'm just using the code that GNU outputs, with the stuff removed for in-line assembly language. I agree that there may well be some faster methodology. I THINK that the Cortex (Core Texas) MULS is the one described in US Patent 7447726. The one thing that really makes it a pain is having to use 4 MULS when the bottom 32 bits of the result can be calculated using just 1 MULS. When only the top 32 bits of the result are needed, a fast way to do that would also be a big speedup.
Yes, left out saving r4 as well. No need to push though. You can move it to r12, which is also a scratch register, at least in the C ABI ;-)

Regarding 3 muls: the problem is to find out the number of carries :( 0, 1 or 2. But I fear finding those eats up the benefit.

I also tried the Karatsuba method, but this does not work if one loses the upper 32 bits :(

Anyway, it was nice brain-training.
Yeah - you can use r0-r14 if your code is bottom level. LR is not needed & SP can be stored. Interrupts have their own registers. If you look into Thumb's STM & LDM, they are convenient for switch statements: multiple registers set up & a new IP. There are also addressing modes unique to SP. I presume that is so C can allocate & access RAM.

Only today did I work out why LSLS/LSRS/ASRS/RORS Rd,Rs (powerful instructions) use Rs[7:0] rather than Rs[4:0], i.e. the shift can be 0-255. You can set up 256 numbers, many of them >16 bit, so it is a trick to set up a 32-bit value in 32 bits of code in total, i.e.

    MOVS Rd,#<immediate>
    RORS Rd,Rd

The shift instructions can manage a few more. Don't forget that code in (E)(E)(P)ROM may read 32 bits to the bus, i.e. 2 instructions in 1 fetch. Some M0(+) SoCs also have tiny caches, e.g. 64 bytes direct-mapped, so setting up a number with code in the IC rather than with odd DC reads might be a speed up.

Now ARM has a long history of including barrel-shifters implicit in data fetches and I strongly suspect that the extra 3 bits consumed a tiny number (single digit) of transistors. I actually asked both Joseph Yiu & Steve Furber about both MULS and shift/rotate and neither remembers the exact reason, but Mr Yiu pointed out that the SBCS uses NOT C to save transistors so the count must be tiny.

I've always based my logic on instruction sets as being 'well, the designers must have done it this way for a good reason' and for the Cortex M0+ it was getting a 32-bit CPU into the same number of transistors as an 8-bit CPU. It is clunky, it is annoying and it has some things that make you want to punch the screen, but there ARE tricks to be had.

Going back to my 32-bit x 32-bit -->64-bit signed multiply, what is so annoying is...

    adds r0,r2
    movs r2,#0
    adcs r2,r2
    lsls r2,r2,#16

The ONLY instructions that take notice of C are ADCS, SBCS and branches. Yes, the M0+ is branch friendly but it's still slower than the 17 cycles the thing takes. I HAVE considered bit-stuffing using REV, i.e. two 8-bit x 8-bit -->16-bit, but it's still too much setup. You can imagine what a speed up 17-->15 cycles is worth.

More or less, the multiply & count-leading-zeros are the only 2 fragments needed for MP3 decoding. Joseph Yiu was interested in finding out if an M0+ at ≈48 MHz can do it. I'm also keen to try it out. 100% asm required ;-)
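To list which constants the MOVS/RORS pair can produce, the rotate can be modelled in C. This is a sketch under one assumption drawn from the ARMv6-M description of ROR (register): the amount is taken from Rs[7:0], and for any non-zero amount the result equals a rotate by that amount modulo 32. Function names are hypothetical.

```c
#include <stdint.h>

/* 32-bit rotate right; only the low five bits of n affect the result. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* Value left in Rd after: MOVS Rd,#imm8 ; RORS Rd,Rd
   The immediate acts as both the data and the rotate amount. */
static uint32_t movs_rors(uint8_t imm8)
{
    return ror32(imm8, imm8);
}
```

For example, an immediate of 1 yields 0x80000000 and an immediate of 0x21 yields 0x80000010. Looping over all 256 immediates gives the full table of values reachable in two 16-bit instructions.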
You can do it in 12 cycles if you are sure about the input values (no carry from low to high word):

    movs r2,r0      // ab
    lsrs r0,r0,#16  // a
    lsrs r3,r1,#16  // c
    movs r4,r0      // a
    muls r0,r3      // ac
    muls r3,r2      // ab*c
    muls r1,r2      // x
    muls r2,r4      // cd*a
    lsls r3,r3,#16
    lsls r2,r2,#16
    adds r2,r3
    adcs r0,r2
After the 32-bit x 32-bit -->64-bit, the next two most important code fragments are 32-bit x 32-bit --> top 32 bits of the 64-bit result, and count-leading-zeros. The former is generally termed Mulshift32 in macros. It takes me 15 cycles. The latter takes 10 cycles by placing the 64-byte lookup (the De Bruijn sequence) on the Cortex M0's 'Zero Page', i.e. the bottom 256 bytes of memory can have their address set up with an immediate and, since the offset is scaled by the size of the element read, the bottom 1024 bytes can be accessed this way.

Along with the CLZ tables, I've put the fixed-point SIN & COS lookup tables into that space. I DO have a tiny 64-byte cache and, while EEPROM can only be accessed at 24MHz (you have to include 1 wait), the SoC does try to read 32 bits at the same time, so sometimes a little more code in exchange for keeping the thing from using lookups (that could thrash the cache) is faster.

It is more or less an experiment to see if a Cortex M0 can decode 32 kb/s ACELP in real time. That being the case, PragmatIC and others are now printing multi-layer ICs onto plastic (so CPU design is 3D), so if a breakfast cereal (for example) can have a printed CPU & ROM and use backscatter WiFi for power & comms then everything in the shop can talk.

I presume that for such a large market, custom SoCs will be designed and built, so all I am testing is just how little silicon (or in this case metal on plastic) can do the job. Hence 100% assembly language.

I have stared at those 3 instructions needed to place the C bit into bit 16 of a register and it is driving me mad. Only ADCS & SBCS use C as input. Since SBCS Rd,Rs is actually Rd = Rd - Rs - NOT C (i.e. an extra 1 is subtracted when C is not set), I realize that SBCS R2,R2 means R2 will contain either $00000000 (if C was set) or $ffffffff (if C was clear). It saves a MOVS R2,#0 but I cannot see how to use it...
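The two ways of capturing C can be put side by side in C. This is only a model of the values involved, with hypothetical function names, and it assumes the shift in the three-instruction sequence is by 16 as described above. It does not claim to save a cycle.

```c
#include <stdint.h>

/* SBCS Rd,Rd: Rd - Rd - NOT(C) gives 0 when C is set, 0xFFFFFFFF when clear. */
static uint32_t sbcs_same(unsigned carry)
{
    return 0u - (uint32_t)((carry & 1u) ^ 1u);
}

/* MOVS Rd,#0 / ADCS Rd,Rd / LSLS Rd,Rd,#16: C ends up in bit 16. */
static uint32_t carry_to_bit16(unsigned carry)
{
    uint32_t r = 0u;            /* MOVS Rd,#0 leaves C untouched */
    r = r + r + (carry & 1u);   /* ADCS Rd,Rd                    */
    return r << 16;             /* LSLS Rd,Rd,#16                */
}

/* One possible bridge: the mask is all ones exactly when C was clear,
   so inverting it and keeping bit 16 gives the same value. */
static uint32_t mask_to_bit16(uint32_t mask)
{
    return ~mask & 0x00010000u;
}
```

The bridge needs an invert plus an AND with a constant that must itself be loaded, so in this form the mask costs more than the three-instruction sequence. It is mainly of use where a 0/0xFFFFFFFF mask is wanted directly.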
15 cycles for the multiplication? For _all_ input values? OK, I need to do some more research.
Ha, this "zero page" idea is great. Too bad on some CM0, the first 128 bytes are used by the vector table.
Concerning CLZ, I never came across De Bruijn; I only did it with a binary search. But this takes from 16 to 32 cycles. A constant number of cycles (even fewer) is great ....
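For comparison with the binary search, here is a C sketch of a De Bruijn style count-leading-zeros. It uses the widely published 32-entry table and multiplier for integer log2, so it is not necessarily the 64-byte layout Sean describes, and the names are hypothetical.

```c
#include <stdint.h>

/* Maps the top five bits of (smeared value * 0x07C4ACDD) to floor(log2). */
static const uint8_t debruijn_log2[32] = {
     0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
     8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
};

static unsigned clz32(uint32_t v)
{
    if (v == 0u)
        return 32u;             /* no set bit: all 32 bits are leading zeros */

    /* Smear the highest set bit into every lower position. */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    /* Only 32 smeared values exist; the multiply hashes each one to a
       unique 5-bit index, so the lookup takes constant time. */
    return 31u - debruijn_log2[(uint32_t)(v * 0x07C4ACDDu) >> 27];
}
```

Apart from the zero check there is no data-dependent branching, which is where the constant cycle count comes from. On an M0 the cost is mostly the shift/OR pairs plus one MULS and one byte load.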
Regarding "SBCS Rn,Rn": I understand that this could drive one crazy, if you find a cool trick but no place to use it :-)

Too bad most of those tricks can't be used in an RTOS, as most of what it does is moving data from A to B (list handling).