
Bit-Banding. Only 1 bit at a time?

Hi,
    I am developing a fixed-point MP3 & ACELP decoder on an Arduino Due. I realize that bit-banding makes an RMW sequence atomic, but I notice the fields in the Due hardware are multi-bit fields. Is there an atomic way to alter multiple bits? I realize that the atomic RMW can probably allow the application/IRQ/NMI to resolve which gets access to the bus, but if that is the only real use, wouldn't it have been simpler to use an atomic test-and-set as the SuperH does?

Examples like the PERIPH_PTSR (bits 0 & 8 used) are classic. Do you write to separate bytes, perform 2 bit-band operations, or write all 16 or 32 bits at once?

I do appreciate that this is very low-level. I also note that TCM (tightly coupled memory) is mentioned just once, and I'm not quite sure whether it refers to part of the cache or a PSX-like scratchpad. The PSX was another machine that needed hand-written assembly language to get the full performance out of it.

Many thanks,
Sean


  • Yes, bit-banding is for single bits only, though you can use STM if the bits are adjacent. But that still alters the bits one after another.
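
    For reference, the alias arithmetic works like this (a minimal C sketch for the M3/M4 peripheral bit-band region; SOME_REG is a made-up address, not a real Due register):

        #include <stdint.h>

        /* Peripheral region 0x40000000-0x400FFFFF is aliased at 0x42000000.
           For a 32-bit register at `addr`, bit `bit` (0-31) gets its own word
           in the alias region; writing that word performs the RMW in hardware. */
        #define BITBAND_PERIPH(addr, bit) \
            (*(volatile uint32_t *)(0x42000000u + \
                (((uint32_t)(addr) - 0x40000000u) * 32u) + ((uint32_t)(bit) * 4u)))

        #define SOME_REG 0x400E1000u              /* hypothetical register address */

        void set_bit3(void)
        {
            BITBAND_PERIPH(SOME_REG, 3) = 1;      /* one bit, one atomic write */
        }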

    Why do you need it to be atomic? Is there another task/thread also writing to the same register?

  • TBH the M3/M4 are new to me; I've just been using M0 cores (in assembly language) and was wondering whether a suitable macro could treat all HW writes as atomic. I will have to fall back on DMB instructions. You don't miss what you never had. I was looking into TCM with an eye to placing those atomic operations in zero wait-state RAM.

    Many thanks for your help. It really is appreciated.

  • Sean, what do you want to achieve with the DMB?
    Do you have multiple tasks/threads/execution streams running? That is, do you want to be atomic with respect to another task and/or an interrupt?

    If so, check out LDREX/STREX. You can implement any atomic access with these.
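
    For example (a sketch using the CMSIS __LDREXW/__STREXW intrinsics; these exist on the M3/M4 but not on the M0/M0+, and the word being updated is assumed to live in ordinary SRAM - whether exclusives work on a given peripheral register is device-specific):

        #include <stdint.h>
        #include "core_cm3.h"          /* or your device header; provides __LDREXW/__STREXW */

        /* Atomically replace a multi-bit field inside a shared 32-bit word. */
        static void atomic_update_field(volatile uint32_t *word,
                                        uint32_t mask, uint32_t value)
        {
            uint32_t v;
            do {
                v = __LDREXW(word);                  /* load-exclusive        */
                v = (v & ~mask) | (value & mask);    /* modify just the field */
            } while (__STREXW(v, word) != 0u);       /* retry if pre-empted   */
        }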

  •  Dear Bastian. My application requires 3 channels of DMA to run concurrently, and given the number available even on the M0+ SoC, I double-buffered them. Accurate timing is very important, so I want to kick off the next DMA as quickly as possible, i.e. start the new transfer in the interrupt before calculating and setting up the one after it (which is started in the next interrupt).

     The FIFO size is 8 elements so I have a little time, and RESx is only 1 bit, but many of the others are fields. It's my own fault: I had presumed that ARM's Cortex designs would be as groundbreaking as the ARM7TDMI (for example). I had mistakenly thought they had done something like John E. Zolnowsky's (Motorola/Sun) MMU work of the 1980s-1990s, which used memory block descriptor templates that simply put each field at the bottom of its own 32-bit word... but I am wrong.

    Many, many thanks.

  • Sean,

    How many DMA channels do you have? Can't you set up one in the background for the next transfer and just activate it when the previous one has finished?
    BTW: IIRC some Atmel AT91SAMxxx parts could do what you want, but only for GPIOs.
    Maybe you chose the wrong SoC? ;-)

  • Hi Bastian,
                      I am working on a very cost-sensitive project, so I am limited to the SoC's specification. In simple terms, it's a Hitachi Bluetooth processor whose M4 with FP controls everything. An odd idea at first glance, but it's the cheapest solution. I come from a 100% assembly-language games-coding background, so I look at the theoretical bus bandwidth and allow myself 7.5% extra. Sometimes the most important skill is making a SoC do something nobody else can.

    BTW the ARM7 instruction set is beautiful but Thumb is just awful. The C flag being inverted for some reason... a reason that makes SBCS useless. About the most useful trick is the load-multiple instructions that can include the PC, so a jump table can set up R0-R7 and branch in a single instruction... but so few optimizations... still, I will find them in the end.

    Can anyone explain why  SBC(S) Rd := Rn – Operand2 – NOT(Carry) <---- why NOT carry?

    Has anyone played with ROR{S} Rd, Rm, Rs? The manuals state that the bottom 8 bits of Rs are used. Is this so C can be set if bits 5-7 are non-zero? It has been bugging me for ages. 

    I appreciate your interest and thank you!

  • Can anyone explain why  SBC(S) Rd := Rn – Operand2 – NOT(Carry) <---- why NOT carry?

    Maybe because it reverses ADCS.

    Rd = Rn + Rm

    Rx = Rd - Rm

    => Rx == Rn

    Like the 6502 ;-)
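
    Put another way: the ALU does subtraction as addition of the complement, so C ends up meaning "no borrow". A quick standalone C check of the identity (nothing ARM-specific here):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t rn = 0x12345678u, op2 = 0x9ABCDEF0u;

            for (uint32_t c = 0u; c <= 1u; c++) {
                uint32_t sbc = rn - op2 - (1u - c);  /* Rn - Op2 - NOT(C), as SBCS defines it  */
                uint32_t adc = rn + ~op2 + c;        /* Rn + NOT(Op2) + C, what the adder does */
                printf("C=%u: %08X %08X %s\n", (unsigned)c, (unsigned)sbc, (unsigned)adc,
                       (sbc == adc) ? "match" : "MISMATCH");
            }
            return 0;
        }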

  • Dear Bastian,
                          The ADCS works correctly: if C=1 then an extra 1 is added.
                          The SBCS doesn't do that: if C=0 then an extra 1 is subtracted.

    Now, I had been led to believe that Thumb was designed to produce the smallest footprint for code compiled from C/C++, and now even that is being questioned. The instruction set is almost the bare minimum needed to do the job. Is there a reason why it would compile C to smaller code?

    The RORS is another puzzle. Can you think of a reason why the bottom 8 bits of a register are used to rotate a second register, when anything >32 appears to be pointless? As I surmised, maybe it's to set the flags, but I haven't found any code that uses the RORS Rd,Rs instruction.

    I spent a lot of time optimizing certain M0 functionality, such as the fastest 32-bit x 32-bit ---> 64-bit multiply (after 5 different attempts I found that 17 cycles was the minimum), but I never worked out how to use the bottom 32 bits of the result you get from a MULS, so I had to use 4 MULS. In ARMv6-M the MULS doesn't change the C & V flags (previous versions did) and I'm baffled as to why they would bother to change that. I keep thinking that somewhere in Texas, someone who developed the microcode has a fantastic plan but it either failed to happen or the functionality wasn't shared.
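
    For anyone following along, the 4-MULS routine being discussed has this general shape; a C sketch of the unsigned core (the signed version needs corrections on top, and no particular cycle count is being claimed):

        #include <stdint.h>

        /* 32x32 -> 64 unsigned multiply using only 32-bit multiplies,
           i.e. what an M0/M0+ has to do by hand with four MULS. */
        static void umul32x32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
        {
            uint32_t al = a & 0xFFFFu, ah = a >> 16;
            uint32_t bl = b & 0xFFFFu, bh = b >> 16;

            uint32_t ll = al * bl;      /* four 16x16 partial products */
            uint32_t lh = al * bh;
            uint32_t hl = ah * bl;
            uint32_t hh = ah * bh;

            /* middle column; cannot overflow 32 bits */
            uint32_t mid = (ll >> 16) + (hl & 0xFFFFu) + (lh & 0xFFFFu);

            *lo = (mid << 16) | (ll & 0xFFFFu);
            *hi = hh + (hl >> 16) + (lh >> 16) + (mid >> 16);
        }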

  • 17cy for the multiply isn't bad

    I keep thinking that somewhere in Texas, someone who developed the microcode has a fantastic plan but it either failed to happen or the functionality wasn't shared.

    Yes, I wonder what the design goals for the M0 were. Maybe they had a maximum gate count. I don't think they use microcode at all, do you?
    But actually, I have had these thoughts about many architectures in the past (RX, V850, PowerPC VLE). I sometimes wonder if they ever ask assembly programmers which instructions they want implemented.
    Concerning RORS, I don't think I have used it at all in the past 19 years. Oh, twice in our kernel: "// no eRORS" ;-)

  • Dear Bastian,
    I was under the impression that Thumb was arrived at by analysing compiled C/C++. Cortex marked the beginning of it being the only instruction set. A few things seem odd. That only MOV & ADD (plus CMP and BX) can operate on the high registers seems pretty brutal, especially since you can free up the LR & even the SP, allowing 15 registers to be used.

    Power and gate count were the critical things, but as I have mentioned, why clean up the MULS? I actually intended to write an MP3 (64 kb/s mono) / ACELP (32 kb/s mono) decoder, i.e. an Audible player, using an M0+ (if a dedicated player is cheap enough, it would be of great benefit to schools, for example). There aren't a huge number of CLZs needed, so in fact the only thing that stopped me was that I couldn't get a 32-bit x 32-bit ---> 64-bit multiply in under 17 cycles. I looked at many compiler outputs and started from a clean sheet several times, but it always ended up at 17 cycles.

    I wish I knew who it was who updated that MULS, because it FEELS like the bottom 32 bits should be usable as-is, but in every approach 16-bit partials had to be used. What advantage does preserving the C & V flags while updating the N & Z flags have? Do you know what I mean? It feels like there is a very good but unstated reason.

    From Tetris in 1985 to the Game Boy versions of Tomb Raider, I have worked exclusively in assembly language, and in every other case I could understand (or at least make sense of) the designers' thought processes, so this one is driving me mad. As I mentioned, there are actually quite a lot of tricks, and the SP has unique addressing modes, so I can get close to using 100% of the bus bandwidth, but that MULS thing is driving me crazy.

     I see you have worked on many of the same CPUs as I have. The Virtual Boy (V850) and Sega 32X/Saturn (SH2) have pretty good instruction sets (and use their branch delay slots!), so I can only presume gate count is the number-one consideration with Cortex. I looked into the fabrication process sizes used for Cortex parts and they are 10-year-old geometries, i.e. they use production facilities that are out of date for most applications. I mean, it's a really clever strategy.

     But that MULS. I feel like Sisyphus. I keep pushing the silicon up the hill only for it to roll back down again :-)

    Nice to talk to a real expert. I really value it.
    Sean

  • that I couldn't get a 32-bit x 32-bit --->64-bit multiply in under 17 cycles. I looked at many compiler outputs and started with a clean-sheet several times but it always ended up 17 cycles.

    Sean, I have been trying all week (OK, whenever I get some spare time), but I don't get it down to 17 cycles. Only 18.

    Are you sure your version works for _all_ inputs, for example 0x120000 * 0xffffffff?

    ------------------------

    OK, got the 17cy version. :-) I also thought of a 12cy version, but it doesn't work for all inputs :(
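
    (If anyone wants to check a candidate routine against the nasty inputs, a throwaway host-side comparison with the compiler's 64-bit multiply does the job. umul32x32() here just stands in for whatever routine is under test:)

        #include <stdint.h>
        #include <stdio.h>

        void umul32x32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo); /* routine under test */

        int main(void)
        {
            /* values that tend to expose missing carries */
            uint32_t v[] = { 0u, 1u, 0xFFFFu, 0x10000u, 0x120000u,
                             0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu };

            for (unsigned i = 0; i < 8; i++)
                for (unsigned j = 0; j < 8; j++) {
                    uint32_t hi, lo;
                    umul32x32(v[i], v[j], &hi, &lo);
                    uint64_t ref = (uint64_t)v[i] * v[j];
                    if (hi != (uint32_t)(ref >> 32) || lo != (uint32_t)ref)
                        printf("FAIL %08X * %08X\n", (unsigned)v[i], (unsigned)v[j]);
                }
            puts("done");
            return 0;
        }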

  • Hi Bastian,
                     That figure doesn't include PUSH/POP R4 or the BX LR. I'm just using the code that GNU outputs, with the surrounding stuff removed for inline assembly. I agree that there may well be some faster methodology. I THINK that the Cortex (Core Texas) MULS is the one described in US Patent 7447726.

    The one thing that really makes it a pain is having to use 4 MULS when the bottom 32 bits of the result can be calculated using just 1 MULS. When only the top 32 bits of the result are needed, a fast way to get those would also be a big speedup.

  • Yes, I left out saving r4 as well. No need to push, though: you can move it to r12, which is also a scratch register, at least in the C ABI ;-)

    Regarding 3 MULS: the problem is finding out the number of carries :( 0, 1 or 2. But I fear finding those eats up the benefit.
    I also tried the Karatsuba method, but it doesn't work if one loses the upper 32 bits :(
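
    To spell out the Karatsuba problem: the middle term is a 17-bit x 17-bit product, so it can be up to 34 bits wide and a 32-bit MULS loses the top of it. A small host-side C illustration:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t a = 0xFFFFFFFFu, b = 0xFFFFFFFFu;
            uint32_t al = a & 0xFFFFu, ah = a >> 16;
            uint32_t bl = b & 0xFFFFu, bh = b >> 16;

            /* Karatsuba: a*b = z2*2^32 + (z1 - z2 - z0)*2^16 + z0 */
            uint32_t z0 = al * bl;                            /* fits in 32 bits */
            uint32_t z2 = ah * bh;                            /* fits in 32 bits */
            uint64_t z1 = (uint64_t)(ah + al) * (bh + bl);    /* up to 34 bits   */

            printf("z1 = 0x%llX (a 32-bit MULS would lose 0x%llX off the top)\n",
                   (unsigned long long)z1, (unsigned long long)(z1 >> 32));
            (void)z0; (void)z2;
            return 0;
        }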

    Anyway, it was a nice brain-training.

  • Yeah - you can use r0-r14 if your code is at the bottom level. The LR isn't needed & the SP can be stashed away; interrupts take care of their own registers. If you look into Thumb's STM & LDM (and PUSH/POP), they are convenient for switch statements - multiple registers set up & a new PC in one go. There are also addressing modes unique to the SP; I presume that is so C can allocate & access stack RAM.

    Only today did I work out why LSLS/LSRS/ASRS/RORS Rd,Rs (powerful instructions) use Rs[7:0] rather than just Rs[4:0], i.e. the shift amount can be 0-255. You can set up 256 numbers, many of them >16 bits, which gives a trick for setting up a 32-bit value in 32 bits of code in total, i.e.

    MOVS Rd,#<immediate>
    RORS Rd,Rd

    The other shift instructions can manage a few more. Don't forget that code in (E)(E)(P)ROM may be read 32 bits at a time over the bus, i.e. 2 instructions in 1 fetch. Some M0(+) SoCs also have tiny caches, e.g. 64 bytes direct-mapped, so setting up a number with code in the instruction stream rather than odd data reads might be a speed-up.
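
    A quick host-side C model of which constants the MOVS/RORS pair above can reach (it models only the resulting value, not the flags):

        #include <stdint.h>
        #include <stdio.h>

        static uint32_t ror32(uint32_t x, unsigned n)
        {
            n &= 31u;
            return n ? ((x >> n) | (x << (32u - n))) : x;
        }

        int main(void)
        {
            /* every value reachable with:  MOVS Rd,#imm8 ; RORS Rd,Rd  */
            for (unsigned imm = 0; imm < 256; imm++)
                printf("#%3u -> 0x%08X\n", imm, (unsigned)ror32(imm, imm));
            return 0;
        }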

    Now, ARM has a long history of building an implicit barrel shifter into its data operations, and I strongly suspect that the extra 3 bits consumed a tiny number (single digits) of transistors. I actually asked both Joseph Yiu & Steve Furber about both MULS and the shift/rotate question, and neither remembers the exact reason, but Mr Yiu pointed out that SBCS uses NOT C to save transistors, so the count must be tiny.

    I've always approached instruction sets as 'well, the designers must have done it this way for a good reason', and for the Cortex-M0+ the reason was getting a 32-bit CPU into the same number of transistors as an 8-bit CPU. It is clunky, it is annoying, and it has some things that make you want to punch the screen, but there ARE tricks to be had.

    Going back to my 32-bit x 32-bit --> 64-bit signed multiply, what is so annoying is...

    adds r0,r2      // low-word add, sets C

    movs r2,#0      // MOVS #0 leaves C alone
    adcs r2,r2      // r2 = 0 + 0 + C, i.e. the carry bit
    lsls r2,r2      // r2 <<= r2 (0 stays 0, 1 becomes 2)

    The ONLY instructions that take notice of C are ADCS, SBCS and the conditional branches. Yes, the M0+ is branch-friendly, but going that way is still slower than the 17 cycles the thing already takes. I HAVE considered bit-stuffing using REV, i.e. two 8-bit x 8-bit --> 16-bit multiplies, but it's still too much setup. You can imagine what a speed-up from 17 to 15 cycles would be worth.

    More or less, the multiply & count-leading-zeros are the only 2 fragments needed for MP3 decoding. Joseph Yiu was interested in finding out whether an M0+ at ≈48 MHz can do it. I'm also keen to try it out. 100% asm required ;-)
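
    On the count-leading-zeros side, the M0/M0+ has no CLZ instruction, so a software fallback is needed; a plain C binary-chop sketch of the idea (a hand-scheduled asm version would look different):

        #include <stdint.h>

        static unsigned clz32(uint32_t x)
        {
            unsigned n = 0;
            if (x == 0) return 32;
            if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; }
            if (!(x & 0xFF000000u)) { n += 8;  x <<= 8;  }
            if (!(x & 0xF0000000u)) { n += 4;  x <<= 4;  }
            if (!(x & 0xC0000000u)) { n += 2;  x <<= 2;  }
            if (!(x & 0x80000000u)) { n += 1; }
            return n;
        }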

  • You can do it in 12 cycles if you are sure about the input values (no carry from the low word into the high word):

    	movs	r2,r0			// ab
    	lsrs	r0,r0,#16		// a
    	lsrs	r3,r1,#16		// c
    	movs	r4,r0			// a
    	muls	r0,r3			// ac
    	muls	r3,r2			// ab*c
    	muls	r1,r2			// x
    	muls	r2,r4			// cd*a
    
    	lsls	r3,r3,#16
    	lsls	r2,r2,#16
    	adds	r2,r3
    	adcs	r0,r2