Hi, I am developing a fixed-point MP3 & ACELP decoder on an Arduino Due. I realize that bit-banding makes an RMW sequence atomic, but I notice that the fields in the Due's peripheral registers are multi-bit fields. Is there an atomic way to alter multiple bits? I realize that the atomic RMW probably lets the application/IRQ/NMI resolve which gets access to the bus, but if that is the only real use, wouldn't it have been simpler to use an atomic test-and-set as the SuperH does?

Examples like PERIPH_PTSR (bits 0 & 8 used) are classic. Do you write to separate bytes, perform two bit-band operations, or write 16 or 32 bits at once?

I do appreciate that this is very low level. I also note that TCM (tightly coupled memory) is mentioned just once and I'm not quite sure whether it refers to part of the cache or a PSX-like scratch-pad. The PSX was another machine that needed hand-written assembly language to get the full performance out of it.

Many thanks,
Sean
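To make the comparison concrete, here is a rough C sketch of the two options I'm weighing, assuming the standard Cortex-M3 peripheral bit-band mapping (alias = 0x42000000 + byte_offset*32 + bit*4) and a made-up register address. Each bit-band store is an atomic single-bit RMW, but a pair of them is not atomic as a unit, whereas a single full-width store changes both bits in one bus write:

#include <stdint.h>

/* Cortex-M3 peripheral bit-band alias: one 32-bit word per bit of the real register. */
#define BITBAND_PERIPH(addr, bit) \
    (*(volatile uint32_t *)(0x42000000u + (((uint32_t)(addr) - 0x40000000u) * 32u) + ((bit) * 4u)))

#define REG_ADDR  0x400E0100u                        /* hypothetical peripheral register address */
#define REG       (*(volatile uint32_t *)REG_ADDR)

void set_bits_0_and_8(void)
{
    /* Option A: one 32-bit store - both bits are written in a single bus transaction. */
    REG = (1u << 0) | (1u << 8);

    /* Option B: two bit-band stores - each is atomic on its own,
       but an interrupt can land between the two. */
    BITBAND_PERIPH(REG_ADDR, 0) = 1;
    BITBAND_PERIPH(REG_ADDR, 8) = 1;
}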
Yeah - you can use r0-r14 if your code is bottom level. LR isn't needed & SP can be stored away; interrupt entry stacks the caller-saved registers automatically. If you look into Thumb's STM & LDM, they're convenient for switch-style dispatch - multiple registers set up & a new program counter. There are also addressing modes unique to SP; I presume that is so C can allocate & access stack variables.

Only today did I work out why LSLS/LSRS/ASRS/RORS Rd,Rs (powerful instructions) use Rs[7:0] rather than Rs[4:0], i.e. the shift count can be 0-255. MOVS can set up 256 numbers directly, and many values wider than 16 bits can be built with a trick that constructs a 32-bit value in 32 bits of code in total, i.e.

MOVS Rd,#<immediate>
RORS Rd,Rd

and the shift instructions can manage a few more. Don't forget that code in (E)(E)(P)ROM may be read 32 bits at a time onto the bus, i.e. 2 instructions in 1 fetch. Some M0(+) SoCs also have tiny caches, e.g. 64 bytes direct-mapped, so setting up a number with code in the instruction stream rather than odd data reads might be a speed-up.

Now ARM has a long history of including barrel shifters implicitly in its data path, and I strongly suspect that the extra 3 bits consumed a tiny number (single digit) of transistors. I actually asked both Joseph Yiu & Steve Furber about MULS and shift/rotate, and neither remembers the exact reason, but Mr Yiu pointed out that SBCS uses NOT C to save transistors, so the count must be tiny.

I've always approached instruction sets as 'well, the designers must have done it this way for a good reason', and for the Cortex M0+ it was getting a 32-bit CPU into the same number of transistors as an 8-bit CPU. It is clunky, it is annoying and it has some things that make you want to punch the screen, but there ARE tricks to be had.

Going back to my 32-bit x 32-bit --> 64-bit signed multiply, what is so annoying is:

adds r0,r2        // add partial products - may carry out
movs r2,#0
adcs r2,r2        // capture C into r2 (0 or 1)
lsls r2,r2,#16    // move the carry up to bit 16

The ONLY instructions that take notice of C are ADCS, SBCS and branches. Yes, the M0+ is branch friendly, but a branch-based version is still slower than the 17 cycles this takes. I HAVE considered bit-stuffing using REV, i.e. two 8-bit x 8-bit --> 16-bit, but it's still too much setup. You can imagine what a speed-up from 17 to 15 cycles is worth.

More or less, the multiply & count-leading-zeros are the only 2 fragments needed for MP3 decoding. Joseph Yiu was interested in finding out if an M0+ at ≈48 MHz can do it. I'm also keen to try it out. 100% asm required ;-)
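For reference, the decomposition those sequences implement looks like this in plain C (an unsigned sketch only; the signed version needs a sign fix-up on top, and the names are just illustrative):

#include <stdint.h>

/* 32x32 -> 64 unsigned multiply built from four 16x16 -> 32 partial products,
   i.e. what MULS can provide on the Cortex-M0/M0+. */
void umul32x32(uint32_t x, uint32_t y, uint32_t *hi, uint32_t *lo)
{
    uint32_t a = x >> 16, b = x & 0xFFFFu;   /* x = a:b */
    uint32_t c = y >> 16, d = y & 0xFFFFu;   /* y = c:d */

    uint32_t bd = b * d;
    uint32_t ad = a * d;
    uint32_t bc = b * c;

    uint32_t mid = ad + bc;                           /* cross terms */
    uint32_t mid_carry = (mid < ad) ? 0x10000u : 0u;  /* the carry the ADCS/LSLS dance is about */

    uint32_t low = bd + (mid << 16);
    *lo = low;
    *hi = a * c + (mid >> 16) + mid_carry + ((low < bd) ? 1u : 0u);
}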
You can do it in 12 cycles if you are sure about the input values (no carry from the low to the high word):
movs r2,r0       // ab
lsrs r0,r0,#16   // a
lsrs r3,r1,#16   // c
movs r4,r0       // a
muls r0,r3       // ac
muls r3,r2       // ab*c
muls r1,r2       // x
muls r2,r4       // cd*a
lsls r3,r3,#16
lsls r2,r2,#16
adds r2,r3
adcs r0,r2
After the 32-bit x 32-bit --> 64-bit, the next two most important code fragments are 32-bit x 32-bit --> top 32 bits of the 64-bit result (generally termed Mulshift32 in macros) and count leading zeros. The former takes me 15 cycles, the latter 10 cycles by placing the 64-byte lookup (the De Bruijn sequence) on the Cortex M0's 'zero page': the bottom 256 bytes of memory can have their address set up with a single immediate, and since the load offset is scaled by the size of the element read, the bottom 1024 bytes can be reached that way.

Along with the CLZ tables, I've put the fixed-point SIN & COS lookup tables into that space. I DO have a tiny 64-byte cache, and while EEPROM can only be accessed at 24 MHz (you have to include 1 wait state), the SoC does try to read 32 bits at a time, so sometimes spending a little more code to keep the thing from using lookups (which could thrash the cache) is faster.

It is more or less an experiment to see if a Cortex M0 can decode 32 kb/s ACELP in real time. PragmatIC and others are now printing multi-layer ICs onto plastic (so CPU design is 3D), so if a breakfast cereal box (for example) can have a printed CPU & ROM and use backscatter WiFi for power & comms, then everything in the shop can talk. I presume that for such a large market custom SoCs will be designed and built, so all I am testing is just how little silicon (or in this case metal on plastic) can do the job. Hence 100% assembly language.

I have stared at those 3 instructions needed to place the C bit into bit 16 of a register and it is driving me mad. Only ADCS & SBCS use C as input. SBCS Rd,Rn is actually Rd = Rd - Rn - NOT C (i.e. an extra 1 is subtracted when C is clear), so SBCS R2,R2 leaves R2 as either $00000000 (if C was set) or $ffffffff (if C was clear). That saves a MOVS R2,#0, but I cannot see how to use it...
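For anyone curious, the usual C shape of the De Bruijn trick is sketched below. It uses the classic 32-entry table rather than my 64-byte one (so the exact constants differ), but it shows why the cycle count is fixed: five OR-shifts to smear the top bit down, one multiply, one table read.

#include <stdint.h>

/* Classic de Bruijn log2 for a non-zero 32-bit value; clz = 31 - log2. */
static const uint8_t log2_tab[32] = {
     0,  9,  1, 10, 13, 21,  2, 29,
    11, 14, 16, 18, 22, 25,  3, 30,
     8, 12, 20, 28, 15, 17, 24,  7,
    19, 27, 23,  6, 26,  5,  4, 31
};

uint32_t clz32(uint32_t v)            /* v must be non-zero */
{
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16;                     /* smear the top set bit downwards */
    return 31u - log2_tab[(v * 0x07C4ACDDu) >> 27];
}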
15 cycles for the multiplication? For _all_ input values? OK, I need to do some more research.
Ha, this "zero page" idea is great. Too bad that on some CM0 parts the first 128 bytes are used by the vector table.
Concerning CLZ, I never came across De Bruijn; I only did it with a binary search, but that takes 16 to 32 cycles. A constant number of cycles (and even fewer) is great ...
Regarding "SBCS Rn,Rn": I understand that this could drive one crazy - you find a cool trick but no place to use it :-) Too bad most of those tricks can't be used in an RTOS, as most of what it does is moving data from A to B (list handling).