Bit-Banding. Only 1 bit at a time?

Hi,
I am developing a fixed-point MP3 & ACELP decoder on an Arduino Due. I realize that bit-banding makes a read-modify-write (RMW) sequence atomic, but I notice that many fields in the Due's hardware registers are multi-bit. Is there an atomic way to alter multiple bits? I realize that the atomic RMW probably lets the application, IRQ and NMI code resolve which gets access to the bus, but if that is the only real use, wouldn't it have been simpler to provide an atomic test-and-set, as the SuperH does?

Registers like PERIPH_PTSR (bits 0 & 8 used) are classic examples. Do you write to separate bytes, perform two bit-band operations, or write 16 or 32 bits at once?
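
For readers following along, the Cortex-M3 bit-band alias arithmetic can be sketched in C roughly as below. This is a minimal sketch and not something posted in the thread: `HYPOTHETICAL_REG` is a made-up address used only for illustration, and the helper names `bitband_alias` and `update_field` are my own. It shows that each alias word maps to exactly one bit, so flags at bits 0 and 8 need two separate alias writes, whereas a multi-bit field is normally changed with an ordinary masked 32-bit read-modify-write.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M3 peripheral bit-band region and the start of its alias region. */
#define PERIPH_BB_BASE   0x40000000u
#define PERIPH_BB_LIMIT  0x400FFFFFu
#define PERIPH_BB_ALIAS  0x42000000u

/* Hypothetical register address, chosen only for illustration. */
#define HYPOTHETICAL_REG 0x40000124u

/* Address of the alias word that maps to one bit (0..31) of the
 * 32-bit register at reg_addr. Each bit gets its own 4-byte alias word,
 * so one byte of register space spans 32 bytes of alias space. */
uint32_t bitband_alias(uint32_t reg_addr, unsigned bit)
{
    assert(reg_addr >= PERIPH_BB_BASE && reg_addr <= PERIPH_BB_LIMIT);
    assert(bit < 32u);
    return PERIPH_BB_ALIAS + (reg_addr - PERIPH_BB_BASE) * 32u + bit * 4u;
}

/* Plain (non-atomic) masked update of a multi-bit field: only the bits
 * selected by mask are replaced by the corresponding bits of new_bits. */
uint32_t update_field(uint32_t old_value, uint32_t mask, uint32_t new_bits)
{
    return (old_value & ~mask) | (new_bits & mask);
}
```

On real hardware the masked update would still need protecting against interrupts if it must be atomic, for example by briefly masking interrupts or by using LDREX/STREX; which of these suits the Due is left open here.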

I do appreciate that this is very low level. I also note that TCM (tightly coupled memory) is mentioned just once, and I'm not quite sure whether it refers to part of the cache or to a PSX-like scratch-pad. The PSX was another machine that needed hand-written assembly language to get the full performance out of it.

Many thanks,
Sean

0 Sean Dunlevy over 7 years ago in reply to 42Bastian Schick

After the 32-bit x 32-bit --> 64-bit multiply, the next two most important code fragments are 32-bit x 32-bit --> top 32 bits of the 64-bit result (generally termed Mulshift32 in macros) and count-leading-zeros (CLZ). The former takes me 15 cycles, the latter 10 cycles. I get the latter by placing the 64-byte lookup table (indexed via the De Bruijn sequence) in the Cortex M0's 'Zero Page': an address in the bottom 256 bytes of memory can be set up with a single immediate, and since the load offset is scaled by the size of the element read, the bottom 1024 bytes can be accessed this way.
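
To make the Mulshift32 operation concrete, here is a minimal C sketch of what it computes. It is not the 15-cycle assembly routine described above, which was not posted. It assumes only a 32 x 32 --> 32 multiplier, as on the Cortex M0, so it builds the high word from 16-bit partial products, and all function names are my own. `mulshift32_ref` is the plain 64-bit version to compare against.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: let the compiler form the full 64-bit product.
 * Assumes >> on a negative int64_t is an arithmetic shift (true for GCC/Clang). */
int32_t mulshift32_ref(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* High 32 bits of the unsigned 64-bit product, using only 16x16->32 multiplies. */
uint32_t umulhi32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;
    uint32_t ll = al * bl, lh = al * bh, hl = ah * bl, hh = ah * bh;
    /* Sum the parts that land in bits 16..31; its carry feeds the high word. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);
    return hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}

/* Signed version: correct the unsigned high word for negative operands. */
int32_t mulshift32(int32_t a, int32_t b)
{
    uint32_t hi = umulhi32((uint32_t)a, (uint32_t)b);
    if (a < 0) hi -= (uint32_t)b;
    if (b < 0) hi -= (uint32_t)a;
    return (int32_t)hi;  /* wraps to two's complement on common compilers */
}

/* Check that the partial-product version agrees with the reference. */
void mulshift32_check(int32_t a, int32_t b)
{
    assert(mulshift32(a, b) == mulshift32_ref(a, b));
}
```

In fixed-point terms, multiplying two Q31 values this way yields a Q30 result, which is why decoders often follow it with a one-bit left shift.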

Along with the CLZ tables, I've put the fixed-point SIN & COS lookup tables into that space. I DO have a tiny 64-byte cache, and while EEPROM can only be accessed at 24MHz (you have to include 1 wait state), the SoC does try to read 32 bits at a time, so sometimes a little more code in exchange for avoiding lookups (that could thrash the cache) is faster.

It is more or less an experiment to see if a Cortex M0 can decode 32 kb/s ACELP in real time. That being the case, PragmatIC and others are now printing multi-layer ICs onto plastic (so CPU design is 3D), so if a breakfast cereal (for example) can have a printed CPU & ROM and use backscatter WiFi for power & comms, then everything in the shop can talk.

I presume that for such a large market, custom SoCs will be designed and built, so all I am testing is just how little silicon (or in this case metal on plastic) can do the job. Hence 100% assembly language.

I have stared at those 3 instructions needed to place the C bit into bit 16 of a register and it is driving me mad. Only ADCS & SBCS use C as input. SBCS Rd,Rn actually computes Rd = Rd - Rn - NOT C (i.e. it subtracts an extra 1 when C is not set), so SBCS R2,R2 leaves R2 containing either $00000000 (if C was set) or $ffffffff (if C was clear). It saves a MOVS R2,#0, but I cannot see how to use it...
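
The arithmetic behind the two carry-consuming instructions can be modelled in a few lines of C. This is only a sketch of the register results (flags are not modelled), the function names are my own, and the three-instruction sequence in the comment (MOVS R2,#0 / ADCS R2,R2 / LSLS R2,R2,#16) is my assumption about what is meant, since the original sequence is not shown in this part of the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Register result of Thumb ADCS Rd,Rm: Rd + Rm + C (carry is 0 or 1). */
uint32_t adcs_result(uint32_t rd, uint32_t rm, unsigned carry)
{
    assert(carry <= 1u);
    return rd + rm + carry;
}

/* Register result of Thumb SBCS Rd,Rm: Rd - Rm - NOT(C). */
uint32_t sbcs_result(uint32_t rd, uint32_t rm, unsigned carry)
{
    assert(carry <= 1u);
    return rd - rm - (1u - carry);
}

/* Assumed 3-instruction sequence:
 *   MOVS R2,#0 ; ADCS R2,R2 ; LSLS R2,R2,#16
 * which leaves C in bit 16 of R2. */
uint32_t carry_to_bit16(unsigned carry)
{
    uint32_t r2 = 0u;                  /* MOVS R2,#0       */
    r2 = adcs_result(r2, r2, carry);   /* ADCS R2,R2       */
    return r2 << 16;                   /* LSLS R2,R2,#16   */
}

/* SBCS R2,R2: the previous contents of R2 cancel out, leaving a mask
 * that is all zeros when C was set and all ones when C was clear. */
uint32_t carry_to_mask(uint32_t r2_before, unsigned carry)
{
    return sbcs_result(r2_before, r2_before, carry);
}
```

The model makes the difficulty visible: the SBCS mask is the inverse of the carry, so turning it into a single set bit at position 16 still appears to need further instructions.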

0 42Bastian Schick over 7 years ago in reply to Sean Dunlevy

A 15-cycle multiplication? For _all_ input values? OK, I need to do some more research.

Ha, this "zero page" idea is great. Too bad that on some CM0 parts the first 128 bytes are used by the vector table.

Concerning CLZ, I never came across De Bruijn; I only did it with a binary search. But that takes from 16 to 32 cycles. A constant number of cycles (and even fewer) is great ...
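
For comparison, the two CLZ approaches mentioned here can be sketched in C as follows. This is the widely published 32-entry De Bruijn variant (multiplier 0x07C4ACDD), not necessarily the 64-byte table used in the routine described above, and the function names are my own. The De Bruijn version does the same work for every non-zero input, whereas the binary search takes a data-dependent path.

```c
#include <assert.h>
#include <stdint.h>

/* Maps the top 5 bits of (folded value * 0x07C4ACDD) to floor(log2(v)). */
static const uint8_t debruijn_log2[32] = {
     0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
     8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
};

/* Count leading zeros via a De Bruijn multiply and table lookup. */
unsigned clz32_debruijn(uint32_t v)
{
    if (v == 0u) return 32u;
    /* Smear the highest set bit downwards so v becomes 2^(k+1) - 1. */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return 31u - debruijn_log2[(uint32_t)(v * 0x07C4ACDDu) >> 27];
}

/* Count leading zeros by binary search over halves of the word. */
unsigned clz32_binary(uint32_t v)
{
    unsigned n = 0u;
    if (v == 0u) return 32u;
    if ((v >> 16) == 0u) { n += 16u; v <<= 16; }
    if ((v >> 24) == 0u) { n += 8u;  v <<= 8;  }
    if ((v >> 28) == 0u) { n += 4u;  v <<= 4;  }
    if ((v >> 30) == 0u) { n += 2u;  v <<= 2;  }
    if ((v >> 31) == 0u) { n += 1u; }
    return n;
}

/* Check that both methods agree for a given input. */
void clz32_check(uint32_t v)
{
    assert(clz32_debruijn(v) == clz32_binary(v));
}
```

Whether the multiply-based version is actually faster depends on the multiplier fitted to the particular Cortex M0 implementation, so any cycle comparison is best measured on the target.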

Regarding "SBCS Rn,Rn": I understand that this could drive one crazy, if you find a cool trick but no place to use it :-)
Too bad most of those tricks can't be used in an RTOS, as most of what it does is move data from A to B (list handling).