How long bitfields on which ARM?

I need to be able to handle long bitfields as effectively as possible. Right now I need up to 64 bits in length.

Are there instructions to set, clear and test individual bits in one cycle on some of the architectures? Which? In particular, will the M0+ handle it (it only supports a reduced Thumb-2 subset)? If not, which comparable core would?

What I find confuses me. In a Thumb-2 reference card I found this: "Width of bitfield. <width> + <lsb> must be <= 32." But some 5 years ago I did some programming on an STR91xF ARM9 processor, and there was some talk about l-o-n-g bit arrays that could be handled in one cycle, with some 1024 bytes of microcoded table for this. (See, I am already long afloat, in deep water! Maybe this was for all kinds of masks?)

Also, what would happen if I need to set or clear (say) bit 27 and bit 60 in one instruction? Will compilers (which?) then treat this as two full 32-bit words or as one 64-bit word, or will they touch only byte 3 and byte 7 (counting from byte 0) and do the trick on those? Is the barrel shifter part of this?

Aclassifier

Øyvind Teig | Some of my blog notes

  • Hi Øyvind, welcome to the community.

    Uhm, which question should I answer...

    The question is unfortunately not easy to answer, as there are various tricks that can be performed in special cases.

    A quick, slightly imprecise answer: all Cortex-M microcontrollers are 32-bit, which means they can handle at most 32 bits at a time in a general-purpose register.

    I'll give some (crazy) answers to (parts of) the questions that come to mind, but I'll leave the answer 'open', because I cannot fully cover all possible ARM-based microcontrollers; I do not know all the individual hardware.

    The answer would also depend to a large degree on whether or not the operations are 'constant' or if they're 'dynamic'.

    If the operations you need are often constant, you'll have more possibilities for finding quicker ways to solve them.

    In bitfield cases, I strongly believe that it will pay to operate on 32 bits rather than on 8 bits, because loading and storing a byte costs the same number of clock cycles as loading and storing a 32-bit word. Loading, modifying and storing a 32-bit value gives you more to work with per access than loading, modifying and storing an 8-bit value.
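
    As a minimal sketch of the word-wise approach, a 64-bit field can be kept as two 32-bit words, so that each set, clear or test touches exactly one word. The type and function names below (bitfield64_t, bf64_set, bf64_clear, bf64_test) are made up for illustration; with this layout, bit 27 lands in word 0 and bit 60 lands in word 1, so changing both takes two separate read-modify-write operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-bit bitfield stored as two 32-bit words:
 * w[0] holds bits 31:0, w[1] holds bits 63:32. */
typedef struct {
    uint32_t w[2];
} bitfield64_t;

/* Set one bit: pick the word with bit / 32, the position with bit % 32. */
static inline void bf64_set(bitfield64_t *f, unsigned bit)
{
    assert(bit < 64);
    f->w[bit >> 5] |= UINT32_C(1) << (bit & 31u);
}

/* Clear one bit in the word that contains it. */
static inline void bf64_clear(bitfield64_t *f, unsigned bit)
{
    assert(bit < 64);
    f->w[bit >> 5] &= ~(UINT32_C(1) << (bit & 31u));
}

/* Return true if the bit is set. */
static inline bool bf64_test(const bitfield64_t *f, unsigned bit)
{
    assert(bit < 64);
    return ((f->w[bit >> 5] >> (bit & 31u)) & 1u) != 0;
}
```

    When the bit number is a compile-time constant, a compiler can usually fold the index and the mask into constants, which is the 'constant' case mentioned above; how many cycles that takes still depends on the core and the memory system.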

    However, there are tricks that one should keep in mind.

    Say you want to set bits 15:8 of a 64-bit word: you load a register with the value 0xff and store it as a single byte (STRB) to the second byte of the word in memory (assuming little-endian), without first loading the 'current' value from memory.
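
    A sketch of that byte-store trick in portable C, assuming the field is byte-aligned on an 8-bit lane. The function name bf64_store_lane is hypothetical. Because the lane-to-address mapping depends on byte order, the sketch checks endianness at run time; on a known little-endian target the lane number is simply the byte offset.

```c
#include <assert.h>
#include <stdint.h>

/* Overwrite one 8-bit lane of a 64-bit field with a single byte store,
 * without reading the old value first.
 * Lane 0 = bits 7:0, lane 1 = bits 15:8, ..., lane 7 = bits 63:56. */
static inline void bf64_store_lane(uint64_t *field, unsigned lane, uint8_t value)
{
    assert(lane < 8);

    /* Run-time byte-order probe: on little-endian the first byte of 1 is 1. */
    const uint16_t probe = 1;
    const int little_endian = *(const unsigned char *)&probe == 1;

    /* Accessing the object through unsigned char is allowed by the C standard. */
    unsigned char *bytes = (unsigned char *)field;
    bytes[little_endian ? lane : 7u - lane] = value;
}
```

    This only works when all eight bits of the lane are being replaced; for a partial lane you are back to load, modify and store.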

    But there is more than you would usually think of. With the bit-banding feature (optional on the Cortex-M3 and Cortex-M4), a certain region of memory is "magnified" into an alias area where each word address maps to a single bit, so that if you write 0xffffffff to one of these addresses, you'll set only that one bit; only bit 0 of the written value is used.

    In some cases, it will be faster to set a few bits this way, instead of loading the 32-bit value, modifying it and storing it back.
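
    For reference, the alias address of a bit can be computed as sketched below. The base addresses are the usual Cortex-M3/M4 SRAM bit-band values and are an assumption here; check the reference manual of the actual device, since bit-banding may be absent. The function name bitband_alias_addr is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed SRAM bit-band region (1 MB) and its alias region (32 MB).
 * These are the common Cortex-M3/M4 values; verify them for your part. */
#define BITBAND_SRAM_BASE   UINT32_C(0x20000000)
#define BITBAND_SRAM_ALIAS  UINT32_C(0x22000000)
#define BITBAND_SRAM_SIZE   UINT32_C(0x00100000)

/* Each bit of the region gets its own 32-bit word in the alias area:
 * alias = alias_base + byte_offset * 32 + bit_number * 4. */
static inline uint32_t bitband_alias_addr(uint32_t byte_addr, unsigned bit)
{
    assert(bit < 8);
    assert(byte_addr >= BITBAND_SRAM_BASE);
    assert(byte_addr - BITBAND_SRAM_BASE < BITBAND_SRAM_SIZE);

    return BITBAND_SRAM_ALIAS
         + (byte_addr - BITBAND_SRAM_BASE) * 32u
         + bit * 4u;
}
```

    On a target that has the feature, writing 1 or 0 through a volatile uint32_t pointer to the returned address would set or clear that single bit, and reading it would return 0 or 1. The function above only computes the address, so it can be checked on any host.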

    If you can combine the Bit-banding with a DMA, then you might be able to boost the performance, but this depends on the job.

    I believe that you can 'shift' a large bitfield using the DMA (I have not tried this, though), by setting a source address, a destination address and a length in the bit-band alias region, then copying the block by starting the DMA.

    The Cortex-M7 can have 64-bit (double-precision) floating-point registers. At this point in time, I do not know whether it's possible to do some hacky tricks where you keep your value in a 64-bit register and then change the upper or lower half by loading a new value into a 32-bit floating-point register (they share the same register file, so two 32-bit registers overlay one 64-bit register).

    Some Cortex-M processors have special memory addresses that you can also benefit from.

    Freescale have their Bit Manipulation Engine. While not strictly designed for bit-manipulation, you can actually do some cool tricks with STMicroelectronics's Chrom-ART, if you abuse it a bit and use Pixel Format Conversion into and out of the Bit Band region. So you can think of this as some microcontrollers 'extend' the instruction set by changing the functionality of hardware addresses. The Chrom-ART accelerator is designed to be a graphics accelerator with blending capabilities, but it can be used for more than that. In addition to what's mentioned already, it can also be used for fast data decompression by using the palette to expand index values into 3 or 4 bytes; this can be combined with blending to take the decompression one step further.

    If you can pick and choose any ARM microcontroller/processor, then a 64-bit ARMv8-A core such as the Cortex-A53 or Cortex-A57 might be interesting, because its general-purpose registers are 64 bits wide in AArch64 state. Note that the Cortex-A15 (used, for example, in the Allwinner A80) is a 32-bit ARMv7-A core; it only gives you 64-bit and 128-bit registers through its NEON unit.

    Cortex-M3 and later can shift a 64-bit value (held in two registers) by a single position in two clock cycles.

    If you need to shift by more than a single position, then you'll most likely need 3 clock cycles, but you should never need more than 3 if you have the values in registers already.
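
    The multi-position case can be sketched in C as three operations on a register pair: shift the high word, merge in the bits that cross over from the low word, then shift the low word. The names pair64_t and pair64_shl are hypothetical, and whether a given compiler turns this into exactly three instructions is something I have not verified.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as it would sit in two registers. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} pair64_t;

/* Shift the pair left by n positions, 1 <= n <= 31.
 * (n == 0 and n >= 32 are excluded: a C shift by 32 is undefined.) */
static inline pair64_t pair64_shl(pair64_t v, unsigned n)
{
    assert(n >= 1 && n <= 31);

    pair64_t r;
    r.hi = (v.hi << n) | (v.lo >> (32u - n)); /* high word plus carried-over bits */
    r.lo = v.lo << n;                         /* low word */
    return r;
}
```

    With Thumb-2 the OR can take a shifted second operand, which is why the shift-and-merge of the high word does not need a separate instruction for the right shift.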

    You should also know that the MUL instruction on Cortex-M4 is a single-cycle instruction, which means that if you need to 'duplicate' bit-fields, you may want to multiply, for instance, a 4-bit value by 0x11111111 to copy it into all eight nibbles of a word. I often multiply by 17 (0x11) and 257 (0x0101) for color values and when I do interpolation. Using multiply is a fairly good idea when we're talking about patterns in the bit-fields.
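
    A small sketch of the multiply trick, assuming the goal is to replicate a narrow value across a wider one. Multiplying by a constant with a 1 at every position where a copy should start gives clean copies as long as the copies do not overlap. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Copy a 4-bit value into all eight nibbles of a 32-bit word. */
static inline uint32_t replicate_nibble(uint32_t nibble)
{
    assert(nibble <= 0xFu);
    return nibble * UINT32_C(0x11111111);
}

/* Expand a 4-bit color component to 8 bits: 0xN becomes 0xNN (multiply by 17). */
static inline uint32_t expand_4_to_8(uint32_t v4)
{
    assert(v4 <= 0xFu);
    return v4 * 17u;
}

/* Expand an 8-bit component to 16 bits: 0xNN becomes 0xNNNN (multiply by 257). */
static inline uint32_t expand_8_to_16(uint32_t v8)
{
    assert(v8 <= 0xFFu);
    return v8 * 257u;
}
```

    For example, replicate_nibble(0xA) gives 0xAAAAAAAA, and expand_4_to_8 maps the full 4-bit range 0..15 onto 0..255 with both end points exact.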

    For operations on the top 16 bits of a word, remember that the MOVT instruction does not change the low 16 bits of a register.
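
    In C terms, the effect of MOVT on a register value can be modelled as below. This is only a model of the result, not a way to force the instruction; the function name movt_like is hypothetical, and a compiler is free to pick other instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Model of MOVT: replace bits 31:16 with imm16, keep bits 15:0 unchanged. */
static inline uint32_t movt_like(uint32_t reg, uint32_t imm16)
{
    assert(imm16 <= 0xFFFFu);
    return (reg & UINT32_C(0x0000FFFF)) | (imm16 << 16);
}
```

    Since MOVT takes its 16-bit value as an immediate, this helps when the new upper half is a constant, which again is the 'constant' case.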

    Within the Cortex-M family, the Cortex-M4 and Cortex-M7 are probably the best candidates.

    The Cortex-M4 supports multiplying 32 bits by 32 bits with a 64-bit result (UMULL/SMULL) in a single clock cycle.

    It also has a multiply-and-accumulate with a 64-bit accumulator (UMLAL/SMLAL), which could also do some good in 64-bit field operations.
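
    One way a long multiply can help with 64-bit fields, sketched in C: multiplying a 32-bit word by a power of two yields both halves of the shifted result at once, and the accumulate form adds that into an existing 64-bit value. The casts below are the usual C idiom for a 32 x 32 to 64-bit product; that a compiler maps them to UMULL and UMLAL is common but not guaranteed, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Place a 32-bit word into a 64-bit result, shifted left by n (0 <= n <= 31),
 * using a widening multiply by 2^n instead of two shifts and an OR. */
static inline uint64_t widen_shift_left(uint32_t x, unsigned n)
{
    assert(n < 32);
    return (uint64_t)x * (UINT32_C(1) << n);
}

/* 32 x 32 multiply accumulated into a 64-bit value (wraps modulo 2^64). */
static inline uint64_t mac64(uint64_t acc, uint32_t a, uint32_t b)
{
    return acc + (uint64_t)a * b;
}
```

    If the shifted pieces do not overlap the bits already in the accumulator, the addition in mac64 behaves like an OR, so a field can be assembled from several words this way.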

    The Cortex-M7 will generally be faster than the Cortex-M4 at the same clock rate, because its dual-issue pipeline can often perform a load or store in parallel with modifying data.

    If you can reveal what kind of job you're up to, it might be easier to come up with suggestions.

    If you're building a bit-buffer (e.g. a cache), for instance for compression/decompression, let me know.
