
How long bitfields on which ARM?

I need to be able to handle long bitfields as effectively as possible. Right now I need up to 64 bits in length.

Are there instructions to set, clear and test individual bits in one cycle available for some of the architectures? Which? Particularly, will the M0+ handle it (which only does reduced thumb2)? If not, which comparable?

What I find confuses me. In a thumb2 ref card I found that "Width of bitfield. <width> + <lsb> must be <= 32." But some 5 years ago I programmed some on a STR91xF ARM9 processor, and there was some talk about l-o-n-g bit arrays that could be handled in one cycle, but there was some 1024 bytes of microcoded table for this. (See, I am already long afloat, in deep water! Maybe this was for all kind of masks?)

Also, what would happen if I need to set or clear (like) bit 27 and bit 60 in one instruction? Will compilers (which?) then treat a full 32 bits word times two, a 64 bits word, or will it handle only byte 3 and byte 7 (starting at byte 0) and do the trick on them? Is the barrel shifter part of this?

Aclassifier

Øyvind Teig | Some of my blog notes

  • Hi Øyvind, welcome to the community.

    Uhm, which question should I answer...

    The question is unfortunately not easy to answer, as there are various tricks that can be performed in special cases.

    A quick answer, but slightly imprecise: All Cortex-M microcontrollers are 32-bit, which means they can handle at most 32 bits at a time.

    I'll give some (crazy) answers to (parts of) the questions that come to mind, but I'll leave the answer 'open', because I cannot fully cover all possible ARM based microcontrollers, as I do not know all the individual hardware.

    The answer would also depend to a large degree on whether or not the operations are 'constant' or if they're 'dynamic'.

    If the operations you need are often constant, you'll have more possibilities for finding quicker ways to solve them.

    In bitfield cases, I strongly believe that it will pay to operate on 32 bits rather than on 8 bits, because loading and storing a byte costs exactly the same number of clock cycles as loading and storing a 32-bit word. Loading, modifying and storing a 32-bit value might give much better possibilities than loading, modifying and storing an 8-bit value.
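    A minimal sketch of the word-wise approach, assuming the 64-bit field is kept as two 32-bit words. The names `bitfield64_t`, `bf_set`, `bf_clear` and `bf_test` are made up for this illustration; each helper touches only the one 32-bit word that holds the bit.

```c
#include <stdint.h>

#define BF_WORDS 2u  /* 2 x 32 bits = 64 bits (hypothetical size) */

typedef struct {
    uint32_t w[BF_WORDS];
} bitfield64_t;

/* Set bit n (0..63): only the 32-bit word holding that bit is modified. */
static inline void bf_set(bitfield64_t *bf, unsigned n)
{
    bf->w[n >> 5] |= (uint32_t)1 << (n & 31u);
}

/* Clear bit n (0..63). */
static inline void bf_clear(bitfield64_t *bf, unsigned n)
{
    bf->w[n >> 5] &= ~((uint32_t)1 << (n & 31u));
}

/* Return 1 if bit n (0..63) is set, otherwise 0. */
static inline int bf_test(const bitfield64_t *bf, unsigned n)
{
    return (int)((bf->w[n >> 5] >> (n & 31u)) & 1u);
}
```

    With this layout, setting bit 27 and bit 60 means one read-modify-write on word 0 and one on word 1. When the index is a compile-time constant, a compiler can usually fold the shift into a constant mask.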

    However. There are tricks, that one should think of.

    Say you want to set bits 15:8 of a 64-bit word: you can load a register with the value 0xff and store it with a byte store (STRB) directly to the byte that holds those bits (byte 1 on a little-endian system), without first loading the 'current' value from memory.
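    The store-without-load idea can be sketched in C as a single byte write into the 64-bit word. This is only an illustration: `set_bits_15_8` and `is_little_endian` are hypothetical helpers, and the byte index depends on the endianness of the target (Cortex-M parts are normally little-endian).

```c
#include <stdint.h>

/* Returns 1 on a little-endian target (the usual Cortex-M configuration). */
static int is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Set bits 15:8 of *word with one byte store; the old value is never read. */
static void set_bits_15_8(uint64_t *word)
{
    unsigned char *bytes = (unsigned char *)word;  /* byte access to any object is allowed in C */
    bytes[is_little_endian() ? 1 : 6] = 0xFFu;     /* byte 1 (LE) or byte 6 (BE) holds bits 15:8 */
}
```

    This only works when all eight bits of the byte are to be set (or cleared); for a partial byte you are back to load, modify, store.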

    But there is more than you would usually think of. With the Bit-banding feature, a certain region of memory is "magnified" into an alias area where each 32-bit word address maps to a single bit. Only bit 0 of the value you write is used, so if you write 0xffffffff to one of these alias addresses, you'll set only one bit; four such writes set 4 bits.

    In some cases, it will be faster to set those 4 bits this way, instead of loading the 32-bit value, modifying it and storing it back.
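    For reference, a sketch of how a bit-band alias address is computed, assuming the standard Cortex-M3/M4 SRAM mapping (bit-band region at 0x20000000, alias region at 0x22000000). The helper name `bitband_alias` is made up for this example, and whether a given chip implements bit-banding at all has to be checked in its datasheet.

```c
#include <stdint.h>

#define BITBAND_SRAM_BASE   0x20000000u  /* start of the SRAM bit-band region */
#define BITBAND_SRAM_ALIAS  0x22000000u  /* start of the matching alias region */

/* Address of the alias word that maps to bit 'bit' (0..7) of the byte at 'addr'.
   Each byte of the bit-band region occupies 32 bytes (8 words) of alias space. */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    return BITBAND_SRAM_ALIAS + (addr - BITBAND_SRAM_BASE) * 32u + bit * 4u;
}
```

    On a device with bit-banding, `*(volatile uint32_t *)bitband_alias(addr, bit) = 1;` would then set the bit and writing 0 would clear it, in both cases without a read-modify-write sequence in software.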

    If you can combine the Bit-banding with a DMA, then you might be able to boost the performance, but this depends on the job.

    I believe that you can 'shift' a large bitfield using the DMA (I have not tried this, though), by setting a starting address, and ending address, and a length, then copying the block by starting the DMA.

    The Cortex-M7 can have 64-bit (double-precision) floating point registers. At this point in time, I do not know whether it's possible to do some hacky tricks, where you keep your value in a 64-bit register and then change the contents of the upper/lower part of that register by loading a new value into a 32-bit floating point register (they share the same register file, so that two 32-bit registers share one 64-bit register).

    Some Cortex-M processors have special memory addresses that you can also benefit from.

    Freescale have their Bit Manipulation Engine. While not strictly designed for bit-manipulation, you can actually do some cool tricks with STMicroelectronics's Chrom-ART, if you abuse it a bit and use Pixel Format Conversion into and out of the Bit Band region. So you can think of this as some microcontrollers 'extend' the instruction set by changing the functionality of hardware addresses. The Chrom-ART accelerator is designed to be a graphics accelerator with blending capabilities, but it can be used for more than that. In addition to what's mentioned already, it can also be used for fast data decompression by using the palette to expand index values into 3 or 4 bytes; this can be combined with blending to take the decompression one step further.

    If you can pick and choose any ARM microcontroller/processor, then a 64-bit ARMv8-A core such as the Cortex-A53 or Cortex-A57 might be interesting, because there the general-purpose registers are 64 bits wide. (The Cortex-A15, used for instance in the Allwinner A80, is still a 32-bit ARMv7-A core, although its NEON unit has 64-bit and 128-bit registers.)

    Cortex-M3 and later have the ability to single-shift a 64 bit value using two clock cycles.

    If you need to shift more than a single position, then you'll most likely need 3 clock cycles, but you should never need more than 3 if you have the values in registers already.
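    What such a 64-bit shift looks like when the value lives in two 32-bit registers can be sketched in C as follows. The type `u64_pair` and the functions `shl1` and `shln` are hypothetical names; the instruction mapping mentioned in the comments is an assumption about what a compiler or a hand-written routine might use, not measured output.

```c
#include <stdint.h>

typedef struct {
    uint32_t lo;  /* bits 31:0  */
    uint32_t hi;  /* bits 63:32 */
} u64_pair;

/* Shift left by one: the bit leaving 'lo' becomes bit 0 of 'hi'.
   In assembly this can be two instructions (e.g. ADDS lo,lo,lo / ADCS hi,hi,hi). */
static u64_pair shl1(u64_pair v)
{
    u64_pair r;
    r.hi = (v.hi << 1) | (v.lo >> 31);
    r.lo = v.lo << 1;
    return r;
}

/* Shift left by n, for 1 <= n <= 31: two shifts plus one OR with a shifted
   operand, which is where the third instruction comes from. */
static u64_pair shln(u64_pair v, unsigned n)
{
    u64_pair r;
    r.hi = (v.hi << n) | (v.lo >> (32u - n));
    r.lo = v.lo << n;
    return r;
}
```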

    You should also know that the MUL instruction on Cortex-M4 is a single-cycle instruction, which means that if you need to 'duplicate' bit-fields, you may want to multiply for instance a 4-bit value by 0x44444421. I often multiply by 17 (0x11) and 257 (0x0101) for color values and when I do interpolation. Using multiply is a fairly good idea, when we're speaking about patterns in the bit-fields.
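    The multiply trick with 17 (0x11) and 257 (0x0101) can be illustrated like this. The function names are made up for the example, and the 0x11111111 case is an extra illustration of the same idea (replicating a nibble across a whole word), not something taken from the text above.

```c
#include <stdint.h>

/* Duplicate a 4-bit value into both nibbles of a byte: 0xA -> 0xAA. */
static uint32_t expand4to8(uint32_t v4)
{
    return (v4 & 0xFu) * 0x11u;
}

/* Duplicate an 8-bit value into both bytes of a halfword: 0x3C -> 0x3C3C. */
static uint32_t expand8to16(uint32_t v8)
{
    return (v8 & 0xFFu) * 0x0101u;
}

/* Replicate a 4-bit value into all eight nibbles of a 32-bit word. */
static uint32_t replicate_nibble(uint32_t v4)
{
    return (v4 & 0xFu) * 0x11111111u;
}
```

    The copies never overlap as long as the multiplier's set bits are at least as far apart as the field is wide, so no carries run between them.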

    For operations on the top 16 bits of a word, remember that the MOVT instruction does not change the low 16 bits of a register.

    When speaking Cortex-M, the Cortex-M4 and Cortex-M7 are probably the best candidates.

    The Cortex-M4 supports multiplying 32 bits by 32 bits with a 64-bit result (UMULL/SMULL) in a single clock cycle.

    It also has a 64-bit multiply-and-accumulate, which would also be able to do some good in 64-bit field operations.
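    As a hedged illustration of how a long multiply can help with 64-bit fields: multiplying a 32-bit value by a power of two places it at an arbitrary position inside a 64-bit result, and the accumulate form merges it into an existing value. The names `mul32x32` and `mla32x32` are hypothetical; a compiler targeting Cortex-M3/M4 would typically turn these into UMULL and UMLAL, but that depends on the compiler and options.

```c
#include <stdint.h>

/* 32 x 32 -> 64-bit unsigned multiply (the UMULL pattern). */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* 64-bit accumulate of a 32 x 32 product (the UMLAL pattern). */
static uint64_t mla32x32(uint64_t acc, uint32_t a, uint32_t b)
{
    return acc + (uint64_t)a * b;
}
```

    For example, `mul32x32(field, 1u << 4)` moves a 32-bit field up by four bit positions into a 64-bit result, including the bits that cross the 32-bit boundary.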

    The Cortex-M7 will generally be faster per MHz than the Cortex-M4, because its dual-issue pipeline can read/write memory in parallel with modifying data.

    If you can reveal what kind of job you're up to, it might be easier to come up with suggestions.

    If you're building a bit-buffer (eg. a cache) - for instance for compression/decompression, let me know.

  • Hi Aclassifier,


    if you use a C compiler you can handle up to 64bit width fields by using the long long integer type.
    For example, please look at the following code. If you use assembly language, you have many options, as was mentioned. The limitation of the bit-field instructions comes from the fact that they operate on one 32-bit register at a time.

    [source code]

    union {
    char a[10];
    long long b:64;
    } X;

    main()
    {
       X.b=0x0123456789abcdefLL;
    }

    [disassembled code]

    00000000 <main>:
       0:   b510            push    {r4, lr}
       2:   4a03            ldr     r2, [pc, #12]   (10 <main+0x10>)
       4:   4b03            ldr     r3, [pc, #12]   (14 <main+0x14>)
       6:   4c04            ldr     r4, [pc, #16]   (18 <main+0x18>)
       8:   6013            str     r3, [r2, #0]
       a:   6054            str     r4, [r2, #4]
       c:   bd10            pop     {r4, pc}
       e:   46c0            nop                     (mov r8, r8)
      10:   00000000        .word   0x00000000
      14:   89abcdef        .word   0x89abcdef
      18:   01234567        .word   0x01234567

    Best regards,
    Yasuhiko Koumoto.

  • Thank you, guys! Your answers have been very helpful! I need to learn more than I think I need to know.

    I have a CSP-type channel based scheduler (Publication details by Øyvind Teig) where signalling on a channel is done by setting a bit in a bitfield. Right now I have 39 channels (synchronous with data, asynch without data=signal and finally timeout signals).

    Also, the selective choice (ALT) implementation for each CSP process needs a bitfield that bit-by-bit matches the channel bitfield. This holds the set of channels that's present in the ALT set and then contains a mask that's used to clear all those bits when one guard of the ALT is taken.
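    One way to picture the channel bitfield and the ALT mask in C, purely as an illustrative sketch: `chan_set_t`, `CHAN_BIT`, `chan_signal` and `alt_take` are hypothetical names, and picking the lowest-numbered ready channel is an assumption made for the example, not a description of the actual scheduler.

```c
#include <stdint.h>

typedef uint64_t chan_set_t;               /* one bit per channel; 39 channels fit in 64 bits */

#define CHAN_BIT(n) ((chan_set_t)1 << (n)) /* n must be 0..63 */

/* Signal on a channel by setting its bit in the ready set. */
static void chan_signal(chan_set_t *ready, unsigned chan)
{
    *ready |= CHAN_BIT(chan);
}

/* If any channel in the ALT set is ready, return the lowest such channel and
   clear every bit of the ALT set in one masking operation; otherwise return -1. */
static int alt_take(chan_set_t *ready, chan_set_t alt_mask)
{
    chan_set_t hits = *ready & alt_mask;
    int chan = 0;

    if (hits == 0)
        return -1;
    while (((hits >> chan) & 1u) == 0)
        chan++;
    *ready &= ~alt_mask;
    return chan;
}
```

    On a 32-bit ARM the compiler splits each 64-bit AND/OR into two 32-bit operations, so the mask test and the clear cost a handful of instructions regardless of which bits are involved.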

    With an 8 bit processor I have used byte_8, int_16, long_32 or long_long_64 (all used as unsigned), automatically handled with width dependent macros. For single bit handling there are several combinations of setting, testing and clearing with dynamic index and several with constant bit index. Then there is masking with dynamic or constant mask. Then our compiler on some of these cases shoots directly on the bit, which I have studied, and for some cases a small dynamic bit handling library was written. And some times it takes all 8 bytes in, clears one bit of them and writes all 8 bytes back!-(

    When recompiling this system for the ARM I am sure there would be special cases too. What I learn from you is that I should disregard byte_8 and int_16 (with 39 channels those cases wouldn't have been seen anyhow). I have not done any assembly coding for this (sorry, I forgot to tell), so I would basically rely on the compiler. Also I think I have learned that there would be differences with regard to processors.

    None of you triggered on the mask(?)-array that I think was present on the STR91xF ARM9?

    May I ask what your gut feeling on M0+ vs M3/M4 architectures would be?

    Best regards

    Øyvind Teig, Trondheim, Norway

  • Unfortunately, I've not worked with ARM9, so I do not know the mask-array feature.

    (I'll admit that it took me a while to find out that CSP is an abbreviation of Communicating Sequential Processes)

    I was thinking a bit about masks in GPIO-registers, but for some reason, I did not mention them.

    Many of NXP's microcontrollers allow you to set a mask for the GPIO pins. I mention these because some of them support 32-pin (32-bit) GPIO ports. I do not know whether or not this is useful; however, in addition to this mask, the GPIO ports also have atomic-access set and clear registers (some allow for toggling as well). So far, I believe NXP's LPC175x-LPC178x, LPC18xx, LPC43xx and LPC541xx have the quickest I/O ports that support 32 pins per port.

    You might not need to use any pins on the microcontroller, but you could still use these registers as '32-bit RAM'. As far as I know, Microchip also makes microcontrollers that support 32-pin (32-bit) ports.

    Regarding using the Cortex-M0; if you need real fast access, then the Cortex-M0 might be too limited.

    By now, you probably know that ...

    • The Cortex-M0 and Cortex-M0+ instruction sets consist almost entirely of 16-bit Thumb instructions (only a handful, such as BL, are 32 bits wide).
    • The Cortex-M3 has all the Cortex-M0/Cortex-M0+ instructions, plus a bunch of extra instructions.
    • The Cortex-M4 has all the Cortex-M3 instructions, plus some neat DSP functions.
    • The Cortex-M4F (with floating point unit) has all the Cortex-M4 instructions + 32-bit floating point instructions.
    • The Cortex-M7 has all the Cortex-M4 instructions + 64-bit floating point.

    In addition, the Cortex-M7 is basically 1.63 times as fast per MHz as the Cortex-M4 (my estimation).

    If you code in assembly language, you might be able to get a performance that's twice as fast per MHz as the same code running on the Cortex-M4.

    Some of the Cortex-M4 and Cortex-M7 DSP instructions might be interesting for you as well. The UXTAB/UXTAH and SXTAB/SXTAH instructions can extract an 8- or 16-bit value from one register, zero- or sign-extend it, and add it to another register. The operation can include rotating the source register first (by 0, 8, 16 or 24 bits).
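    In C terms, the byte variants of these extract-and-add instructions behave roughly as sketched below. The functions `uxtab` and `sxtab` are plain C models written for this illustration (they are not intrinsics), and `rot` is assumed to be 0, 8, 16 or 24 as in the instruction encoding.

```c
#include <stdint.h>

/* Rotate right by n bits (n is reduced modulo 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* Model of UXTAB: rotate rm, take the low byte, zero-extend, add to rn. */
static uint32_t uxtab(uint32_t rn, uint32_t rm, unsigned rot)
{
    return rn + (ror32(rm, rot) & 0xFFu);
}

/* Model of SXTAB: as above, but the byte is sign-extended before the add. */
static uint32_t sxtab(uint32_t rn, uint32_t rm, unsigned rot)
{
    uint32_t b = ror32(rm, rot) & 0xFFu;
    return rn + ((b ^ 0x80u) - 0x80u);  /* sign-extend 8 -> 32 bits in unsigned arithmetic */
}
```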

    Even though the Cortex-M0 only has a 16-bit instruction set, it's still able to work on 32-bit integers, but since the instruction set does not allow for the same barrel-shifter tricks and conditional instruction execution, the code will be larger and slower.

    However, some Cortex-M0/Cortex-M0+ based microcontrollers include Bit-Banding. Bit-Banding is an optional feature that the vendors may include if they wish. It is particularly useful when the microcontroller has more than a single core (for instance a Cortex-M4 + a Cortex-M0 core), as Bit-Banding allows for atomic bit operations.

  • aclassifier wrote:

    In a thumb2 ref card I found that "Width of bitfield. <width> + <lsb> must be <= 32."

    This means that the bit-field instructions on Cortex-M devices can not handle bit operations larger than 32 bit.

    It also means that you can't 'wrap' bit fields, so the following...

                        ubfx                r0,r1,#24,#16

    ... will not be valid. You would probably expect (like I did) that the above instruction would copy the top 8 bits of r1 to the bottom 8 bits of r0 and the bottom 8 bits of r1 to bits 15:8 of r0. But the instruction is invalid, because that combination of start and length parameters (24 + 16 > 32) has no encoding, thus there is no opcode available for an instruction with those parameters.

    It is possible to do the operation using a different approach, though:

                        movw                r7,#0xffff

                        ands                r0,r7,r1,ror#24

    And of course, it would pay to reuse r7 if more bit fields of this kind need to be extracted.

    The above example uses the barrel-shifter on the second operand; this form is present on Cortex-M3 and later, but isn't available on Cortex-M0, whose instruction set only has separate shift and rotate instructions (due to the limited number of opcodes).
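    For readers working in C rather than assembly, here is a small model of both cases: a plain UBFX-style extract, which requires lsb + width <= 32, and the wrapped extract done with a rotate and a mask. The function names `ubfx`, `ror32` and `wrapped_field16` are chosen for this sketch only.

```c
#include <stdint.h>

/* Model of UBFX rd, rn, #lsb, #width.
   Only meaningful when width >= 1 and lsb + width <= 32. */
static uint32_t ubfx(uint32_t x, unsigned lsb, unsigned width)
{
    uint32_t mask = (width >= 32u) ? 0xFFFFFFFFu : (((uint32_t)1 << width) - 1u);
    return (x >> lsb) & mask;
}

/* Rotate right by n bits (n is reduced modulo 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* The wrapped 16-bit field starting at bit 24: bits 31:24 land in bits 7:0
   and bits 7:0 land in bits 15:8, like the MOVW / ANDS ... ROR #24 pair. */
static uint32_t wrapped_field16(uint32_t x)
{
    return ror32(x, 24) & 0xFFFFu;
}
```

    For 0x12345678 the wrapped extract yields 0x7812: the top byte 0x12 ends up in the low byte and the bottom byte 0x78 in bits 15:8.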

  • Hi aclassifier,

    my gut feeling regarding M0+ vs M3/M4 is that the M0+ is compact and low power. Compared with the M0, the M0+ has higher performance because of its shorter (2-stage) pipeline and the single-cycle I/O port (used for GPIO). As for bit manipulation, MCU vendors may add their own supplements to a chip. For example, the Kinetis L series (whose CPU is the M0+) has the BME (Bit Manipulation Engine). The BME can handle both single bits and bit fields. Although I am not an agent of Freescale, I have a good impression of the Kinetis L series.


    Best regards,
    Yasuhiko Koumoto.