This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Compacting 4x 24 bit values into 3x 32 bits

Hello,

I have application running on an Cortex M3 MCU. The MCU runs a FIR filter which results in arrays of 32 bit values, but only the 24 least significant bits are used. Before storing this data to memory I want to compact four 24 bit words into three 32 bit words. I've written some inline assembly to do this. Its is very crucial that this operation goes as fast as possible. Unfortunately I am not an expert in this area. For example, I am wondering is a single LDM and STM is really the fastest option. Whether or not I can save a few operations, or possibly use 16 bit instructions instead of 32 bits. Or use move combined with shifts. Here is the C function (a more detailed instruction of what it actually does (/is supposed to do) follows at the end of this post).

static void compact(register q31_t * restrict in, register uint32_t * restrict  out)
{
    const q31_t * stop = in + FIR_OUT_BLOCK;
    register uint32_t t1, t2, t3, t4;

    while (in < stop)
    {
        __asm__ (
            "ldmia   %[in]! ,{%[t1], %[t2], %[t3], %[t4]} ;"     // 1
            "bfi     %[t4]  , %[t3], 24, 8                ;"     // 2
            "lsr     %[t3]  , %[t3], 8                    ;"     // 3
            "bfi     %[t3]  , %[t2], 16, 16               ;"     // 4
            "lsr     %[t2]  , %[t2], 16                   ;"     // 5
            "bfi     %[t2]  , %[t1], 8, 24                ;"     // 6
            "stmia   %[out]!,{%[t2], %[t3], %[t4]}        ;"     // 7
            : [in]"+r"(in),  [out]"+r"(out),    // input/output
              [t1]"=r"(t1),  [t2]"=r"(t2),      // temporary registers
              [t3]"=r"(t3),  [t4]"=r"(t4)       //
            :
            : "memory"
        );
    }
}





(I'm using GCC 4.7.)

My questions are whether or not the above code is close to optimal, and whether or not there are any errors in it. The provided array is always a multiple of 4 words. I've tested it with some test cases and it does seem to work, but any comments on the above function would be very much appreciated to help me with my understanding.

Regards,

Vincent

The function above converts four 32 bit words, having 24 bits of data
Right aligned into three 32 bit words. Meaning parts of some words
end up in other words.

Notation: <word number><position>.
e.g. 4H, means word no. 4, H -> highest byte of 24 bits,
      i.e. bits 16 to 23.

1) Starting condition: load 4 values into t1-t4 (registers)
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |3H:3M:3L| |  |4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

2) Bitfield insert LSByte t3 into t4
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |3H:3M:3L| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

3) Right shift by 8 bits t3
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |  |3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

4) Bitfield insert two LSBytes of t2 into 2 MSBytes of t3
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

5) Right shift by 16 bits t2
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |  |  |2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

6) Bitfield insert
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |1H:1M:1L|2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

7) Save back to memory, only t2, t3 and t4
                  +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
                  |1H:1M:1L|2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
                  +--+--+--+--+ +--+--+--+--+ +--+--+--+--+





Parents
  • As far as I understand, you should be able to get it one clock cycle faster by replacing stmia by multiple str instructions; the Cortex-M3 timing docs at ARM Information Center says:

    STR Rx,[Ry,#imm] is always one cycle.

    So since STMIA is 1 + N clock cycles, where N is the number of words stored, you should be able to use STR and gain one clock cycle.

    So first step in optimization would be...

                        ldmia               r0!,{r1-r4}                 /* [5] read four 32-bit words into r1-r4 */

                        bfi                 r4,r3,#24,#8                /* [1] insert bottom 8 bits from r3 into top 8 bits of r4 */

                        lsr                 r3,r3,#8                    /* [1] remove those bits from r3 and make room for 16 bits */

                        bfi                 r3,r2,#16,#16               /* [1] insert bottom 16 bits from r2 into top 16 bits of r3 */

                        lsr                 r2,r2,#16                   /* [1] remove those bits from r2 and make room for 24 bits */

                        bfi                 r2,r1,#8,#24                /* [1] insert the low 24 bits from r1 into the top 24 bits of r2 */

                        stmia               r5!,{r1-r3}                 /* [4] store three 32-bit words and advance r5 */

                                                                        /* (14 clock cycles for 12 bytes = 1.667 clock cycles per byte) */

    ...Unfortunately, after some testing, it turns out that the bfi instruction cannot cross the boundaries of a register.

    That means the following code will not be accepted by your assembler:

                        ldmia               r0!,{r1-r4}                 /* [5] */

                        bfi                 r1,r2,#24,#8                /* [1] */

                        bfi                 r2,r3,#24,#16               /* [1] */

                        bfi                 r3,r4,#24,#24               /* [1] */

                        str                 r1,[r5],#4                  /* [1] */

                        str                 r2,[r5],#4                  /* [1] */

                        str                 r3,[r5],#4                  /* [1] */

                                                                        /* (11 clock cycles for 12 bytes = 0.9167 clock cycles per byte) */

    So we'll need a different approach. If you know that the top 8 bits of the registers are always zero, I think it can be done this way...

                        ldr                 r7,=0xffff0000              /* [2] before we begin, pre-load r7 with our AND-mask */

                        ...

                        ...

                        .rept               10                          /*    the code-block is repeated, in order to "unroll" it. */

                        ldmia               r0!,{r2-r5}                 /* [5] read four 32-bit words */

                        bfi                 r2,r3,#24,#8                /* [1] r2 = $bbAAAAaa */

                        bfi                 r4,r5,#24,#8                /* [1] r4 = $ddCCCCcc */

                        ands                r3,r7,r3,lsl#8              /* [1] r3 = $BBBB0000 */

                        orrs                r3,r3,r5,lsr#8              /* [1] r3 = $BBBBDDDD */

                        str                 r2,[r1],#4                  /* [1] store word 1 */

                        str                 r3,[r1],#4                  /* [1] store word 2 */

                        str                 r4,[r1],#4                  /* [1] store word 3 */

                        .endr

    The above code is tested and seems to be working. Note: if you can store the bytes back at the same location, you can change ldmia to ldm and then save r1 for other things.

    Of course it's a good idea to place a compare instruction at the end of the repeated code-block and then a conditional branch, to see if you've reached the end of the block you're converting.

    It would also pay to save all registers before you start, and then use as many registers as possible, repeat the code-block, and finish up by restoring the registers.

Reply
  • As far as I understand, you should be able to get it one clock cycle faster by replacing stmia by multiple str instructions; the Cortex-M3 timing docs at ARM Information Center says:

    STR Rx,[Ry,#imm] is always one cycle.

    So since STMIA is 1 + N clock cycles, where N is the number of words stored, you should be able to use STR and gain one clock cycle.

    So first step in optimization would be...

                        ldmia               r0!,{r1-r4}                 /* [5] read four 32-bit words into r1-r4 */

                        bfi                 r4,r3,#24,#8                /* [1] insert bottom 8 bits from r3 into top 8 bits of r4 */

                        lsr                 r3,r3,#8                    /* [1] remove those bits from r3 and make room for 16 bits */

                        bfi                 r3,r2,#16,#16               /* [1] insert bottom 16 bits from r2 into top 16 bits of r3 */

                        lsr                 r2,r2,#16                   /* [1] remove those bits from r2 and make room for 24 bits */

                        bfi                 r2,r1,#8,#24                /* [1] insert the low 24 bits from r1 into the top 24 bits of r2 */

                        stmia               r5!,{r1-r3}                 /* [4] store three 32-bit words and advance r5 */

                                                                        /* (14 clock cycles for 12 bytes = 1.667 clock cycles per byte) */

    ...Unfortunately, after some testing, it turns out that the bfi instruction cannot cross the boundaries of a register.

    That means the following code will not be accepted by your assembler:

                        ldmia               r0!,{r1-r4}                 /* [5] */

                        bfi                 r1,r2,#24,#8                /* [1] */

                        bfi                 r2,r3,#24,#16               /* [1] */

                        bfi                 r3,r4,#24,#24               /* [1] */

                        str                 r1,[r5],#4                  /* [1] */

                        str                 r2,[r5],#4                  /* [1] */

                        str                 r3,[r5],#4                  /* [1] */

                                                                        /* (11 clock cycles for 12 bytes = 0.9167 clock cycles per byte) */

    So we'll need a different approach. If you know that the top 8 bits of the registers are always zero, I think it can be done this way...

                        ldr                 r7,=0xffff0000              /* [2] before we begin, pre-load r7 with our AND-mask */

                        ...

                        ...

                        .rept               10                          /*    the code-block is repeated, in order to "unroll" it. */

                        ldmia               r0!,{r2-r5}                 /* [5] read four 32-bit words */

                        bfi                 r2,r3,#24,#8                /* [1] r2 = $bbAAAAaa */

                        bfi                 r4,r5,#24,#8                /* [1] r4 = $ddCCCCcc */

                        ands                r3,r7,r3,lsl#8              /* [1] r3 = $BBBB0000 */

                        orrs                r3,r3,r5,lsr#8              /* [1] r3 = $BBBBDDDD */

                        str                 r2,[r1],#4                  /* [1] store word 1 */

                        str                 r3,[r1],#4                  /* [1] store word 2 */

                        str                 r4,[r1],#4                  /* [1] store word 3 */

                        .endr

    The above code is tested and seems to be working. Note: if you can store the bytes back at the same location, you can change ldmia to ldm and then save r1 for other things.

    Of course it's a good idea to place a compare instruction at the end of the repeated code-block and then a conditional branch, to see if you've reached the end of the block you're converting.

    It would also pay to save all registers before you start, and then use as many registers as possible, repeat the code-block, and finish up by restoring the registers.

Children