This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Compacting 4x 24 bit values into 3x 32 bits

Hello,

I have application running on an Cortex M3 MCU. The MCU runs a FIR filter which results in arrays of 32 bit values, but only the 24 least significant bits are used. Before storing this data to memory I want to compact four 24 bit words into three 32 bit words. I've written some inline assembly to do this. Its is very crucial that this operation goes as fast as possible. Unfortunately I am not an expert in this area. For example, I am wondering is a single LDM and STM is really the fastest option. Whether or not I can save a few operations, or possibly use 16 bit instructions instead of 32 bits. Or use move combined with shifts. Here is the C function (a more detailed instruction of what it actually does (/is supposed to do) follows at the end of this post).

static void compact(register q31_t * restrict in, register uint32_t * restrict  out)
{
    const q31_t * stop = in + FIR_OUT_BLOCK;
    register uint32_t t1, t2, t3, t4;

    while (in < stop)
    {
        __asm__ (
            "ldmia   %[in]! ,{%[t1], %[t2], %[t3], %[t4]} ;"     // 1
            "bfi     %[t4]  , %[t3], 24, 8                ;"     // 2
            "lsr     %[t3]  , %[t3], 8                    ;"     // 3
            "bfi     %[t3]  , %[t2], 16, 16               ;"     // 4
            "lsr     %[t2]  , %[t2], 16                   ;"     // 5
            "bfi     %[t2]  , %[t1], 8, 24                ;"     // 6
            "stmia   %[out]!,{%[t2], %[t3], %[t4]}        ;"     // 7
            : [in]"+r"(in),  [out]"+r"(out),    // input/output
              [t1]"=r"(t1),  [t2]"=r"(t2),      // temporary registers
              [t3]"=r"(t3),  [t4]"=r"(t4)       //
            :
            : "memory"
        );
    }
}





(I'm using GCC 4.7.)

My questions are whether or not the above code is close to optimal, and whether or not there are any errors in it. The provided array is always a multiple of 4 words. I've tested it with some test cases and it does seem to work, but any comments on the above function would be very much appreciated to help me with my understanding.

Regards,

Vincent

The function above converts four 32 bit words, having 24 bits of data
Right aligned into three 32 bit words. Meaning parts of some words
end up in other words.

Notation: <word number><position>.
e.g. 4H, means word no. 4, H -> highest byte of 24 bits,
      i.e. bits 16 to 23.

1) Starting condition: load 4 values into t1-t4 (registers)
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |3H:3M:3L| |  |4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

2) Bitfield insert LSByte t3 into t4
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |3H:3M:3L| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

3) Right shift by 8 bits t3
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |  |3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

4) Bitfield insert two LSBytes of t2 into 2 MSBytes of t3
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

5) Right shift by 16 bits t2
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |  |  |2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

6) Bitfield insert
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |1H:1M:1L|2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

7) Save back to memory, only t2, t3 and t4
                  +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
                  |1H:1M:1L|2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
                  +--+--+--+--+ +--+--+--+--+ +--+--+--+--+