This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Compacting 4x 24 bit values into 3x 32 bits

Hello,

I have application running on an Cortex M3 MCU. The MCU runs a FIR filter which results in arrays of 32 bit values, but only the 24 least significant bits are used. Before storing this data to memory I want to compact four 24 bit words into three 32 bit words. I've written some inline assembly to do this. Its is very crucial that this operation goes as fast as possible. Unfortunately I am not an expert in this area. For example, I am wondering is a single LDM and STM is really the fastest option. Whether or not I can save a few operations, or possibly use 16 bit instructions instead of 32 bits. Or use move combined with shifts. Here is the C function (a more detailed instruction of what it actually does (/is supposed to do) follows at the end of this post).

static void compact(register q31_t * restrict in, register uint32_t * restrict  out)
{
    const q31_t * stop = in + FIR_OUT_BLOCK;
    register uint32_t t1, t2, t3, t4;

    while (in < stop)
    {
        __asm__ (
            "ldmia   %[in]! ,{%[t1], %[t2], %[t3], %[t4]} ;"     // 1
            "bfi     %[t4]  , %[t3], 24, 8                ;"     // 2
            "lsr     %[t3]  , %[t3], 8                    ;"     // 3
            "bfi     %[t3]  , %[t2], 16, 16               ;"     // 4
            "lsr     %[t2]  , %[t2], 16                   ;"     // 5
            "bfi     %[t2]  , %[t1], 8, 24                ;"     // 6
            "stmia   %[out]!,{%[t2], %[t3], %[t4]}        ;"     // 7
            : [in]"+r"(in),  [out]"+r"(out),    // input/output
              [t1]"=r"(t1),  [t2]"=r"(t2),      // temporary registers
              [t3]"=r"(t3),  [t4]"=r"(t4)       //
            :
            : "memory"
        );
    }
}





(I'm using GCC 4.7.)

My questions are whether or not the above code is close to optimal, and whether or not there are any errors in it. The provided array is always a multiple of 4 words. I've tested it with some test cases and it does seem to work, but any comments on the above function would be very much appreciated to help me with my understanding.

Regards,

Vincent

The function above converts four 32 bit words, having 24 bits of data
Right aligned into three 32 bit words. Meaning parts of some words
end up in other words.

Notation: <word number><position>.
e.g. 4H, means word no. 4, H -> highest byte of 24 bits,
      i.e. bits 16 to 23.

1) Starting condition: load 4 values into t1-t4 (registers)
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |3H:3M:3L| |  |4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

2) Bitfield insert LSByte t3 into t4
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |3H:3M:3L| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

3) Right shift by 8 bits t3
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |  |  |3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

4) Bitfield insert two LSBytes of t2 into 2 MSBytes of t3
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |2H:2M:2L| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

5) Right shift by 16 bits t2
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |  |  |  |2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

6) Bitfield insert
    +t1+--+--+--+ +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
    |  |1H:1M:1L| |1H:1M:1L|2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
    +--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

7) Save back to memory, only t2, t3 and t4
                  +t2+--+--+--+ +t3+--+--+--+ +t4+--+--+--+
                  |1H:1M:1L|2H| |2M:2L|3H:3M| |3L|4H:4M:4L|
                  +--+--+--+--+ +--+--+--+--+ +--+--+--+--+





Parents
  • Hi Vincent.

    Uhm, you can reduce the number of LLIs to a lot fewer. This is complicated, because it will require the LLIs to use tables and modify themselves.

    I've done this already (not for byte-stuffing, but for a demonstration on the NXP LPC1768).

    The attachment shows how I use the DMA to blink a LED. Not impressive in itself, but what's special is that I use a table in order to change speed.

    You could use tables in the same way to swap those bytes around (and even place them in a nice order).

    The DMA is quite fast, and it can work side-by-side with the CPU, but make sure that you wait until the DMA is finished, before a new FIR buffer is written.

    What you could do is that you send your FIR output to two different buffers, then while the FIR is writing to the current buffer, the DMA is fixing up the other buffer, which was written last time. You probably already know this; it's called double-buffering and is used both in preparing Audio and Video for output - but can also be applied here.

    About the stack pointer, yes, you have the option to use either the Main Stack Pointer (default) or the Process Stack Pointer. You need to use MSR to set up the PSP.

    > "The one register too short is too bad. I don't understand what you mean by squeeze register usage. You mean I could always load the last value after the first pack when I unroll by 3?"

    Sorry, I've been a bit unclear on that. What I mean is that you load for instance 10 registers, then process the first 8 registers, process the 9th register and then load 9 more registers, while keeping the 10th register. This might save you 1/3 clock cycle per block. Sometimes you can save some, sometimes it's just the same, so you'll need to create a good overview before you start.

    You have another alternative, though. The mask I was speaking about, can be loaded right after processing the very first register, as  you would have a register free.

    Unfortunately, most likely loading the mask will take 2 clock cycles, so it will only pay if you can do 3 chunks at a time.

    > "I must say your code is very efficient, at the point that I should start looking at improvements at other parts of my code. I don't think I can output the samples left aligned, since that would mean I would need to increase my multiplication, causing overflows. If I could, how would that help things?"

    Thank you; I think your original code is quite good as well; much better than what most implementations would look like.

    -My packing will only work correctly, if you are sure that your code delivers the upper 8 bits as zero always. If you know it does this, no problem, there's no need to change your FIR code.

    The instruction...

                        orrs                r3,r3,r5,lsr#8              /* [1] r3 = $BBBBDDDD */

    ...requires that the top 8 bits of r5 is zero, because it's mixing using bitwise OR. This is the only place I could not find a safer way to mix the two registers.

    So if r5 holds $xxDDDDdd, then the 'xx' would overwrite the top/mid byte of r3, which means the result would be $BBxxDDDD - we do not want that to happen.

    I imagine this would only be happening if your FIR results are signed integers, so that they would sometimes store $FF in the top byte and sometimes $00.

    But as I do not know the FIR code, I can't do any checking. You've probably already verified how it behaves regarding this.

    But if not, you could left-shift one of your factors by 8, then the overflow would not matter at all. Those 8 bits would just be shifted out (discarded). Your byte-stuffing would be slightly different, but I think it could still pack using 4 instructions.

    Pre-shifting them left would mean that you would be sure that the low 8 bits are zero. That means for packing, you can use only 4 instructions (thus 4 clock cycles) for packing instead of 5 instructions (and 5 clock cycles).

    I find questions like this one particularly interesting, puzzling things around so they fit exactly is probably the most fun way of optimizing.

Reply
  • Hi Vincent.

    Uhm, you can reduce the number of LLIs to a lot fewer. This is complicated, because it will require the LLIs to use tables and modify themselves.

    I've done this already (not for byte-stuffing, but for a demonstration on the NXP LPC1768).

    The attachment shows how I use the DMA to blink a LED. Not impressive in itself, but what's special is that I use a table in order to change speed.

    You could use tables in the same way to swap those bytes around (and even place them in a nice order).

    The DMA is quite fast, and it can work side-by-side with the CPU, but make sure that you wait until the DMA is finished, before a new FIR buffer is written.

    What you could do is that you send your FIR output to two different buffers, then while the FIR is writing to the current buffer, the DMA is fixing up the other buffer, which was written last time. You probably already know this; it's called double-buffering and is used both in preparing Audio and Video for output - but can also be applied here.

    About the stack pointer, yes, you have the option to use either the Main Stack Pointer (default) or the Process Stack Pointer. You need to use MSR to set up the PSP.

    > "The one register too short is too bad. I don't understand what you mean by squeeze register usage. You mean I could always load the last value after the first pack when I unroll by 3?"

    Sorry, I've been a bit unclear on that. What I mean is that you load for instance 10 registers, then process the first 8 registers, process the 9th register and then load 9 more registers, while keeping the 10th register. This might save you 1/3 clock cycle per block. Sometimes you can save some, sometimes it's just the same, so you'll need to create a good overview before you start.

    You have another alternative, though. The mask I was speaking about, can be loaded right after processing the very first register, as  you would have a register free.

    Unfortunately, most likely loading the mask will take 2 clock cycles, so it will only pay if you can do 3 chunks at a time.

    > "I must say your code is very efficient, at the point that I should start looking at improvements at other parts of my code. I don't think I can output the samples left aligned, since that would mean I would need to increase my multiplication, causing overflows. If I could, how would that help things?"

    Thank you; I think your original code is quite good as well; much better than what most implementations would look like.

    -My packing will only work correctly, if you are sure that your code delivers the upper 8 bits as zero always. If you know it does this, no problem, there's no need to change your FIR code.

    The instruction...

                        orrs                r3,r3,r5,lsr#8              /* [1] r3 = $BBBBDDDD */

    ...requires that the top 8 bits of r5 is zero, because it's mixing using bitwise OR. This is the only place I could not find a safer way to mix the two registers.

    So if r5 holds $xxDDDDdd, then the 'xx' would overwrite the top/mid byte of r3, which means the result would be $BBxxDDDD - we do not want that to happen.

    I imagine this would only be happening if your FIR results are signed integers, so that they would sometimes store $FF in the top byte and sometimes $00.

    But as I do not know the FIR code, I can't do any checking. You've probably already verified how it behaves regarding this.

    But if not, you could left-shift one of your factors by 8, then the overflow would not matter at all. Those 8 bits would just be shifted out (discarded). Your byte-stuffing would be slightly different, but I think it could still pack using 4 instructions.

    Pre-shifting them left would mean that you would be sure that the low 8 bits are zero. That means for packing, you can use only 4 instructions (thus 4 clock cycles) for packing instead of 5 instructions (and 5 clock cycles).

    I find questions like this one particularly interesting, puzzling things around so they fit exactly is probably the most fun way of optimizing.

Children
No data