Hello,
I have an application running on a Cortex-M3 MCU. The MCU runs a FIR filter which produces arrays of 32-bit values, of which only the 24 least significant bits are used. Before storing this data to memory I want to compact four 24-bit words into three 32-bit words. I've written some inline assembly to do this. It is crucial that this operation runs as fast as possible. Unfortunately I am not an expert in this area. For example, I am wondering whether a single LDM and STM is really the fastest option, whether I can save a few operations, whether I could use 16-bit instructions instead of 32-bit ones, or whether moves combined with shifts would be better. Here is the C function (a more detailed description of what it actually does (/is supposed to do) follows at the end of this post).
static void compact(register q31_t * restrict in, register uint32_t * restrict out)
{
    const q31_t * stop = in + FIR_OUT_BLOCK;
    register uint32_t t1, t2, t3, t4;

    while (in < stop) {
        __asm__ (
            "ldmia %[in]! ,{%[t1], %[t2], %[t3], %[t4]} ;"  // 1
            "bfi   %[t4] , %[t3], 24, 8                 ;"  // 2
            "lsr   %[t3] , %[t3], 8                     ;"  // 3
            "bfi   %[t3] , %[t2], 16, 16                ;"  // 4
            "lsr   %[t2] , %[t2], 16                    ;"  // 5
            "bfi   %[t2] , %[t1], 8, 24                 ;"  // 6
            "stmia %[out]!,{%[t2], %[t3], %[t4]}        ;"  // 7
            : [in]"+r"(in),  [out]"+r"(out),  // input/output
              [t1]"=r"(t1),  [t2]"=r"(t2),    // temporary registers
              [t3]"=r"(t3),  [t4]"=r"(t4)
            :
            : "memory"
        );
    }
}
(I'm using GCC 4.7.)
My questions are whether the above code is close to optimal, and whether there are any errors in it. The length of the provided array is always a multiple of 4 words. I've tested it with some test cases and it does seem to work, but any comments on the above function would be very much appreciated to help my understanding.
Regards,
Vincent
The function above converts four 32-bit words, each holding 24 bits of data right-aligned, into three 32-bit words. This means parts of some words end up in other words.

Notation: <word number><position>, e.g. 4H means word no. 4, H -> highest byte of the 24 bits, i.e. bits 16 to 23.

1) Starting condition: load 4 values into t1-t4 (registers)

+t1+--+--+--+  +t2+--+--+--+  +t3+--+--+--+  +t4+--+--+--+
|  |1H:1M:1L|  |  |2H:2M:2L|  |  |3H:3M:3L|  |  |4H:4M:4L|
+--+--+--+--+  +--+--+--+--+  +--+--+--+--+  +--+--+--+--+

2) Bitfield insert LSByte of t3 into t4

+t1+--+--+--+  +t2+--+--+--+  +t3+--+--+--+  +t4+--+--+--+
|  |1H:1M:1L|  |  |2H:2M:2L|  |  |3H:3M:3L|  |3L|4H:4M:4L|
+--+--+--+--+  +--+--+--+--+  +--+--+--+--+  +--+--+--+--+

3) Right shift t3 by 8 bits

+t1+--+--+--+  +t2+--+--+--+  +t3+--+--+--+  +t4+--+--+--+
|  |1H:1M:1L|  |  |2H:2M:2L|  |  |  |3H:3M|  |3L|4H:4M:4L|
+--+--+--+--+  +--+--+--+--+  +--+--+--+--+  +--+--+--+--+

4) Bitfield insert two LSBytes of t2 into 2 MSBytes of t3

+t1+--+--+--+  +t2+--+--+--+  +t3+--+--+--+  +t4+--+--+--+
|  |1H:1M:1L|  |  |2H:2M:2L|  |2M:2L|3H:3M|  |3L|4H:4M:4L|
+--+--+--+--+  +--+--+--+--+  +--+--+--+--+  +--+--+--+--+

5) Right shift t2 by 16 bits

+t1+--+--+--+  +t2+--+--+--+  +t3+--+--+--+  +t4+--+--+--+
|  |1H:1M:1L|  |  |  |  |2H|  |2M:2L|3H:3M|  |3L|4H:4M:4L|
+--+--+--+--+  +--+--+--+--+  +--+--+--+--+  +--+--+--+--+

6) Bitfield insert low 24 bits of t1 into the top 24 bits of t2

+t1+--+--+--+  +t2+--+--+--+  +t3+--+--+--+  +t4+--+--+--+
|  |1H:1M:1L|  |1H:1M:1L|2H|  |2M:2L|3H:3M|  |3L|4H:4M:4L|
+--+--+--+--+  +--+--+--+--+  +--+--+--+--+  +--+--+--+--+

7) Save back to memory, only t2, t3 and t4

+t2+--+--+--+  +t3+--+--+--+  +t4+--+--+--+
|1H:1M:1L|2H|  |2M:2L|3H:3M|  |3L|4H:4M:4L|
+--+--+--+--+  +--+--+--+--+  +--+--+--+--+
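For reference, the same packing can be written in plain C; this is just a sketch to document the transformation, not a performance suggestion, and the function name `pack4` is mine:

```c
#include <stdint.h>

/* Pack four right-aligned 24-bit values (in[0..3]) into three 32-bit
 * words (out[0..2]), in the same bit order as the inline assembly
 * above.  Assumes the top 8 bits of each input word are zero. */
static void pack4(const uint32_t in[4], uint32_t out[3])
{
    out[0] = (in[0] << 8)  | (in[1] >> 16);  /* 1H:1M:1L:2H */
    out[1] = (in[1] << 16) | (in[2] >> 8);   /* 2M:2L:3H:3M */
    out[2] = (in[2] << 24) |  in[3];         /* 3L:4H:4M:4L */
}
```

With inputs 0x00111213, 0x00212223, 0x00313233, 0x00414243 this yields 0x11121321, 0x22233132, 0x33414243, matching diagram step 7.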
Is it essential that the four values be laid out in bit order?
Without looking at it more closely, I'd bet that you could store the four values like this:
t1:7-0|t2, t1:15-8|t3, t1:23-16|t4
more quickly than the way you are storing them as:
t1:t2:23-16, t2:15-0|t3:23-8, t3:7-0|t4
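If I read the notation above correctly (this is my interpretation of Steve's `t1:7-0|t2` shorthand, so treat it as an assumption), the alternative layout spreads the three data bytes of t1 into the free top bytes of t2, t3 and t4:

```c
#include <stdint.h>

/* One possible reading of the alternative layout: byte i of t1 goes
 * into the top byte of output word i.  Inputs are right-aligned
 * 24-bit values with the top 8 bits zero. */
static void pack4_alt(const uint32_t in[4], uint32_t out[3])
{
    out[0] = ((in[0] & 0x0000FFu) << 24) | in[1];  /* t1:7-0   | t2 */
    out[1] = ((in[0] & 0x00FF00u) << 16) | in[2];  /* t1:15-8  | t3 */
    out[2] = ((in[0] & 0xFF0000u) << 8)  | in[3];  /* t1:23-16 | t4 */
}
```

The point is that the bytes never straddle a word boundary, which gives the instruction selection more freedom.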
Steve
As far as I understand, you should be able to get it one clock cycle faster by replacing stmia with multiple str instructions; the Cortex-M3 timing docs at the ARM Information Center say:
STR Rx,[Ry,#imm] is always one cycle.
So since STMIA is 1 + N clock cycles, where N is the number of words stored, you should be able to use STR and gain one clock cycle.
So first step in optimization would be...
ldmia r0!,{r1-r4} /* [5] read four 32-bit words into r1-r4 */
bfi r4,r3,#24,#8 /* [1] insert bottom 8 bits from r3 into top 8 bits of r4 */
lsr r3,r3,#8 /* [1] remove those bits from r3 and make room for 16 bits */
bfi r3,r2,#16,#16 /* [1] insert bottom 16 bits from r2 into top 16 bits of r3 */
lsr r2,r2,#16 /* [1] remove those bits from r2 and make room for 24 bits */
bfi r2,r1,#8,#24 /* [1] insert the low 24 bits from r1 into the top 24 bits of r2 */
stmia r5!,{r2-r4}       /* [4] store the three packed words and advance r5 */
/* (14 clock cycles for 12 bytes = 1.167 clock cycles per byte) */
...Unfortunately, after some testing, it turns out that the bfi instruction cannot cross the boundaries of a register (lsb + width must not exceed 32).
That means the following code will not be accepted by your assembler:
ldmia r0!,{r1-r4} /* [5] */
bfi r1,r2,#24,#8 /* [1] */
bfi r2,r3,#24,#16 /* [1] */
bfi r3,r4,#24,#24 /* [1] */
str r1,[r5],#4 /* [1] */
str r2,[r5],#4 /* [1] */
str r3,[r5],#4 /* [1] */
/* (11 clock cycles for 12 bytes = 0.9167 clock cycles per byte) */
So we'll need a different approach. If you know that the top 8 bits of the registers are always zero, I think it can be done this way...
ldr r7,=0xffff0000 /* [2] before we begin, pre-load r7 with our AND-mask */
...
.rept 10 /* the code-block is repeated, in order to "unroll" it. */
ldmia r0!,{r2-r5} /* [5] read four 32-bit words */
bfi r2,r3,#24,#8 /* [1] r2 = $bbAAAAaa */
bfi r4,r5,#24,#8 /* [1] r4 = $ddCCCCcc */
ands r3,r7,r3,lsl#8 /* [1] r3 = $BBBB0000 */
orrs r3,r3,r5,lsr#8 /* [1] r3 = $BBBBDDDD */
str r2,[r1],#4 /* [1] store word 1 */
str r3,[r1],#4 /* [1] store word 2 */
str r4,[r1],#4 /* [1] store word 3 */
.endr
The above code is tested and seems to be working. Note: if you can store the bytes back at the same location, you can change ldmia to ldm and then save r1 for other things.
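To make it easy to check against the register comments, here is what the four packing instructions compute, modelled in plain C (a sketch; the name `pack4_mask` and the a/b/c/d parameters are mine, and like the assembly it assumes the top 8 bits of each input are zero):

```c
#include <stdint.h>

/* C model of the 4-instruction packing above.  a..d are the four
 * right-aligned 24-bit words ($00AAAAaa .. $00DDDDdd); the outputs
 * follow the register comments: $bbAAAAaa, $BBBBDDDD, $ddCCCCcc. */
static void pack4_mask(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                       uint32_t out[3])
{
    out[0] = ((b & 0xFFu) << 24) | a;      /* bfi  r2,r3,#24,#8      */
    out[2] = ((d & 0xFFu) << 24) | c;      /* bfi  r4,r5,#24,#8      */
    out[1] = ((b << 8) & 0xFFFF0000u)      /* ands r3,r7,r3,lsl#8    */
           | (d >> 8);                     /* orrs r3,r3,r5,lsr#8    */
}
```

Note the OR in the last step: if the top byte of d is not zero, it leaks into the middle of out[1], which is the caveat discussed further down.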
Of course it's a good idea to place a compare instruction at the end of the repeated code-block and then a conditional branch, to see if you've reached the end of the block you're converting.
It would also pay to save all registers before you start, and then use as many registers as possible, repeat the code-block, and finish up by restoring the registers.
Deller is correct. I did not look at it in that way, but it indeed does not matter how the words are packed into the three registers. The actual decoding is done offline and can take as much time as is needed.
I will look into the unrolling and possible register re-use. Unfortunately I do not know exactly how many samples will be provided to the routine in the future. For now it is fixed at 100, so this block is executed 25 times. This would mean I would need to repeat it 5 times, or indeed use more registers: for example, one block loading 3 x 4 samples (-> 12 registers) and one block loading 2 x 4 samples. However, when using registers above r12 it could be that 32-bit instructions are used.
Thanks Deller and Jens for your insightful replies!
Do not worry about 16-bit or 32-bit instructions. This only affects the size, not the execution-speed (except that the IT instruction can be packed with a preceding 16-bit instruction, but you don't need to use the IT instruction in this case).
-So..
str r12,[r14],#4 /* uses 1 clock cycle, but the instruction is a 32-bit instruction */
Note: I only gave you a part of the optimizing story.
One of the easiest ways of getting fairly quick packing is:
1: Save r4-r11 and LR on the stack
2: point one register to a saveSP variable in the .bss area and store the stack pointer there. For instance:
ldr r3,=saveSP
str sp,[r3]
3: set up your source pointer in r0 and destination pointer in r1. Set up your and-mask in r2.
4: point SP to the ending address (eg. r0+(number_of_words_to_convert << 2))
5: loop:
6: .rept (many)
7: Load all registers using ldmia:
ldmia r0!,{r3-r12,lr}
8: pack the bytes using the 4 instruction method mentioned in the earlier post.
9: store using the STR instruction with update (eg. STR r3,[r1],#4)
10: .endr
11: cmp sp,r0 /* reached the end yet ? */
12: bne loop /* go back to step 5 */
13: restore stack:
ldr sp,[r3]
14: restore registers and return:
pop {r4-r11,pc}
(or if you have more work to do, before returning):
14: restore registers...
pop {r4-r11,lr}
Note: You'll be one register short, because of the AND-mask!
-So manually unroll 3 blocks and squeeze register-usage, so they use all the registers as efficiently as possible.
This should get you close to the best performance when using the CPU for packing.
Also try and see if you can modify the FIR code, in order to insert a free bit-shift, right before storing the value, so you'll shift the value up by 8 bits there.
-Or build the packing into the FIR right before storing the values, that would give you even better performance; you would save all the loading and storing then.
If your microcontroller has a DMA, you may be able to use Linked List Items (LLI) with memory-to-memory transfers, in order to pack the bytes even quicker.
Set up the DMA and linked lists when you initialize your microcontroller.
When you need to pack, start the DMA-channel that can do the packing (make sure you don't use the highest priority channel, or you'll stall the CPU-execution).
While the DMA is packing and shipping, the CPU will do the managing.
If you're storing the data on a SD/MMC card via SPI for instance, then you could perhaps instead get the DMA to write the data directly there while packing.
Hi Jens,
Thanks for your insightful answer. My DMA controller (I'm using an EFM32 Giant Gecko) has a scatter-gather mode, which in theory could probably do the same as the Linked List Items mode. However, it would require loading a DMA descriptor for every 3-byte transfer. Currently I transfer 32 bits at a time, directly to the NAND Flash controller, after packing. I'm not sure using the DMA controller would be worth it. For 100 samples, I would need to create 100 descriptors, each copying 3 bytes, and it would take 4 DMA cycles to load the descriptor per word. The entire process would take 700 DMA transfers.
The trick with exploiting the stack pointer as a general-purpose register is neat, but I will have some trouble if an interrupt happens during packing. However, I believe it is possible to use a separate stack pointer for interrupts; I will need to look into that. Of course I can also disable interrupts during this process, but there is one process which has real-time requirements. I will have to see whether this introduces an acceptable delay.
Being one register short is too bad. I don't understand what you mean by squeezing register usage. Do you mean I could always load the last value after the first pack when I unroll by 3?
I must say your code is very efficient, to the point that I should start looking at improvements in other parts of my code. I don't think I can output the samples left-aligned, since that would mean I would need to scale up my multiplication, causing overflows. If I could, how would that help things?
Anyway, I think your answers are very good, so I'll mark the last one as correct. Thank you for all the information!
Thanks!
Hi Vincent.
Uhm, you can reduce the number of LLIs to far fewer. This is complicated, because it requires the LLIs to use tables and modify themselves.
I've done this already (not for byte-stuffing, but for a demonstration on the NXP LPC1768).
The attachment shows how I use the DMA to blink an LED. Not impressive in itself, but what's special is that I use a table in order to change speed.
You could use tables in the same way to swap those bytes around (and even place them in a nice order).
The DMA is quite fast, and it can work side-by-side with the CPU, but make sure that you wait until the DMA is finished, before a new FIR buffer is written.
What you could do is that you send your FIR output to two different buffers, then while the FIR is writing to the current buffer, the DMA is fixing up the other buffer, which was written last time. You probably already know this; it's called double-buffering and is used both in preparing Audio and Video for output - but can also be applied here.
About the stack pointer, yes, you have the option to use either the Main Stack Pointer (default) or the Process Stack Pointer. You need to use MSR to set up the PSP.
> "The one register too short is too bad. I don't understand what you mean by squeeze register usage. You mean I could always load the last value after the first pack when I unroll by 3?"
Sorry, I've been a bit unclear on that. What I mean is that you load for instance 10 registers, then process the first 8 registers, process the 9th register and then load 9 more registers, while keeping the 10th register. This might save you 1/3 clock cycle per block. Sometimes you can save some, sometimes it's just the same, so you'll need to create a good overview before you start.
You have another alternative, though. The mask I was speaking about can be loaded right after processing the very first register, as you would have a register free.
Unfortunately, most likely loading the mask will take 2 clock cycles, so it will only pay if you can do 3 chunks at a time.
> "I must say your code is very efficient, at the point that I should start looking at improvements at other parts of my code. I don't think I can output the samples left aligned, since that would mean I would need to increase my multiplication, causing overflows. If I could, how would that help things?"
Thank you; I think your original code is quite good as well; much better than what most implementations would look like.
-My packing will only work correctly if you are sure that your code always delivers the upper 8 bits as zero. If you know it does, no problem, there's no need to change your FIR code.
The instruction orrs r3,r3,r5,lsr#8 requires that the top 8 bits of r5 are zero, because it mixes using bitwise OR. This is the only place where I could not find a safer way to mix the two registers.
So if r5 holds $xxDDDDdd, then the 'xx' would overwrite the top/mid byte of r3, which means the result would be $BBxxDDDD - we do not want that to happen.
I imagine this would only be happening if your FIR results are signed integers, so that they would sometimes store $FF in the top byte and sometimes $00.
But as I do not know the FIR code, I can't do any checking. You've probably already verified how it behaves regarding this.
But if not, you could left-shift one of your factors by 8; then the overflow would not matter at all, as those 8 bits would just be shifted out (discarded). Your byte-stuffing would be slightly different, but I think it could still pack using 4 instructions.
Pre-shifting left would mean you could be sure that the low 8 bits are zero. For packing, that means you can use only 4 instructions (thus 4 clock cycles) instead of 5 instructions (and 5 clock cycles).
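For what it's worth, with left-aligned inputs (each 24-bit value pre-shifted up by 8, low byte zero) the packing collapses nicely in C; this is only a sketch of one possibility, not necessarily the instruction sequence Jens has in mind, and `pack4_left` is my name for it:

```c
#include <stdint.h>

/* Packing for left-aligned inputs: each in[i] is the 24-bit value
 * shifted up by 8, so the low 8 bits are zero.  Each output is one
 * OR of two (shifted) values, which on ARM maps onto orr with a
 * shifted second operand. */
static void pack4_left(const uint32_t in[4], uint32_t out[3])
{
    out[0] =  in[0]        | (in[1] >> 24);  /* 1H:1M:1L:2H */
    out[1] = (in[1] << 8)  | (in[2] >> 16);  /* 2M:2L:3H:3M */
    out[2] = (in[2] << 16) | (in[3] >> 8);   /* 3L:4H:4M:4L */
}
```

Keeping everything in the same registers in-place may still cost an extra shift or two, so the real instruction count depends on register allocation.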
I find questions like this one particularly interesting; puzzling things around so they fit exactly is probably the most fun kind of optimizing.