Recently I spoke about a LZ4 decompression routine I converted from 6502 code into a Arm Cortex-M0 code.
For some reason, I could not find my decompression routine, so I decided to convert it again. The result is below; the routine is now tested, bugfixed and works.
I've kept it as Cortex-M0 code, even for Cortex-M3 and Cortex-M4. The code also works on Cortex-A, ARM7 and ARM9. The Cortex-M3/Cortex-M4 version can be improved speed-wise, at the expense of extra bytes.
Since the size of flash memory on most Cortex-M0 microcontrollers is quite small, it makes sense to use a compression method where the decompression routine is small as well. In addition to being small, the LZ4 decompression routine is quick.
The idea of using compression is to "expand" your code space, so that you keep your code compressed in flash memory, then unpack a routine in SRAM and execute it from there. Of course you can also decompress data; it all depends on your needs.
I've kept the code "fairly easy to read", though the code is slightly size-optimized - it still does not use any macros. This means you'll be able to speed-optimize it easily by for instance improving the block-copy loops.
LZ4 is not complicated. Basically a compression block consists of one token byte followed by two different blocks of data.
The first byte is the token byte. It consists of two nibbles (that's two 4-bit values).
The first nibble of the token [8] tells the decompressor how many bytes to copy from the literal data section.
The second nibble of the token [f] holds the size of the match data minus 4; this tells the decompressor how many bytes to repeat from the already uncompressed data.
If the literal length is 15 (maximum value), then more length data follows the token (immediately). Each byte is added until the byte value read differs from 255.
(literal data follows the literal length)
The same applies to the match data length; except from that the minimum match data length is 4; thus we'll need to add 4 to the found length when decompressing.
(match offset follows the match length. A match offset of zero means end of compressed data).
The literal data should be copied directly to the buffer holding the uncompressed data.
Match data is read from the last write-position of the uncompressed data minus the match offset; it's then copied to the end of the uncompressed data.
This is repeated until the match offset is zero.
Thus in the above example, the literal length is [8], which means we'll need to copy 8 bytes from the literal data section to the output.
The match length is [f] (15, the maximum value for a nibble) which means more length information follows the match offset.
The match offset is [01 00] (low byte first, then high byte, so the match offset is 1; this is how many bytes we step back from the end of the uncompressed buffer.
The complete match length is [f] + 4 + [07] = 1a (26), this is how many bytes we copy from the match offset to the end of the uncompressed buffer.
That means our decompressed data will be: [31 36 61 0a 20 00 20 02] [02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02]
-You see, the byte [02] is first copied to the end of the uncompressed buffer, the source pointer is then advanced and now points to the byte we just wrote [02].
The destination pointer is also advanced and points right after the byte we just wrote, thusote [02] will be repeated.
If the match offset had been 3, then the last 3 bytes [00 20 02] would be repeated instead, resulting in the following uncompressed data:
[31 36 61 0a 20 00 20 02] [00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20]
.syntax unified .cpu cortex-m0 .thumb /* License: Public Domain - I cannot be held responsible for what it does or does not do if you use it, whether it's modified or not. */ /* Entry point = unlz4. On entry: r0 = source, r1 = destination. The first two bytes of the source must contain the length of the compressed data. */ .func unlz4 .global unlz4,unlz4_len .type unlz4,%function .type unlz4_len,%function .thumb_func unlz4: ldrh r2,[r0] /* get length of compressed data */ adds r0,r0,#2 /* advance source pointer */ .thumb_func unlz4_len: push {r4-r6,lr} /* save r4, r5, r6 and return-address */ adds r5,r2,r0 /* point r5 to end of compressed data */ getToken: ldrb r6,[r0] /* get token */ adds r0,r0,#1 /* advance source pointer */ lsrs r4,r6,#4 /* get literal length, keep token in r6 */ beq getOffset /* jump forward if there are no literals */ bl getLength /* get length of literals */ movs r2,r0 /* point r2 to literals */ bl copyData /* copy literals (r2=src, r1=dst, r4=len) */ movs r0,r2 /* update source pointer */ getOffset: ldrb r3,[r0,#0] /* get match offset's low byte */ subs r2,r1,r3 /* subtract from destination; this will become the match position */ ldrb r3,[r0,#1] /* get match offset's high byte */ lsls r3,r3,#8 /* shift to high byte */ subs r2,r2,r3 /* subtract from match position */ adds r0,r0,#2 /* advance source pointer */ lsls r4,r6,#28 /* get rid of token's high 28 bits */ lsrs r4,r4,#28 /* move the 4 low bits back where they were */ bl getLength /* get length of match data */ adds r4,r4,#4 /* minimum match length is 4 bytes */ bl copyData /* copy match data (r2=src, r1=dst, r4=len) */ cmp r0,r5 /* check if we've reached the end of the compressed data */ blt getToken /* if not, go get the next token */ pop {r4-r6,pc} /* restore r4, r5 and r6, then return */ .thumb_func getLength: cmp r4,#0x0f /* if length is 15, then more length info follows */ bne gotLength /* jump forward if we have the complete length */ getLengthLoop: ldrb r3,[r0] /* read another byte */ adds r0,r0,#1 /* advance source pointer */ adds r4,r4,r3 /* add byte to length */ cmp r3,#0xff /* check if end reached */ beq getLengthLoop /* if not, go round loop */ gotLength: bx lr /* return */ .thumb_func copyData: rsbs r4,r4,#0 /* index = -length */ subs r2,r2,r4 /* point to end of source */ subs r1,r1,r4 /* point to end of destination */ copyDataLoop: ldrb r3,[r2,r4] /* read byte from source_end[-index] */ strb r3,[r1,r4] /* store byte in destination_end[-index] */ adds r4,r4,#1 /* increment index */ bne copyDataLoop /* keep going until index wraps to 0 */ bx lr /* return */ .size unlz4,.-unlz4 .endfunc /* 42 narrow instructions = 84 bytes */
Since the data is going to be stored as compressed data in your flash memory, there's no need for a compression routine in your firmware.
Instead, you can prepare the data using your computer, for instance by using Yann Collet's tools. I recommend using the -9 switch for best possible compression.
You need to know that the 'lz4' compression tool insert a number of bytes as a header; this means you'll have to remove the first N bytes from the compressed file. The number of bytes are usually 11; but it could vary. A number of bytes follow the compressed data as well (this may include a checksum and other data).
When the data is compressed, you can use the GNU assembler's .incbin directive for including it in your firmware; or you could make a small tool in Perl to generate a .c file containing the compressed data as hex numbers in a static const array.
You can either place the length as two bytes in the beginning of the compressed block (before the first token) and point r0 to this length - or you can load the length of the compressed data into r2 and call unlz4_len instead.
The hexdump tool can be used for determining the length of the header:
hexdump -ve'"" 16/1 "%02x " "\n"' -n 12 bi1.lz4
04 22 4d 18 64 40 a7 31 43 00 00 f7
The first 4 bytes (04 22 4d 18) identifies the file as a LZ4 archive.
The next byte (64) is a flag byte, and this is the byte that we're interested in. If it has bit 3 set, then the header is 8 bytes longer than usual.
After the flag byte is a block descriptor byte (40)
Then a header checksum byte follows (a7)
A 32-bit word in Little Endian format follows. This is the length of the compressed data (0x00004331 = 17201 bytes)
Above I've dumped 12 bytes. The 12th byte is the first token byte of the compressed data (f7)
-So we can make a small perl one-liner, which finds out the size of the header:
hlen=`perl -e 'my ($b,$i,$f,$j,$l);if(read(STDIN,$b,11)){ ($i,$f,$j,$l) = unpack("H8CA2V", $b); } print(("$i" eq "04224d18") ? (11+($f & 8)) . " count=$l" : 0,"\n");' < bi1.lz4`
... now the shell variable $hlen contains the result, which is 0 if the file is not a lz4 file, but if it's a lz4 file, hlen is 11 or 19 followed by a space and "count=" and the size of the compressed data.
You can use the command-line tool 'dd' to cut the binary file:
dd if=bi1.lz4 of=bi1.bin bs=1 iseek=$hlen
Here's a small perl one-liner to convert a binary file to a .c file (replace bi1.lz4 and mySource.c at the end:
perl -e 'print("static const uint8_t sMyCompressedData[] = {\n" ); my $b; while(my $l=read(STDIN, $b, 16)){ printf("\t"); $b =~ s/(.)/printf("0x%02x, ", ord($1))/seg; print("\n");};print("};\n");' < bi1.bin >mySource.c
The above 3 one-liners can easily be converted or combined into a bash-script, which will extract the compressed data to its own file - or you can download the lz4cut script, which I've attached below.
Using the decompressor is quite easy. When calling the routine, you just need to point r0 to the compressed data and r1 to the address in RAM, where you want the decompressed data to go.
So in assembly language you could do it this way:
ldr r0,=ladybug256_lz4 /* point r0 to the compressed data */ ldr r1,=screen /* point r1 to where the decompressed data should go */ bl unlz4 /* decompress the data */
To include the binary data, you can use the .incbin directive:
.align 1 /* make sure the length halfword is placed on an even address */ ladybug256_lz4: .2byte (ladybug256_lz4_end - ladybug256_lz4) /* generate a length value in front of the data */ .incbin 'ladybug256.lz4' /* read binary file into this location */ ladybug256_lz4_end: /* end of compressed data */
If you want to use it from C or C++, you need to declare a function prototype:
void unlz4(const void *aSource, void *aDestination); void unlz4_len(const void *aSource, void *aDestination, uint32_t aLength);
Then you can call it this way:
unlz4(ladybug256_lz4, screen); /* decompress our data directly to the screen */
... or, if you think it's too tedious to include the length as the first 16 bit value in front of the compressed data ...
unlz4_len(castle_lz4, screen, sizeof(castle_lz4)); /* decompress our data directly to the screen */
This should allow you to squeeze more code and data into the flash memory of your Cortex-M based devices.
You can also use it with external SPI flash, because most Cortex-M microcontrollers can map external SPI Flash directly to your memory address space.
Update: Aug. 22: 2016: I've shaved off one clock cycle of the loop inside the copyData subroutine. The size of the routine is the same, but it performs better as there's one instruction less inside the loop. This change might make dramatic improvements for some use cases (especially real-time decompression. I recommend going through the copyData subroutine step-by-step, to see how it differs from a standard copy routine. If speed is an issue, remember that it's possible to unroll the subroutine.
Note: I've also added some one-liners and attached a full script in order to help you automate converting the files.
Does your code support Streaming API?
I think you can save a possible stall in your copy routine by reorganising the instructions and preparing the end pointer to point to the final byte.
copyData:
rsbs r4,r4,#0 /* index = -length */ subs r2,r2,r4 /* point to end of source */ subs r1,r1,r4 /* point to end of destination */ subs r1,r1,#1 /* point back one byte (new) */
copyDataLoop:
ldrb r3,[r2,r4] /* read byte from source_end[-index] */ adds r4,r4,#1 /* increment index (reordered) */ strb r3,[r1,r4] /* store byte in destination_end[-index] */ bne copyDataLoop /* keep going until index wraps to 0 */
For the cost of 1 instruction for the setup, you're saving the plpeline stall that happens because the data isn't available in r3 yet. Obviously depends on the chip that you're using as to whether that's effective or not, but my vague recollections of ARM implies that it should be more efficient, and doesn't involve any unrolling.
So far, I've done some code-rewriting that confirms it's possible to add the check by only adding two bytes extra. I'm hoping that I can find a way to get it further reduced by 2 bytes, though. ;)
Note: it's only necessary to do the check at getOffset, so the other check can be deleted.
Is it possible to get the code for compressor for M3? I need to test compression and decompression on STM32L1 for a demo project. Can you also tell me what is the memory footprint for compression and decompression. Thanks
Regarding the issue @kimstik pointed out, I've come across the same problem, and placing a couple of instructions at getOffset helped.
So my code looks like: (modification in bold typeface)
getOffset: cmp r0,r5 /* check if we've reached the end of the compressed data */ bge bye /* if so, don't expect match offset any more. quit now */ ldrb r3,[r0,#0] /* get match offset's low byte */
... and by the end of the function, inserted a label:
bye: pop {r4-r6,pc} /* restore r4, r5 and r6, then return */
Additionally, I thought it would be useful if the function returns deflated data size, so here are the results:
unlz4_len: push {r4-r7,lr} /* save r4, r5, r6, r7 and return-address */ mov r7,r1 /* store destination address, to get output size later */
... and by the end of the function, again:
bye: subs r0,r1,r7 /* get output size */ pop {r4-r7,pc} /* restore r4, r5 r6 and r7, then return */
Note that in this case, the declarations for functions unlz4 and unlz4_len should also be modified to type int, obviously.
Enjoy!