Recently I spoke about an LZ4 decompression routine that I had converted from 6502 code to Arm Cortex-M0 code.
For some reason, I could not find my decompression routine, so I decided to convert it again. The result is below; the routine is now tested, bug-fixed and working.
I've kept it as Cortex-M0 code, even for Cortex-M3 and Cortex-M4. The code also works on Cortex-A, ARM7 and ARM9. A Cortex-M3/Cortex-M4 version could be made faster, at the expense of a few extra bytes.
Since the size of flash memory on most Cortex-M0 microcontrollers is quite small, it makes sense to use a compression method where the decompression routine is small as well. In addition to being small, the LZ4 decompression routine is quick.
The idea of using compression is to "expand" your code space: you keep your code compressed in flash memory, then unpack a routine into SRAM and execute it from there. Of course you can also decompress data; it all depends on your needs.
I've kept the code "fairly easy to read", even though it is slightly size-optimized; it still does not use any macros. This means you'll be able to speed-optimize it easily, for instance by improving the block-copy loops.
LZ4 is not complicated. Basically, a compressed block consists of one token byte followed by two different blocks of data. As an example, take the following small piece of compressed data, which is used in the walk-through below:
[8f] [31 36 61 0a 20 00 20 02] [01 00] [07]
The first byte is the token byte. It consists of two nibbles (that's two 4-bit values).
The first nibble of the token [8] tells the decompressor how many bytes to copy from the literal data section.
The second nibble of the token [f] holds the size of the match data minus 4; this tells the decompressor how many bytes to repeat from the already uncompressed data.
If the literal length is 15 (the maximum value), then more length data follows the token (immediately). Each byte read is added to the length; the length is complete when a byte value other than 255 is read.
(literal data follows the literal length)
The same applies to the match data length, except that the minimum match data length is 4; thus we'll need to add 4 to the length found when decompressing.
(The match offset follows the literal data; any extra match-length bytes follow the offset. A match offset of zero means end of compressed data.)
The literal data should be copied directly to the buffer holding the uncompressed data.
Match data is read from the last write-position of the uncompressed data minus the match offset; it's then copied to the end of the uncompressed data.
This is repeated until the match offset is zero.
Thus in the above example, the literal length is [8], which means we'll need to copy 8 bytes from the literal data section to the output.
The match length is [f] (15, the maximum value for a nibble) which means more length information follows the match offset.
The match offset is [01 00] (low byte first, then high byte), so the match offset is 1; this is how many bytes we step back from the end of the uncompressed buffer.
The complete match length is [f] + 4 + [07] = 0x1a (26); this is how many bytes we copy from the match position to the end of the uncompressed buffer.
That means our decompressed data will be: [31 36 61 0a 20 00 20 02] [02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02]
-You see, the byte [02] is first copied to the end of the uncompressed buffer; the source pointer is then advanced and now points to the byte we just wrote [02].
The destination pointer is also advanced and points right after the byte we just wrote, thus the byte [02] will be repeated.
If the match offset had been 3, then the last 3 bytes [00 20 02] would be repeated instead, resulting in the following uncompressed data:
[31 36 61 0a 20 00 20 02] [00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20]
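Before looking at the assembly version, it might help to see the same steps in C. The following is only an illustrative sketch of the algorithm described above (the function name unlz4_c and its parameters are mine, not part of the routine below); like the assembly routine, it expects a 16-bit length in front of the compressed block, it checks for the end of the input right after the literal copy (a standard LZ4 block may end with a literals-only sequence), and it does no bounds checking on the output buffer.

#include <stdint.h>
#include <stddef.h>

/* Illustrative C version of the decompression steps described above.
   src points to a 16-bit little-endian length followed by the compressed
   block; dst must be large enough to hold the decompressed data. */
static void unlz4_c(const uint8_t *src, uint8_t *dst)
{
    size_t len = (size_t)src[0] | ((size_t)src[1] << 8);  /* length of compressed data */
    src += 2;
    const uint8_t *end = src + len;            /* end of compressed data */

    while (src < end)
    {
        uint8_t token = *src++;                /* one token per sequence */

        /* literal length: the token's high nibble; 15 means more bytes follow */
        size_t count = token >> 4;
        if (count == 15)
        {
            uint8_t b;
            do { b = *src++; count += b; } while (b == 255);
        }
        while (count--)                        /* copy literals to the output */
            *dst++ = *src++;

        if (src >= end)                        /* a block may end with a literals-only sequence */
            break;

        /* two-byte match offset, low byte first */
        size_t offset = (size_t)src[0] | ((size_t)src[1] << 8);
        src += 2;

        /* match length: the token's low nibble plus 4; 15 means more bytes follow */
        count = token & 0x0f;
        if (count == 15)
        {
            uint8_t b;
            do { b = *src++; count += b; } while (b == 255);
        }
        count += 4;

        const uint8_t *match = dst - offset;   /* step back into the already decompressed data */
        while (count--)                        /* byte by byte, so overlapping matches repeat correctly */
            *dst++ = *match++;
    }
}

The important detail is that the match data is copied byte by byte: when the offset is smaller than the match length, the copy deliberately reads bytes it has just written, which is exactly what produces the repetition shown in the example above.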
.syntax unified
.cpu cortex-m0
.thumb

/* License: Public Domain - I cannot be held responsible for what it does
   or does not do if you use it, whether it's modified or not. */

/* Entry point = unlz4. On entry: r0 = source, r1 = destination.
   The first two bytes of the source must contain the length of the
   compressed data. */

.func unlz4
.global unlz4,unlz4_len
.type unlz4,%function
.type unlz4_len,%function

.thumb_func
unlz4:
    ldrh    r2,[r0]          /* get length of compressed data */
    adds    r0,r0,#2         /* advance source pointer */

.thumb_func
unlz4_len:
    push    {r4-r6,lr}       /* save r4, r5, r6 and return-address */
    adds    r5,r2,r0         /* point r5 to end of compressed data */
getToken:
    ldrb    r6,[r0]          /* get token */
    adds    r0,r0,#1         /* advance source pointer */
    lsrs    r4,r6,#4         /* get literal length, keep token in r6 */
    beq     getOffset        /* jump forward if there are no literals */
    bl      getLength        /* get length of literals */
    movs    r2,r0            /* point r2 to literals */
    bl      copyData         /* copy literals (r2=src, r1=dst, r4=len) */
    movs    r0,r2            /* update source pointer */
getOffset:
    ldrb    r3,[r0,#0]       /* get match offset's low byte */
    subs    r2,r1,r3         /* subtract from destination; this will become the match position */
    ldrb    r3,[r0,#1]       /* get match offset's high byte */
    lsls    r3,r3,#8         /* shift to high byte */
    subs    r2,r2,r3         /* subtract from match position */
    adds    r0,r0,#2         /* advance source pointer */
    lsls    r4,r6,#28        /* get rid of token's high 28 bits */
    lsrs    r4,r4,#28        /* move the 4 low bits back where they were */
    bl      getLength        /* get length of match data */
    adds    r4,r4,#4         /* minimum match length is 4 bytes */
    bl      copyData         /* copy match data (r2=src, r1=dst, r4=len) */
    cmp     r0,r5            /* check if we've reached the end of the compressed data */
    blt     getToken         /* if not, go get the next token */
    pop     {r4-r6,pc}       /* restore r4, r5 and r6, then return */

.thumb_func
getLength:
    cmp     r4,#0x0f         /* if length is 15, then more length info follows */
    bne     gotLength        /* jump forward if we have the complete length */
getLengthLoop:
    ldrb    r3,[r0]          /* read another byte */
    adds    r0,r0,#1         /* advance source pointer */
    adds    r4,r4,r3         /* add byte to length */
    cmp     r3,#0xff         /* check if end reached */
    beq     getLengthLoop    /* if not, go round loop */
gotLength:
    bx      lr               /* return */

.thumb_func
copyData:
    rsbs    r4,r4,#0         /* index = -length */
    subs    r2,r2,r4         /* point to end of source */
    subs    r1,r1,r4         /* point to end of destination */
copyDataLoop:
    ldrb    r3,[r2,r4]       /* read byte from source_end[-index] */
    strb    r3,[r1,r4]       /* store byte in destination_end[-index] */
    adds    r4,r4,#1         /* increment index */
    bne     copyDataLoop     /* keep going until index wraps to 0 */
    bx      lr               /* return */

.size unlz4,.-unlz4
.endfunc

/* 42 narrow instructions = 84 bytes */
Since the data is going to be stored as compressed data in your flash memory, there's no need for a compression routine in your firmware.
Instead, you can prepare the data using your computer, for instance by using Yann Collet's tools. I recommend using the -9 switch for best possible compression.
You need to know that the 'lz4' compression tool inserts a number of bytes as a header; this means you'll have to remove the first N bytes from the compressed file. The number of header bytes is usually 11, but it can vary. A number of bytes follow the compressed data as well (this may include a checksum and other data).
When the data is compressed, you can use the GNU assembler's .incbin directive for including it in your firmware; or you could make a small tool in Perl to generate a .c file containing the compressed data as hex numbers in a static const array.
You can either place the length as two bytes at the beginning of the compressed block (before the first token) and point r0 to this length, or you can load the length of the compressed data into r2 and call unlz4_len instead.
The hexdump tool can be used to determine the length of the header:
hexdump -ve'"" 16/1 "%02x " "\n"' -n 12 bi1.lz4
04 22 4d 18 64 40 a7 31 43 00 00 f7
The first 4 bytes (04 22 4d 18) identify the file as an LZ4 archive.
The next byte (64) is a flag byte, and this is the byte we're interested in. If bit 3 is set, then the header is 8 bytes longer than usual.
After the flag byte comes a block descriptor byte (40).
Then a header checksum byte follows (a7).
A 32-bit word in little-endian format follows. This is the length of the compressed data (0x00004331 = 17201 bytes).
Above, I've dumped 12 bytes. The 12th byte is the first token byte of the compressed data (f7).
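If you'd rather keep the whole .lz4 file in flash and skip the header at run time instead of cutting it on the host, the same check can be done in a few lines of C. This is only a sketch of the fields described above (the function name lz4_skip_header is mine); it checks the magic number, looks at bit 3 of the flag byte and reads the block size, but it ignores the header checksum and the optional dictionary-ID field.

#include <stdint.h>
#include <stddef.h>

/* Sketch of parsing the LZ4 frame header fields described above.
   Returns a pointer to the first token of the compressed block and stores
   the block's length in *blockLen, or returns NULL if the magic number is
   wrong. Header checksum and dictionary-ID are not handled. */
static const uint8_t *lz4_skip_header(const uint8_t *file, uint32_t *blockLen)
{
    /* magic number 04 22 4d 18 */
    if (file[0] != 0x04 || file[1] != 0x22 || file[2] != 0x4d || file[3] != 0x18)
        return NULL;

    size_t hlen = 11;                 /* magic + flag + block descriptor + checksum + block size */
    if (file[4] & 0x08)               /* flag byte bit 3: header is 8 bytes longer */
        hlen += 8;

    /* the 32-bit little-endian block size sits just before the first token */
    const uint8_t *p = file + hlen - 4;
    *blockLen = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    return file + hlen;               /* first token byte of the compressed data */
}

The returned pointer and the length it reports can then be passed to unlz4_len, whose C prototype is shown further down.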
-So we can make a small Perl one-liner that finds out the size of the header:
hlen=`perl -e 'my ($b,$i,$f,$j,$l);if(read(STDIN,$b,11)){ ($i,$f,$j,$l) = unpack("H8CA2V", $b); } print(("$i" eq "04224d18") ? (11+($f & 8)) . " count=$l" : 0,"\n");' < bi1.lz4`
... now the shell variable $hlen contains the result: 0 if the file is not an LZ4 file; otherwise 11 or 19, followed by a space and "count=" plus the size of the compressed data.
You can use the command-line tool 'dd' to cut the binary file:
dd if=bi1.lz4 of=bi1.bin bs=1 skip=$hlen
Note that $hlen is deliberately left unquoted: it expands into both the skip offset and a count= argument, so dd copies exactly the compressed data and drops anything that follows it.
Here's a small Perl one-liner to convert a binary file to a .c file (replace bi1.bin and mySource.c at the end):
perl -e 'print("static const uint8_t sMyCompressedData[] = {\n" ); my $b; while(my $l=read(STDIN, $b, 16)){ printf("\t"); $b =~ s/(.)/printf("0x%02x, ", ord($1))/seg; print("\n");};print("};\n");' < bi1.bin >mySource.c
The above 3 one-liners can easily be converted or combined into a bash script that extracts the compressed data to its own file - or you can download the lz4cut script, which I've attached below.
Using the decompressor is quite easy. When calling the routine, you just need to point r0 to the compressed data and r1 to the address in RAM, where you want the decompressed data to go.
So in assembly language you could do it this way:
    ldr     r0,=ladybug256_lz4    /* point r0 to the compressed data */
    ldr     r1,=screen            /* point r1 to where the decompressed data should go */
    bl      unlz4                 /* decompress the data */
To include the binary data, you can use the .incbin directive:
.align 1                     /* make sure the length halfword is placed on an even address */
ladybug256_lz4:
    .2byte  (ladybug256_lz4_end - ladybug256_lz4)    /* generate a length value in front of the data */
    .incbin "ladybug256.lz4"                         /* read binary file into this location */
ladybug256_lz4_end:          /* end of compressed data */
If you want to use it from C or C++, you need to declare a function prototype:
void unlz4(const void *aSource, void *aDestination);
void unlz4_len(const void *aSource, void *aDestination, uint32_t aLength);
Then you can call it this way:
unlz4(ladybug256_lz4, screen); /* decompress our data directly to the screen */
... or, if you think it's too tedious to include the length as the first 16-bit value in front of the compressed data ...
unlz4_len(castle_lz4, screen, sizeof(castle_lz4)); /* decompress our data directly to the screen */
This should allow you to squeeze more code and data into the flash memory of your Cortex-M based devices.
You can also use it with external SPI flash, since many Cortex-M microcontrollers can map external SPI flash directly into the memory address space.
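To illustrate the "unpack a routine into SRAM and execute it from there" idea mentioned at the beginning, here is a rough C sketch using the prototype above. The names blinker_lz4 and code_buffer are only placeholders; the compressed routine must be length-prefixed as described earlier, and the decompressed code must be position-independent (or linked to run at the buffer's address).

#include <stdint.h>

/* Prototype of the decompressor presented above. */
void unlz4(const void *aSource, void *aDestination);

/* Placeholder names: a length-prefixed compressed routine stored in flash,
   and an SRAM buffer large enough to hold the decompressed code. */
extern const uint8_t blinker_lz4[];
static uint8_t code_buffer[512] __attribute__((aligned(4)));

typedef void (*entry_t)(void);

void run_from_sram(void)
{
    unlz4(blinker_lz4, code_buffer);          /* unpack the routine into SRAM */

    /* Thumb code must be called with bit 0 of the address set. */
    entry_t entry = (entry_t)((uintptr_t)code_buffer | 1u);
    entry();
}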
Update, Aug. 22, 2016: I've shaved one clock cycle off the loop inside the copyData subroutine. The size of the routine is the same, but it performs better, as there's one instruction less inside the loop. This change might make a dramatic improvement for some use cases (especially real-time decompression). I recommend going through the copyData subroutine step by step, to see how it differs from a standard copy routine. If speed is an issue, remember that it's possible to unroll the loop.
Note: I've also added some one-liners and attached a full script in order to help you automate converting the files.
Great article. Just found it.
Actually, it resembles TurboPacker, which existed a long, long time ago on the Atari ST.
It has a similar compression ratio but does not support variable-length encoding (always max. 15+3 bytes).
Because the compressor is quite a lot more complex than the decompressor, I haven't made any attempt to write a compressor.
The C source is available for the compressor, so if you really need a compressor that runs on a Cortex-M microcontroller, I believe the easiest/best way would be to build it for the specific architecture that you need. You would probably prefer using -Os for size optimization, in order to keep the size of the compressor down to a minimum.
If I were to create a compressor, there are additional problems: where does the source data come from, and where would the compressed data go?
-Memory ?
-Flash ?!??
-SD-card
-Ethernet using FTP / HTTP ?
... There are a lot of choices; the above is only a fraction of them, but I believe a callback routine would probably be the best solution if a compressor were implemented on the chip.
As mentioned earlier, I really recommend using the command-line tools on your PC/Mac/Other desktop computer to compress the data and then store the binary data either in Flash memory or perhaps on SD-card.
If you're interfacing to a video camera, of course a compression routine would be interesting.
Having a 'pass-through' style callback subroutine interface / daisy chain would probably be a good approach in such cases, but I believe you would only be able to capture still images, not "live video", due to the long processing time for compressing the data.
Nice work Jens
Just out of curiosity, what about the compressor itself?
It would be nice if you could do that for the compressor. I'm not that familiar with ARM assembly!
I imagine that this will probably be used a lot for compressing pictures.
-Large data may require a minimum of 64K decompression buffer (I just tried with a 192K picture, and it requires at least a 64K buffer in RAM).
But if the compression utility could limit the size of the decompression buffer, it would be possible to decompress large pictures in small chunks, and send the chunks to, for instance, an SPI-based TFT display, even though the picture is larger than the available SRAM.
Maybe Yann can be convinced to add a limit for this in the lz4 command-line utility (I haven't asked yet).
But for devices that have more than 64K of RAM available, it's possible to make a decompressor that can decompress files of virtually any size.
This is because the match offset cannot be larger than 64K, since it's a two-byte value.
nice idea jens