LZ4 decompression routine for Cortex-M0 and later

August 22, 2016

10 minute read time.

Introduction

Recently I spoke about a LZ4 decompression routine I converted from 6502 code into a Arm Cortex-M0 code.

For some reason, I could not find my decompression routine, so I decided to convert it again. The result is below; the routine is now tested, bugfixed and works.

I've kept it as Cortex-M0 code, even for Cortex-M3 and Cortex-M4. The code also works on Cortex-A, ARM7 and ARM9. The Cortex-M3/Cortex-M4 version can be improved speed-wise, at the expense of extra bytes.

Why use LZ4 compression ?

Since the size of flash memory on most Cortex-M0 microcontrollers is quite small, it makes sense to use a compression method where the decompression routine is small as well. In addition to being small, the LZ4 decompression routine is quick.

The idea of using compression is to "expand" your code space, so that you keep your code compressed in flash memory, then unpack a routine in SRAM and execute it from there. Of course you can also decompress data; it all depends on your needs.

I've kept the code "fairly easy to read", though the code is slightly size-optimized - it still does not use any macros. This means you'll be able to speed-optimize it easily by for instance improving the block-copy loops.

The compressed data

LZ4 is not complicated. Basically a compression block consists of one token byte followed by two different blocks of data.

[Token byte] [Literal length] [Literal data] [Match offset] [Match length]

Example:

[8f] [no additional length info] [31 36 61 0a 20 00 20 02] [01 00] [07]

The first byte is the token byte. It consists of two nibbles (that's two 4-bit values).

The first nibble of the token [8] tells the decompressor how many bytes to copy from the literal data section.

The second nibble of the token [f] holds the size of the match data minus 4; this tells the decompressor how many bytes to repeat from the already uncompressed data.

If the literal length is 15 (maximum value), then more length data follows the token (immediately). Each byte is added until the byte value read differs from 255.

(literal data follows the literal length)

The same applies to the match data length; except from that the minimum match data length is 4; thus we'll need to add 4 to the found length when decompressing.

(match offset follows the match length. A match offset of zero means end of compressed data).

The literal data should be copied directly to the buffer holding the uncompressed data.

Match data is read from the last write-position of the uncompressed data minus the match offset; it's then copied to the end of the uncompressed data.

This is repeated until the match offset is zero.

Thus in the above example, the literal length is [8], which means we'll need to copy 8 bytes from the literal data section to the output.

The match length is [f] (15, the maximum value for a nibble) which means more length information follows the match offset.

The literal data section contains the bytes: [31 36 61 0a 20 00 20 02] - we copy those directly to the end of the uncompressed data buffer.

The match offset is [01 00] (low byte first, then high byte, so the match offset is 1; this is how many bytes we step back from the end of the uncompressed buffer.

The complete match length is [f] + 4 + [07] = 1a (26), this is how many bytes we copy from the match offset to the end of the uncompressed buffer.

That means our decompressed data will be: [31 36 61 0a 20 00 20 02] [02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02]

-You see, the byte [02] is first copied to the end of the uncompressed buffer, the source pointer is then advanced and now points to the byte we just wrote [02].

The destination pointer is also advanced and points right after the byte we just wrote, thusote [02] will be repeated.

If the match offset had been 3, then the last 3 bytes [00 20 02] would be repeated instead, resulting in the following uncompressed data:

[31 36 61 0a 20 00 20 02] [00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20]

The code

    .syntax             unified

                    .cpu                cortex-m0

                    .thumb

/* License: Public Domain - I cannot be held responsible for what it does or does not do if you use it, whether it's modified or not. */

/* Entry point = unlz4. On entry: r0 = source, r1 = destination. The first two bytes of the source must contain the length of the compressed data. */

                    .func               unlz4

                    .global             unlz4,unlz4_len

                    .type               unlz4,%function

                    .type               unlz4_len,%function

                    .thumb_func

unlz4:              ldrh                r2,[r0]             /* get length of compressed data */

                    adds                r0,r0,#2            /* advance source pointer */

                    .thumb_func

unlz4_len:          push                {r4-r6,lr}          /* save r4, r5, r6 and return-address */

                    adds                r5,r2,r0            /* point r5 to end of compressed data */

getToken:           ldrb                r6,[r0]             /* get token */

                    adds                r0,r0,#1            /* advance source pointer */

                    lsrs                r4,r6,#4            /* get literal length, keep token in r6 */

                    beq                 getOffset           /* jump forward if there are no literals */

                    bl                  getLength           /* get length of literals */

                    movs                r2,r0               /* point r2 to literals */

                    bl                  copyData            /* copy literals (r2=src, r1=dst, r4=len) */

                    movs                r0,r2               /* update source pointer */

getOffset:          ldrb                r3,[r0,#0]          /* get match offset's low byte */

                    subs                r2,r1,r3            /* subtract from destination; this will become the match position */

                    ldrb                r3,[r0,#1]          /* get match offset's high byte */

                    lsls                r3,r3,#8            /* shift to high byte */

                    subs                r2,r2,r3            /* subtract from match position */

                    adds                r0,r0,#2            /* advance source pointer */

                    lsls                r4,r6,#28           /* get rid of token's high 28 bits */

                    lsrs                r4,r4,#28           /* move the 4 low bits back where they were */

                    bl                  getLength           /* get length of match data */

                    adds                r4,r4,#4            /* minimum match length is 4 bytes */

                    bl                  copyData            /* copy match data (r2=src, r1=dst, r4=len) */

                    cmp                 r0,r5               /* check if we've reached the end of the compressed data */

                    blt                 getToken            /* if not, go get the next token */

                    pop                 {r4-r6,pc}          /* restore r4, r5 and r6, then return */

                    .thumb_func

getLength:          cmp                 r4,#0x0f            /* if length is 15, then more length info follows */

                    bne                 gotLength           /* jump forward if we have the complete length */

getLengthLoop:      ldrb                r3,[r0]             /* read another byte */

                    adds                r0,r0,#1            /* advance source pointer */

                    adds                r4,r4,r3            /* add byte to length */

                    cmp                 r3,#0xff            /* check if end reached */

                    beq                 getLengthLoop       /* if not, go round loop */

gotLength:          bx                  lr                  /* return */

                    .thumb_func

copyData:           rsbs                r4,r4,#0            /* index = -length */

                    subs                r2,r2,r4            /* point to end of source */

                    subs                r1,r1,r4            /* point to end of destination */

copyDataLoop:       ldrb                r3,[r2,r4]          /* read byte from source_end[-index] */

                    strb                r3,[r1,r4]          /* store byte in destination_end[-index] */

                    adds                r4,r4,#1            /* increment index */

                    bne                 copyDataLoop        /* keep going until index wraps to 0 */

                    bx                  lr                  /* return */

                    .size               unlz4,.-unlz4

                    .endfunc

/* 42 narrow instructions = 84 bytes */

Compressing your data

Since the data is going to be stored as compressed data in your flash memory, there's no need for a compression routine in your firmware.

Instead, you can prepare the data using your computer, for instance by using Yann Collet's tools. I recommend using the -9 switch for best possible compression.

You need to know that the 'lz4' compression tool insert a number of bytes as a header; this means you'll have to remove the first N bytes from the compressed file. The number of bytes are usually 11; but it could vary. A number of bytes follow the compressed data as well (this may include a checksum and other data).

When the data is compressed, you can use the GNU assembler's .incbin directive for including it in your firmware; or you could make a small tool in Perl to generate a .c file containing the compressed data as hex numbers in a static const array.

You can either place the length as two bytes in the beginning of the compressed block (before the first token) and point r0 to this length - or you can load the length of the compressed data into r2 and call unlz4_len instead.

The hexdump tool can be used for determining the length of the header:

hexdump -ve'"" 16/1 "%02x " "\n"' -n 12 bi1.lz4

04 22 4d 18 64 40 a7 31 43 00 00 f7

The first 4 bytes (04 22 4d 18) identifies the file as a LZ4 archive.

The next byte (64) is a flag byte, and this is the byte that we're interested in. If it has bit 3 set, then the header is 8 bytes longer than usual.

After the flag byte is a block descriptor byte (40)

Then a header checksum byte follows (a7)

A 32-bit word in Little Endian format follows. This is the length of the compressed data (0x00004331 = 17201 bytes)

Above I've dumped 12 bytes. The 12th byte is the first token byte of the compressed data (f7)

-So we can make a small perl one-liner, which finds out the size of the header:

hlen=`perl -e 'my ($b,$i,$f,$j,$l);if(read(STDIN,$b,11)){ ($i,$f,$j,$l) = unpack("H8CA2V", $b); } print(("$i" eq "04224d18") ? (11+($f & 8)) . " count=$l" : 0,"\n");' < bi1.lz4`

... now the shell variable $hlen contains the result, which is 0 if the file is not a lz4 file, but if it's a lz4 file, hlen is 11 or 19 followed by a space and "count=" and the size of the compressed data.

You can use the command-line tool 'dd' to cut the binary file:

dd if=bi1.lz4 of=bi1.bin bs=1 iseek=$hlen

Here's a small perl one-liner to convert a binary file to a .c file (replace bi1.lz4 and mySource.c at the end:

perl -e 'print("static const uint8_t sMyCompressedData[] = {\n" ); my $b; while(my $l=read(STDIN, $b, 16)){ printf("\t"); $b =~ s/(.)/printf("0x%02x, ", ord($1))/seg; print("\n");};print("};\n");' < bi1.bin >mySource.c

The above 3 one-liners can easily be converted or combined into a bash-script, which will extract the compressed data to its own file - or you can download the lz4cut script, which I've attached below.

Using the decompressor

Using the decompressor is quite easy. When calling the routine, you just need to point r0 to the compressed data and r1 to the address in RAM, where you want the decompressed data to go.

So in assembly language you could do it this way:

       ldr                 r0,=ladybug256_lz4  /* point r0 to the compressed data */

                    ldr                 r1,=screen          /* point r1 to where the decompressed data should go */

                    bl                  unlz4               /* decompress the data */

To include the binary data, you can use the .incbin directive:

                    .align              1                   /* make sure the length halfword is placed on an even address */

ladybug256_lz4:     .2byte              (ladybug256_lz4_end - ladybug256_lz4) /* generate a length value in front of the data */

                    .incbin             'ladybug256.lz4'    /* read binary file into this location */

ladybug256_lz4_end:                                         /* end of compressed data */

If you want to use it from C or C++, you need to declare a function prototype:

void unlz4(const void *aSource, void *aDestination);

void unlz4_len(const void *aSource, void *aDestination, uint32_t aLength);

Then you can call it this way:

unlz4(ladybug256_lz4, screen); /* decompress our data directly to the screen */

... or, if you think it's too tedious to include the length as the first 16 bit value in front of the compressed data ...

unlz4_len(castle_lz4, screen, sizeof(castle_lz4)); /* decompress our data directly to the screen */

This should allow you to squeeze more code and data into the flash memory of your Cortex-M based devices.

You can also use it with external SPI flash, because most Cortex-M microcontrollers can map external SPI Flash directly to your memory address space.

Update: Aug. 22: 2016: I've shaved off one clock cycle of the loop inside the copyData subroutine. The size of the routine is the same, but it performs better as there's one instruction less inside the loop. This change might make dramatic improvements for some use cases (especially real-time decompression. I recommend going through the copyData subroutine step-by-step, to see how it differs from a standard copy routine. If speed is an issue, remember that it's possible to unroll the subroutine.

Note: I've also added some one-liners and attached a full script in order to help you automate converting the files.

lz4cut.tar.bz2

Top Comments

SeregaKai over 4 years ago

Does your code support Streaming API?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
gerph over 5 years ago

I think you can save a possible stall in your copy routine by reorganising the instructions and preparing the end pointer to point to the final byte.

copyData:

   rsbs r4,r4,#0 /* index = -length */
   subs r2,r2,r4 /* point to end of source */
   subs r1,r1,r4 /* point to end of destination */
   subs r1,r1,#1 /* point back one byte (new) */

copyDataLoop:

   ldrb r3,[r2,r4] /* read byte from source_end[-index] */
   adds r4,r4,#1 /* increment index (reordered) */
   strb r3,[r1,r4] /* store byte in destination_end[-index] */
   bne copyDataLoop /* keep going until index wraps to 0 */

For the cost of 1 instruction for the setup, you're saving the plpeline stall that happens because the data isn't available in r3 yet. Obviously depends on the chip that you're using as to whether that's effective or not, but my vague recollections of ARM implies that it should be more efficient, and doesn't involve any unrolling.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Jens Bauer over 7 years ago in reply to dongmok

So far, I've done some code-rewriting that confirms it's possible to add the check by only adding two bytes extra. I'm hoping that I can find a way to get it further reduced by 2 bytes, though. ;)

Note: it's only necessary to do the check at getOffset, so the other check can be deleted.
- Cancel
- Up +1 Down
- Reply
- More
- Cancel
Andrea Zanotto over 7 years ago

Is it possible to get the code for compressor for M3? I need to test compression and decompression on STM32L1 for a demo project. Can you also tell me what is the memory footprint for compression and decompression. Thanks
- Cancel
- Up -1 Down
- Reply
- More
- Cancel
dongmok over 7 years ago in reply to Jens Bauer

Regarding the issue @kimstik pointed out, I've come across the same problem, and placing a couple of instructions at getOffset helped.

So my code looks like: (modification in bold typeface)

getOffset: cmp r0,r5 /* check if we've reached the end of the compressed data */
  bge bye /* if so, don't expect match offset any more. quit now */
ldrb r3,[r0,#0] /* get match offset's low byte */

... and by the end of the function, inserted a label:

bye:    pop {r4-r6,pc} /* restore r4, r5 and r6, then return */

Additionally, I thought it would be useful if the function returns deflated data size, so here are the results:

unlz4_len: push {r4-r7,lr} /* save r4, r5, r6, r7 and return-address */
mov r7,r1 /* store destination address, to get output size later */

... and by the end of the function, again:

bye:     subs r0,r1,r7 /* get output size */
pop {r4-r7,pc} /* restore r4, r5 r6 and r7, then return */

Note that in this case, the declarations for functions unlz4 and unlz4_len should also be modified to type int, obviously.

Enjoy!
- Cancel
- Up +2 Down
- Reply
- More
- Cancel

Architectures and Processors blog

Scalable Matrix Extension: Expanding the Arm Intrinsics Search Engine

Chris Walsh

Arm is pleased to announce that the Arm Intrinsics Search Engine has been updated to include the Scalable Matrix Extension (SME) intrinsics, including both SME and SME2 intrinsics.
- October 3, 2025
Arm A-Profile Architecture developments 2025

Martin Weidmann

Each year, Arm publishes updates to the A-Profile architecture alongside full Instruction Set and System Register documentation. In 2025, the update is Armv9.7-A.
- October 2, 2025
When a barrier does not block: The pitfalls of partial order

Wathsala Vithanage

Acquire fences aren’t always enough. See how LDAPR exposed unsafe interleavings and what we did to patch the problem.
- September 15, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog