Recently I spoke about an LZ4 decompression routine that I had converted from 6502 code to Arm Cortex-M0 code.
For some reason, I could not find my decompression routine, so I decided to convert it again. The result is below; the routine is now tested, bug-fixed and working.
I've kept it as Cortex-M0 code, even for Cortex-M3 and Cortex-M4. The code also works on Cortex-A, ARM7 and ARM9. A Cortex-M3/Cortex-M4 version could be made faster, at the expense of a few extra bytes.
Since the size of flash memory on most Cortex-M0 microcontrollers is quite small, it makes sense to use a compression method where the decompression routine is small as well. In addition to being small, the LZ4 decompression routine is quick.
The idea of using compression is to "expand" your code space: you keep your code compressed in flash memory, then unpack a routine into SRAM and execute it from there. Of course you can also decompress data; it all depends on your needs.
I've kept the code "fairly easy to read", even though it is slightly size-optimized; it still does not use any macros. This means you'll be able to speed-optimize it easily, for instance by improving the block-copy loops.
LZ4 is not complicated. Basically, a compressed block consists of one token byte followed by two different blocks of data. As an example, take the following small piece of compressed data, which is used in the walk-through below:
[8f] [31 36 61 0a 20 00 20 02] [01 00] [07]
The first byte is the token byte. It consists of two nibbles (that's two 4-bit values).
The first nibble of the token [8] tells the decompressor how many bytes to copy from the literal data section.
The second nibble of the token [f] holds the size of the match data minus 4; this tells the decompressor how many bytes to repeat from the already uncompressed data.
If the literal length is 15 (the maximum value), then more length data follows the token (immediately). Each byte read is added to the length; the length is complete when a byte value other than 255 is read.
(literal data follows the literal length)
The same applies to the match data length, except that the minimum match data length is 4; thus we'll need to add 4 to the length found when decompressing.
(The match offset follows the literal data; any extra match-length bytes follow the offset. A match offset of zero means end of compressed data.)
The literal data should be copied directly to the buffer holding the uncompressed data.
Match data is read from the last write-position of the uncompressed data minus the match offset; it's then copied to the end of the uncompressed data.
This is repeated until the match offset is zero.
Thus in the above example, the literal length is [8], which means we'll need to copy 8 bytes from the literal data section to the output.
The match length is [f] (15, the maximum value for a nibble) which means more length information follows the match offset.
The match offset is [01 00] (low byte first, then high byte), so the match offset is 1; this is how many bytes we step back from the end of the uncompressed buffer.
The complete match length is [f] + 4 + [07] = 0x1a (26); this is how many bytes we copy from the match position to the end of the uncompressed buffer.
That means our decompressed data will be: [31 36 61 0a 20 00 20 02] [02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02]
-You see, the byte [02] is first copied to the end of the uncompressed buffer; the source pointer is then advanced and now points to the byte we just wrote [02].
The destination pointer is also advanced and points right after the byte we just wrote, thus the byte [02] will be repeated.
If the match offset had been 3, then the last 3 bytes [00 20 02] would be repeated instead, resulting in the following uncompressed data:
[31 36 61 0a 20 00 20 02] [00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20 02 00 20]
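Before looking at the assembly version, it might help to see the same steps in C. The following is only an illustrative sketch of the algorithm described above (the function name unlz4_c and its parameters are mine, not part of the routine below); like the assembly routine, it expects a 16-bit length in front of the compressed block, it checks for the end of the input right after the literal copy (a standard LZ4 block may end with a literals-only sequence), and it does no bounds checking on the output buffer.

#include <stdint.h>
#include <stddef.h>

/* Illustrative C version of the decompression steps described above.
   src points to a 16-bit little-endian length followed by the compressed
   block; dst must be large enough to hold the decompressed data. */
static void unlz4_c(const uint8_t *src, uint8_t *dst)
{
    size_t len = (size_t)src[0] | ((size_t)src[1] << 8);  /* length of compressed data */
    src += 2;
    const uint8_t *end = src + len;            /* end of compressed data */

    while (src < end)
    {
        uint8_t token = *src++;                /* one token per sequence */

        /* literal length: the token's high nibble; 15 means more bytes follow */
        size_t count = token >> 4;
        if (count == 15)
        {
            uint8_t b;
            do { b = *src++; count += b; } while (b == 255);
        }
        while (count--)                        /* copy literals to the output */
            *dst++ = *src++;

        if (src >= end)                        /* a block may end with a literals-only sequence */
            break;

        /* two-byte match offset, low byte first */
        size_t offset = (size_t)src[0] | ((size_t)src[1] << 8);
        src += 2;

        /* match length: the token's low nibble plus 4; 15 means more bytes follow */
        count = token & 0x0f;
        if (count == 15)
        {
            uint8_t b;
            do { b = *src++; count += b; } while (b == 255);
        }
        count += 4;

        const uint8_t *match = dst - offset;   /* step back into the already decompressed data */
        while (count--)                        /* byte by byte, so overlapping matches repeat correctly */
            *dst++ = *match++;
    }
}

The important detail is that the match data is copied byte by byte: when the offset is smaller than the match length, the copy deliberately reads bytes it has just written, which is exactly what produces the repetition shown in the example above.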
.syntax unified
.cpu cortex-m0
.thumb

/* License: Public Domain - I cannot be held responsible for what it does
   or does not do if you use it, whether it's modified or not. */

/* Entry point = unlz4. On entry: r0 = source, r1 = destination.
   The first two bytes of the source must contain the length of the
   compressed data. */

.func unlz4
.global unlz4,unlz4_len
.type unlz4,%function
.type unlz4_len,%function

.thumb_func
unlz4:
    ldrh    r2,[r0]          /* get length of compressed data */
    adds    r0,r0,#2         /* advance source pointer */

.thumb_func
unlz4_len:
    push    {r4-r6,lr}       /* save r4, r5, r6 and return-address */
    adds    r5,r2,r0         /* point r5 to end of compressed data */
getToken:
    ldrb    r6,[r0]          /* get token */
    adds    r0,r0,#1         /* advance source pointer */
    lsrs    r4,r6,#4         /* get literal length, keep token in r6 */
    beq     getOffset        /* jump forward if there are no literals */
    bl      getLength        /* get length of literals */
    movs    r2,r0            /* point r2 to literals */
    bl      copyData         /* copy literals (r2=src, r1=dst, r4=len) */
    movs    r0,r2            /* update source pointer */
getOffset:
    ldrb    r3,[r0,#0]       /* get match offset's low byte */
    subs    r2,r1,r3         /* subtract from destination; this will become the match position */
    ldrb    r3,[r0,#1]       /* get match offset's high byte */
    lsls    r3,r3,#8         /* shift to high byte */
    subs    r2,r2,r3         /* subtract from match position */
    adds    r0,r0,#2         /* advance source pointer */
    lsls    r4,r6,#28        /* get rid of token's high 28 bits */
    lsrs    r4,r4,#28        /* move the 4 low bits back where they were */
    bl      getLength        /* get length of match data */
    adds    r4,r4,#4         /* minimum match length is 4 bytes */
    bl      copyData         /* copy match data (r2=src, r1=dst, r4=len) */
    cmp     r0,r5            /* check if we've reached the end of the compressed data */
    blt     getToken         /* if not, go get the next token */
    pop     {r4-r6,pc}       /* restore r4, r5 and r6, then return */

.thumb_func
getLength:
    cmp     r4,#0x0f         /* if length is 15, then more length info follows */
    bne     gotLength        /* jump forward if we have the complete length */
getLengthLoop:
    ldrb    r3,[r0]          /* read another byte */
    adds    r0,r0,#1         /* advance source pointer */
    adds    r4,r4,r3         /* add byte to length */
    cmp     r3,#0xff         /* check if end reached */
    beq     getLengthLoop    /* if not, go round loop */
gotLength:
    bx      lr               /* return */

.thumb_func
copyData:
    rsbs    r4,r4,#0         /* index = -length */
    subs    r2,r2,r4         /* point to end of source */
    subs    r1,r1,r4         /* point to end of destination */
copyDataLoop:
    ldrb    r3,[r2,r4]       /* read byte from source_end[-index] */
    strb    r3,[r1,r4]       /* store byte in destination_end[-index] */
    adds    r4,r4,#1         /* increment index */
    bne     copyDataLoop     /* keep going until index wraps to 0 */
    bx      lr               /* return */

.size unlz4,.-unlz4
.endfunc

/* 42 narrow instructions = 84 bytes */
Since the data is going to be stored as compressed data in your flash memory, there's no need for a compression routine in your firmware.
Instead, you can prepare the data using your computer, for instance by using Yann Collet's tools. I recommend using the -9 switch for best possible compression.
You need to know that the 'lz4' compression tool inserts a number of bytes as a header; this means you'll have to remove the first N bytes from the compressed file. The number of header bytes is usually 11, but it can vary. A number of bytes follow the compressed data as well (this may include a checksum and other data).
When the data is compressed, you can use the GNU assembler's .incbin directive for including it in your firmware; or you could make a small tool in Perl to generate a .c file containing the compressed data as hex numbers in a static const array.
You can either place the length as two bytes at the beginning of the compressed block (before the first token) and point r0 to this length, or you can load the length of the compressed data into r2 and call unlz4_len instead.
The hexdump tool can be used to determine the length of the header:
hexdump -ve'"" 16/1 "%02x " "\n"' -n 12 bi1.lz4
04 22 4d 18 64 40 a7 31 43 00 00 f7
The first 4 bytes (04 22 4d 18) identify the file as an LZ4 archive.
The next byte (64) is a flag byte, and this is the byte we're interested in. If bit 3 is set, then the header is 8 bytes longer than usual.
After the flag byte comes a block descriptor byte (40).
Then a header checksum byte follows (a7).
A 32-bit word in little-endian format follows. This is the length of the compressed data (0x00004331 = 17201 bytes).
Above, I've dumped 12 bytes. The 12th byte is the first token byte of the compressed data (f7).
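If you'd rather keep the whole .lz4 file in flash and skip the header at run time instead of cutting it on the host, the same check can be done in a few lines of C. This is only a sketch of the fields described above (the function name lz4_skip_header is mine); it checks the magic number, looks at bit 3 of the flag byte and reads the block size, but it ignores the header checksum and the optional dictionary-ID field.

#include <stdint.h>
#include <stddef.h>

/* Sketch of parsing the LZ4 frame header fields described above.
   Returns a pointer to the first token of the compressed block and stores
   the block's length in *blockLen, or returns NULL if the magic number is
   wrong. Header checksum and dictionary-ID are not handled. */
static const uint8_t *lz4_skip_header(const uint8_t *file, uint32_t *blockLen)
{
    /* magic number 04 22 4d 18 */
    if (file[0] != 0x04 || file[1] != 0x22 || file[2] != 0x4d || file[3] != 0x18)
        return NULL;

    size_t hlen = 11;                 /* magic + flag + block descriptor + checksum + block size */
    if (file[4] & 0x08)               /* flag byte bit 3: header is 8 bytes longer */
        hlen += 8;

    /* the 32-bit little-endian block size sits just before the first token */
    const uint8_t *p = file + hlen - 4;
    *blockLen = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    return file + hlen;               /* first token byte of the compressed data */
}

The returned pointer and the length it reports can then be passed to unlz4_len, whose C prototype is shown further down.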
-So we can make a small Perl one-liner that finds out the size of the header:
hlen=`perl -e 'my ($b,$i,$f,$j,$l);if(read(STDIN,$b,11)){ ($i,$f,$j,$l) = unpack("H8CA2V", $b); } print(("$i" eq "04224d18") ? (11+($f & 8)) . " count=$l" : 0,"\n");' < bi1.lz4`
... now the shell variable $hlen contains the result: 0 if the file is not an LZ4 file; otherwise 11 or 19, followed by a space and "count=" plus the size of the compressed data.
You can use the command-line tool 'dd' to cut the binary file:
dd if=bi1.lz4 of=bi1.bin bs=1 skip=$hlen
Note that $hlen is deliberately left unquoted: it expands into both the skip offset and a count= argument, so dd copies exactly the compressed data and drops anything that follows it.
Here's a small Perl one-liner to convert a binary file to a .c file (replace bi1.bin and mySource.c at the end):
perl -e 'print("static const uint8_t sMyCompressedData[] = {\n" ); my $b; while(my $l=read(STDIN, $b, 16)){ printf("\t"); $b =~ s/(.)/printf("0x%02x, ", ord($1))/seg; print("\n");};print("};\n");' < bi1.bin >mySource.c
The above 3 one-liners can easily be converted or combined into a bash script that extracts the compressed data to its own file - or you can download the lz4cut script, which I've attached below.
Using the decompressor is quite easy. When calling the routine, you just need to point r0 to the compressed data and r1 to the address in RAM, where you want the decompressed data to go.
So in assembly language you could do it this way:
    ldr     r0,=ladybug256_lz4    /* point r0 to the compressed data */
    ldr     r1,=screen            /* point r1 to where the decompressed data should go */
    bl      unlz4                 /* decompress the data */
To include the binary data, you can use the .incbin directive:
.align 1                     /* make sure the length halfword is placed on an even address */
ladybug256_lz4:
    .2byte  (ladybug256_lz4_end - ladybug256_lz4)    /* generate a length value in front of the data */
    .incbin "ladybug256.lz4"                         /* read binary file into this location */
ladybug256_lz4_end:          /* end of compressed data */
If you want to use it from C or C++, you need to declare a function prototype:
void unlz4(const void *aSource, void *aDestination);
void unlz4_len(const void *aSource, void *aDestination, uint32_t aLength);
Then you can call it this way:
unlz4(ladybug256_lz4, screen); /* decompress our data directly to the screen */
... or, if you think it's too tedious to include the length as the first 16-bit value in front of the compressed data ...
unlz4_len(castle_lz4, screen, sizeof(castle_lz4)); /* decompress our data directly to the screen */
This should allow you to squeeze more code and data into the flash memory of your Cortex-M based devices.
You can also use it with external SPI flash, since many Cortex-M microcontrollers can map external SPI flash directly into the memory address space.
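To illustrate the "unpack a routine into SRAM and execute it from there" idea mentioned at the beginning, here is a rough C sketch using the prototype above. The names blinker_lz4 and code_buffer are only placeholders; the compressed routine must be length-prefixed as described earlier, and the decompressed code must be position-independent (or linked to run at the buffer's address).

#include <stdint.h>

/* Prototype of the decompressor presented above. */
void unlz4(const void *aSource, void *aDestination);

/* Placeholder names: a length-prefixed compressed routine stored in flash,
   and an SRAM buffer large enough to hold the decompressed code. */
extern const uint8_t blinker_lz4[];
static uint8_t code_buffer[512] __attribute__((aligned(4)));

typedef void (*entry_t)(void);

void run_from_sram(void)
{
    unlz4(blinker_lz4, code_buffer);          /* unpack the routine into SRAM */

    /* Thumb code must be called with bit 0 of the address set. */
    entry_t entry = (entry_t)((uintptr_t)code_buffer | 1u);
    entry();
}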
Update, Aug. 22, 2016: I've shaved one clock cycle off the loop inside the copyData subroutine. The size of the routine is the same, but it performs better, as there's one instruction less inside the loop. This change might make a dramatic improvement for some use cases (especially real-time decompression). I recommend going through the copyData subroutine step by step, to see how it differs from a standard copy routine. If speed is an issue, remember that it's possible to unroll the loop.
Note: I've also added some one-liners and attached a full script in order to help you automate converting the files.
Great article. Just found it.
Actually, it resembles TurboPacker, which existed a long, long time ago on the Atari ST.
It has a similar compression ratio but does not support variable-length encoding (always max. 15+3 bytes).
Because the compressor is quite a lot more complex than the decompressor, I haven't made any attempt to write a compressor.
The C source is available for the compressor, so if you really need a compressor that runs on a Cortex-M microcontroller, I believe the easiest/best way would be to build it for the specific architecture that you need. You would probably prefer using -Os for size optimization, in order to keep the size of the compressor down to a minimum.
If I were to create a compressor, there are additional problems: where does the source data come from, and where would the compressed data go?
-Memory ?
-Flash ?!??
-SD-card
-Ethernet using FTP / HTTP ?
... There are a lot of choices; the above is only a fraction of them, but I believe a callback routine would probably be the best solution if a compressor were implemented on the chip.
As mentioned earlier, I really recommend using the command-line tools on your PC/Mac/Other desktop computer to compress the data and then store the binary data either in Flash memory or perhaps on SD-card.
If you're interfacing to a video camera, of course a compression routine would be interesting.
Having a 'pass-through' style callback subroutine interface / daisy chain would probably be a good approach in such cases, but I believe you would only be able to capture still images, not "live video", due to the long processing time for compressing the data.
Nice work Jens
Just out of curiosity, what about the compressor itself?
It would be nice if you could do that for the compressor. I'm not that familiar with ARM assembly!
I imagine that this will probably be used a lot for compressing pictures.
-Large data may require a minimum of 64K decompression buffer (I just tried with a 192K picture, and it requires at least a 64K buffer in RAM).
But if the compression utility could limit the size of the decompression buffer, it would be possible to decompress large pictures in small chunks, and send the chunks to, for instance, an SPI-based TFT display, even though the picture is larger than the available SRAM.
Maybe Yann can be convinced to add a limit for this in the lz4 command-line utility (I haven't asked yet).
But for devices that have more than 64K of RAM available, it's possible to make a decompressor that can decompress files of virtually any size.
This is because the match offset cannot be larger than 64K, since it's a two-byte value.
nice idea jens