A fairly quick Count Leading Zeroes for Cortex-M0

October 3, 2014

9 minute read time.

The Basics

Some of us need to find out how many leading zero-bits there are in a 32-bit word. Such a feature is useful on many occasions, especially when writing a fast divide subroutine.

The Cortex-M3 and later have a CLZ instruction which can count those leading zeroes. But as you probably already know, the Cortex-M0 does not have an instruction for this. Instead you must use either a provided subroutine (usually written in C) or write your own.

Let's try rolling our own, shall we ?

First, let's define a few macros, so that we can make our source-code look nice and tidy.

It is a good idea to use unified syntax; start with this line at the top of your file, in order to make the assembler automatically create IT instructions:

.syntax unified

The macros I'm writing in this document are in the GNU-Assembler (GAS) style, but if you're using a different assembler, I think they will not be too difficult to convert:

                   .macro              FUNCTION name

                    .global             \name

                    .func               \name,\name

                    .type               \name,%function

                    .thumb_func

\name\():

                    .endm

                    .macro              ENDFUNC name

                    .pool

                    .size               \name,. - \name

                    .endfunc

                    .endm

A simple 'naive' solution would be to count each bit. While this is slow, it's actually very short, so it's worth mentioning.

If you've read any of my earlier documents, you will know that I tend to mention the number of clock cycles that an instruction uses. Those are the numbers in square brackets.

 FUNCTION            _clz_naive          /* in: r0 = 32-bit word to count leading zero bits in. r1=destroyed. */

                    movs                r1,#32              /* [1] assume we have 32 zero bits */

                    tst                 r0,r0               /* [1] check if r0 is zero */

                    beq                 clz_done\@          /* [1/3] if it's zero, we're done */

clz_loop\@:         subs                r1,r1,#1            /* [1] decrement leading zero bit count */

                    lsrs                r0,r0,#1            /* [1] shift r0 to the right */

                    bne                 clz_loop\@          /* [1/3] if r0 didn't become zero yet, have another go at it */

clz_done\@:         movs                r0,r1               /* [1] transfer result to r0 */

                    ENDFUNC             _clz_naive          /* out: result. Total CPU time spent: between 6 and 162 clock cycles) */

Worst case for the above snippet is 162 clock cycles, not bad, huh? We can do better than that!

Optimization

If you've written a routine to count leading zeroes, you are probably familiar with a couple of shortcuts. It's common to use logical AND to check if the first 16 bits are clear, if so, we can save a lot of clock-cycles on testing the first 16 bits alone.

In the same way, the next 8 bits are tested, then 4 bits, 2 bits and 1 bit. But let's also remember that we're using an ARM microcontroller here. As you might know, ARM microcontrollers are pretty fond of bit-shifting.

Can we take advantage of this? Let's try writing a macro, which checks if the top N bits are zero:

                 .macro              check_bits_t bits

                    lsrs                r2,r1,#\bits        /* [1] test next N bits. This updates the Z flag */

                    beq                 skip\@              /* [1/3] if all the bits were zero, jump forward */

                    subs                r0,r0,#\bits        /* [1] otherwise decrement our counter (the result) */

                    mov                 r1,r2               /* [1] and keep the new value */

skip\@:

                    .endm                                   /* (4) this macro will always use 4 clock cycles */

So what we do here, is that we take the word containing our bits to test, shift it to the right N times. Let's assume that's 16 times, so a 32-bit word like this ...

00101001010101111010110101001111 (0x2957ad4f)

...would look like this after shifting it...

00000000000000000010100101010111 (0x00002957)

...that means we're testing if all of the high 16 bits are zero. It seems we weren't that lucky this time, though.

But we've still saved a lot of time. We now know that we have less than 16 leading zeroes, correct ? So we subtract 16 from our counter.

Since the high 16 bits did not contain all zeroes, we'll also keep the shifted value, so we can count those bits instead of the low 16 bits. Had the high 16 bits been zero, we would want to get the original value with the low 16 bits intact.

We can now use this macro a couple of times, in order to count all our bits. Let's have a go at it.

First of all, the above macro wants r1 to contain the word we want to count bits in. It also wants r0 to be the counter. As we're subtracting from the counter, we'll start it at 32, because there are 32 bits in a 32-bit word, right ?

                .macro              _clz rD,rS          /* count leading zeroes in rS, result is in rD. Destroys r1 and r2 */

                    movs                r1,\rS              /* [1] transfer word to count to r1 */

                    movs                r0,#32              /* [1] we have 32 bits in a 32-bit word, remember ? */

                    check_bits_t        16                  /* [4] test first 16 bits */

                    check_bits_t        8                   /* [4] test bits 31..24 or 15..8 */

                    check_bits_t        4                   /* [4] test bits 31..28, 23..20, 15..12 or 7..4 */

                    check_bits_t        2                   /* [4] test bits 31..30, 27..26, 23..22, 19..18, 15..14, 11..10, 7..6 or 3..2 */

                    check_bits_t        1                   /* [4] test bits 31, 29, 27, 25, 23, 21, 19, 17, 15, 13, 11, 9, 7, 5, 3 or 1 */

                    subs                \rD,r0,r1           /* [1] subtract the value of the last bit from the result */

                    .endm                                   /* (26) this macro always uses 26 clock cycles */

We're so lucky this time, that we can supply the macro with two registers, a destination register and a source register, so we can use it like this:

_clz r4,r7

The leading zeroes in r7 will be counted and the result will be placed in r4.

You can also create a subroutine, if you need to use the functionality from C:

                    FUNCTION            _clz

                    _clz                                    /* [26] count the bits in r0 */

                    bx                  lr                  /* [3] return (29 clock cycles) */

                    ENDFUNC             _qclz

The calling instruction will cost 4 or 5 extra clock cycles, depending on how it's done. C-compilers usually pick the more expensive solution. Alright, we're now able to count 32 bits in less than 32 clock cycles; that's quicker than some implementations of the MUL instruction.

But, we can do better than that!

Pushing it

Let's combine the above with a table-lookup. This will cost you!

The price is 256 bytes for a fast solution; and today we want it fast, otherwise we would be satisfied with the above macro, right? Well, the 16-byte table would be useful if you're running out of space and still need to count bits often.

Let's have a look at the lookup-table solution. We'll start out counting the first 16 bits, then 8 bits as we did above.

                    .macro              _qclz               /* this macro will destroy r1 and r2 */

                    movs                r1,#32              /* [1] we have 32 bits in a 32-bit word as usual */

                    lsrs                r2,r0,#16           /* [1] test high 16 bits */

                    beq                 skip16\@            /* [1/3] if zero, jump forward */

                    subs                r1,r1,#16           /* [1] otherwise decrement our counter (result) */

                    mov                 r0,r2               /* [1] and keep new value */

skip16\@:           lsrs                r2,r0,#8            /* [1] test bits 31..23 or 15..8 */

                    beq                 skip8\@             /* [1/3] if zero, jump forward */

                    subs                r1,r1,#8            /* [1] otherwise decrement our counter (result) */

                    mov                 r0,r2               /* [1] and keep new value */

skip8\@:            ldr                 r2,=_qclz_table     /* [2] point to our look-up table */

                    ldrb                r0,[r2,r0]          /* [2] convert byte to a count between 0 and 8 */

                    adds                r0,r0,r1            /* [1] add the number of bits we've already counted (8, 16 or 24) */

                    .endm                                   /* (14) this macro always uses 14 clock cycles */

This time, I didn't make the macro so convenient that it takes parameters. The reason is that r1 is destroyed, so if supplying r1 as parameter, we would not get the desired result. It is possible, though, to write the macro, so it tests if we're using r1, and then it could swap the use of r1 and r2; but it would ruin the clarity of this document.

If you want to make the macro check which register is used, you can use the '.ifc' directive. It goes something like this:

                    .ifc                "\rS","r1"          /* (if the parameter is r1) */

                    ...

                    .else                                   /* (the parameter is some other register) */

                    ...

                    .endif

The _qclz macro uses only 14 clock cycles; personally, I find that quite acceptable when comparing to the original 'worst case' 162 cycles.

OK, now all, that we need is a table (a desk won't do). Let's create one. We'll start by creating a macro that makes the task easier...

                    .macro              rep_byte byte,count

                    .rept               \count

                    .byte               \byte

                    .endr

                    .endm

The above macro simply repeats the specified byte <count> times.

Let's generate the table then; since we're already working with macros, we'll continue, so you can put them in a .i include file, that they may be re-used in different code...

                    .macro              _qclz_table

_qclz_table:        rep_byte            8,1                 /* table entry 0 */

                    rep_byte            7,1                 /* table entry 1 */

                    rep_byte            6,2                 /* table entry 2 */

                    rep_byte            5,4                 /* table entry 3 */

                    rep_byte            4,8                 /* table entry 4 */

                    rep_byte            3,16                /* table entry 5 */

                    rep_byte            2,32                /* table entry 6 */

                    rep_byte            1,64                /* table entry 7 */

                    rep_byte            0,128               /* table entry 8 */

                    .endm

In order to use the qclz, you can instantiate the macro as a subroutine, or you can use it directly.

Let's imagine that you need to call it from C, so you can do the following:

                    FUNCTION            _qclz

                    _qclz                                   /* [14] count the bits in r0 */

                    bx                  lr                  /* [3] return (17 clock cycles) */

                    .pool                                   /* we'll need the literal pool close to the function */

                    _qclz_table                             /* we'll insert the table here */

                    .align                                  /* let the assembler insert any necessary padding */

                    ENDFUNC             _qclz

The literal pool might not be necessary, depending on your assembler, because the ldr r2,=_qclz_table can be optimized into an adr instruction.

If using the _qclz subroutine, a call to it using the bl instruction would cost 4 extra clock cycles, so that would be a total of 21 clock cycles.

One more

You can easily create a Count Leading Ones by inverting all 32 bits in the input word.

                    .macro              _qclo               /* this macro will destroy r1 and r2 */

                    mvns                r0,r0               /* [1] flip all bits in r0 */

                    _qclz                                   /* [14] count the leading zero bits (which were actually ones) */

                    .endm

Remember you can also make a complete duplicate of the _qclz macro and modify it, in order to save the last clock cycle if needed.

Trailer

You can make a check_bits_l, which uses lsls instead of lsrs, so you can make a Count Trailing Zeroes (_ctz) / Count Trailing Ones (_cto) macro. You can make a _qctz or _qcto implementation that has a lsls instead of the lsrs. You would also need to change the table. Saving space is nice, but I recommend writing 4 optimized macros, that you have ready always.

Some of you may want the full 'library' containing a 256-byte look-up table, a 16-byte look-up table and 3 sets of 4 macros. The reason I am not providing such a full library as an attachment here, is that noone will learn from just including an existing file. It's much more fun to be a part of it yourself and understand each detail of what's going on.

Shoichi Kojima over 2 years ago in reply to Sean Dunlevy

Multiplying value seems to be wrong. That would be 0xFD7049FFUL
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Sean Dunlevy over 4 years ago

I have a platform-specific optimization. The bottom of memory of the M0 has the vectors but then a space. If you can place the start of the 256 byte lookup table at an address below $100, it can be set up using a MOVS Rd,#<imm> instruction. Placing the lookup table somewhere else also means that if you use CLZ a lot, you don't end up needing to place it in multiple literal-pools so it saves a cycle and however many duplicates you would need.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Jens Bauer over 6 years ago in reply to Jens Bauer

For anyone having problems, I can confirm that what Ling LI says is correct. r1 must be initially loaded with #24, not #32.

Unfortunately I can not edit the post, so I cannot correct the article myself.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Sean Dunlevy over 6 years ago

I have been working on this little issue for a while. When you have a tiny memory budget, 256 bytes may result in something not fitting in-line thus needing a subroutine and of course using cycles. I noted this in the paper version of the original 'Hacker's Delight'. The work of Julius Goryavsky on Harley's algorithm.

static char table [64] =
{32,20,19,0,0,18,0,7,10,17,0,0,14,0,06,0,
0,9,0,16,0,0,01,26,0,13,0,0,24,05,0,0,
0,21,0,8,11,0,15,0,0,0,0,2,27,0,25,0,
22,0,12,0,0,3,28,0,23,0,4,29,0,0,30,31}

x = x &∼ (x>>16);
x = x * oxFD7039FF

return table[x >> 26);

I'm a pure assembly-language coder but it strikes me that the VERY clever Count_Trailing_Zeros is related to this using, I suspect, a De Bruijn-like sequence. I was attracted by the 64 byte table. I don't know how much ASM people use but the M0 has it's own zero-page like properties. If an address can be loaded in a single cycle, the code lookup gets quicker. A 32-byte table for the CLZ (or 256 if this doesn't work) and possible other strategies to use the entire bus bandwidth. I'm working with an M0+ with the 1-cycle multiply and a 64-byte cache so code size (and the memory it accesses) is important because I need 2 DMAs working (MicroSD -> RAM and RAM -> DAC. Obviously the smaller the lookup, the more likely I can place it on the Zero-Page and not call a subroutine i.e. keep it in-line.

Many thanks!
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Jens Bauer over 7 years ago

lix2ng wrote:
_qclz should load initial R1 with #24, since the table entry (for residual number of LZs) is added at the end. Correct?
Well spotted. I don't know how this could get past my eyes, but before I make any corrections, I'll make sure that I didn't intend to use 'subs' instead of 'adds' at the end.
The GCC code is absolutely great, but sometimes it will pay to care for details, because the gain can be great. The smaller a snippet of repeated code gets, the more a single clock cycle means.
btw2 if it's made a routine: replacing LDR with ADR to load the table address can save a cycle and a literal (gcc version does this).
You're right, but you should also know that the GNU assembler will automatically use ADR if the table is near.
I've used LDR in order to make the code a little more flexible, since the optimization should be automatic.
(in fact, "LDR=" is nothing but a pseudo-instruction, so it's not even a real LDR instruction).
-But don't take my word for it; verify the code, so you're sure.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Architectures and Processors blog

Deep dive into the PMU value of L2D_CACHE_WR on the Neoverse N2 server

Ker Liu

In-depth analysis of what the PMU of L2D_CACHE_WR counts on the Neoverse N2 server.
- April 15, 2024
Arm SPE: SoC Telemetry & Performance Analysis using Statistical Profiling Extension

Brian Jeff

We refer to the SPE performance methodology whitepaper published by Arm for details on the content of this blog.
- December 8, 2023
Implementing the WebAssembly bitmask operations on the 64-bit Arm architecture

Anton Kirilov

We discuss some of the challenges that we face when we are trying to implement the WebAssembly SIMD bitmask operations on the 64-bit Arm architecture.
- December 6, 2023

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

A fairly quick Count Leading Zeroes for Cortex-M0

The Basics

Optimization

Pushing it

One more

Trailer

Deep dive into the PMU value of L2D_CACHE_WR on the Neoverse N2 server

Arm SPE: SoC Telemetry & Performance Analysis using Statistical Profiling Extension

Implementing the WebAssembly bitmask operations on the 64-bit Arm architecture