A fairly quick Count Leading Zeroes for Cortex-M0

The Basics

Some of us need to find out how many leading zero-bits there are in a 32-bit word.

Such a feature is useful on many occasions, especially when writing a fast divide subroutine.

The Cortex-M3 and later have a CLZ instruction which can count those leading zeroes.

But as you probably already know, the Cortex-M0 does not have an instruction for this.

Instead you must use either a provided subroutine (usually written in C) or write your own.

Let's try rolling our own, shall we ?

First, let's define a few macros, so that we can make our source-code look nice and tidy.

It is a good idea to use unified syntax; start with this line at the top of your file, in order to make the assembler automatically create IT instructions:

.syntax             unified

The macros I'm writing in this document are in the GNU-Assembler (GAS) style, but if you're using a different assembler, I think they will not be too difficult to convert:

.macro              FUNCTION name

.global             \name

.func               \name,\name

.type               \name,%function

.thumb_func

.\name\():

.endm

.macro              ENDFUNC name

.pool

.size               \name,. - \name

.endfunc

.endm

A simple 'naive' solution would be to count each bit. While this is slow, it's actually very short, so it's worth mentioning.

If you've read any of my earlier documents, you will know that I tend to mention the number of clock cycles that an instruction uses. Those are the numbers in square brackets.

FUNCTION            _clz_naive          /* in: r0 = 32-bit word to count leading zero bits in. r1=destroyed. */

movs                r1,#32              /* [1] assume we have 32 zero bits */

tst                 r0,r0               /* [1] check if r0 is zero */

beq                 clz_done\@          /* [1/3] if it's zero, we're done */

clz_loop\@:         subs                r1,r1,#1            /* [1] decrement leading zero bit count */

lsrs                r0,r0,#1            /* [1] shift r0 to the right */

bne                 clz_loop\@          /* [1/3] if r0 didn't become zero yet, have another go at it */

clz_done\@:         movs                r0,r1               /* [1] transfer result to r0 */

ENDFUNC             _clz_naive          /* out: result. Total CPU time spent: between 6 and 162 clock cycles) */

Worst case for the above snippet is 162 clock cycles, not bad, huh ?

We can do better than that!

Optimization

If you've written a routine to count leading zeroes, you are probably familiar with a couple of shortcuts.

It's common to use logical AND to check if the first 16 bits are clear, if so, we can save a lot of clock-cycles on testing the first 16 bits alone.

In the same way, the next 8 bits are tested, then 4 bits, 2 bits and 1 bit.

But let's also remember that we're using an ARM microcontroller here.

As you might know, ARM microcontrollers are pretty fond of bit-shifting.

Can we take advantage of this ?

Let's try writing a macro, which checks if the top N bits are zero:

.macro              check_bits_t bits

lsrs                r2,r1,#\bits        /* [1] test next N bits. This updates the Z flag */

beq                 skip\@              /* [1/3] if all the bits were zero, jump forward */

subs                r0,r0,#\bits        /* [1] otherwise decrement our counter (the result) */

mov                 r1,r2               /* [1] and keep the new value */

skip\@:

.endm                                   /* (4) this macro will always use 4 clock cycles */

So what we do here, is that we take the word containing our bits to test, shift it to the right N times. Let's assume that's 16 times, so a 32-bit word like this ...

...would look like this after shifting it...

00000000000000000010100101010111   (0x00002957)

...that means we're testing if all of the high 16 bits are zero. It seems we weren't that lucky this time, though.

But we've still saved a lot of time. We now know that we have less than 16 leading zeroes, correct ?

So we subtract 16 from our counter.

Since the high 16 bits did not contain all zeroes, we'll also keep the shifted value, so we can count those bits instead of the low 16 bits.

Had the high 16 bits been zero, we would want to get the original value with the low 16 bits intact.

We can now use this macro a couple of times, in order to count all our bits.

Let's have a go at it.

First of all, the above macro wants r1 to contain the word we want to count bits in.

It also wants r0 to be the counter. As we're subtracting from the counter, we'll start it at 32, because there are 32 bits in a 32-bit word, right ?

.macro              _clz rD,rS          /* count leading zeroes in rS, result is in rD. Destroys r1 and r2 */

movs                r1,\rS              /* [1] transfer word to count to r1 */

movs                r0,#32              /* [1] we have 32 bits in a 32-bit word, remember ? */

check_bits_t        16                  /* [4] test first 16 bits */

check_bits_t        8                   /* [4] test bits 31..24 or 15..8 */

check_bits_t        4                   /* [4] test bits 31..28, 23..20, 15..12 or 7..4 */

check_bits_t        2                   /* [4] test bits 31..30, 27..26, 23..22, 19..18, 15..14, 11..10, 7..6 or 3..2 */

check_bits_t        1                   /* [4] test bits 31, 29, 27, 25, 23, 21, 19, 17, 15, 13, 11, 9, 7, 5, 3 or 1 */

subs                \rD,r0,r1           /* [1] subtract the value of the last bit from the result */

.endm                                   /* (26) this macro always uses 26 clock cycles */

We're so lucky this time, that we can supply the macro with two registers, a destination register and a source register, so we can use it like this:

_clz                r4,r7

The leading zeroes in r7 will be counted and the result will be placed in r4.

You can also create a subroutine, if you need to use the functionality from C:

FUNCTION            _clz

_clz                                    /* [26] count the bits in r0 */

bx                  lr                  /* [3] return (29 clock cycles) */

ENDFUNC             _qclz

The calling instruction will cost 4 or 5 extra clock cycles, depending on how it's done. C-compilers usually pick the more expensive solution.

Alright, we're now able to count 32 bits in less than 32 clock cycles; that's quicker than some implementations of the MUL instruction.

But ...

We can do better than that!

Pushing it

Let's combine the above with a table-lookup. This will cost you!

The price is 256 bytes for a fast solution; and today we want it fast, otherwise we would be satisfied with the above macro, right ?

-Well, the 16-byte table would be useful if you're running out of space and still need to count bits often.

Let's have a look at the lookup-table solution.

We'll start out counting the first 16 bits, then 8 bits as we did above.

.macro              _qclz               /* this macro will destroy r1 and r2 */

movs                r1,#32              /* [1] we have 32 bits in a 32-bit word as usual */

lsrs                r2,r0,#16           /* [1] test high 16 bits */

beq                 skip16\@            /* [1/3] if zero, jump forward */

subs                r1,r1,#16           /* [1] otherwise decrement our counter (result) */

mov                 r0,r2               /* [1] and keep new value */

skip16\@:           lsrs                r2,r0,#8            /* [1] test bits 31..23 or 15..8 */

beq                 skip8\@             /* [1/3] if zero, jump forward */

subs                r1,r1,#8            /* [1] otherwise decrement our counter (result) */

mov                 r0,r2               /* [1] and keep new value */

skip8\@:            ldr                 r2,=_qclz_table     /* [2] point to our look-up table */

ldrb                r0,[r2,r0]          /* [2] convert byte to a count between 0 and 8 */

adds                r0,r0,r1            /* [1] add the number of bits we've already counted (8, 16 or 24) */

.endm                                   /* (14) this macro always uses 14 clock cycles */

This time, I didn't make the macro so convenient that it takes parameters. The reason is that r1 is destroyed, so if supplying r1 as parameter, we would not get the desired result.

It is possible, though, to write the macro, so it tests if we're using r1, and then it could swap the use of r1 and r2; but it would ruin the clarity of this document.

If you want to make the macro check which register is used, you can use the '.ifc' directive. It goes something like this:

.ifc                "\rS","r1"          /* (if the parameter is r1) */

...

.else                                   /* (the parameter is some other register) */

...

.endif

The _qclz macro uses only 14 clock cycles; personally, I find that quite acceptable when comparing to the original 'worst case' 162 cycles.

OK, now all, that we need is a table (a desk won't do). Let's create one. We'll start by creating a macro that makes the task easier...

.macro              rep_byte byte,count

.rept               \count

.byte               \byte

.endr

.endm

The above macro simply repeats the specified byte <count> times.

Let's generate the table then; since we're already working with macros, we'll continue, so you can put them in a .i include file, that they may be re-used in different code...

.macro              _qclz_table

_qclz_table:        rep_byte            8,1                 /* table entry 0 */

rep_byte            7,1                 /* table entry 1 */

rep_byte            6,2                 /* table entry 2 */

rep_byte            5,4                 /* table entry 3 */

rep_byte            4,8                 /* table entry 4 */

rep_byte            3,16                /* table entry 5 */

rep_byte            2,32                /* table entry 6 */

rep_byte            1,64                /* table entry 7 */

rep_byte            0,128               /* table entry 8 */

.endm

In order to use the qclz, you can instantiate the macro as a subroutine, or you can use it directly.

Let's imagine that you need to call it from C, so you can do the following:

FUNCTION            _qclz

_qclz                                   /* [14] count the bits in r0 */

bx                  lr                  /* [3] return (17 clock cycles) */

.pool                                   /* we'll need the literal pool close to the function */

_qclz_table                             /* we'll insert the table here */

.align                                  /* let the assembler insert any necessary padding */

ENDFUNC             _qclz

The literal pool might not be necessary, depending on your assembler, because the ldr r2,=_qclz_table can be optimized into an adr instruction.

If using the _qclz subroutine, a call to it using the bl instruction would cost 4 extra clock cycles, so that would be a total of 21 clock cycles.

One more

You can easily create a Count Leading Ones by inverting all 32 bits in the input word.

.macro              _qclo               /* this macro will destroy r1 and r2 */

mvns                r0,r0               /* [1] flip all bits in r0 */

_qclz                                   /* [14] count the leading zero bits (which were actually ones) */

.endm

-Remember you can also make a complete duplicate of the _qclz macro and modify it, in order to save the last clock cycle if needed.

Trailer

You can make a check_bits_l, which uses lsls instead of lsrs, so you can make a Count Trailing Zeroes (_ctz) / Count Trailing Ones (_cto) macro.

You can make a _qctz or _qcto implementation that has a lsls instead of the lsrs. You would also need to change the table.

Saving space is nice, but I recommend writing 4 optimized macros, that you have ready always.

Some of you may want the full 'library' containing a 256-byte look-up table, a 16-byte look-up table and 3 sets of 4 macros.

The reason I am not providing such a full library as an attachment here, is that noone will learn from just including an existing file.

It's much more fun to be a part of it yourself and understand each detail of what's going on.

Related articles

• I have been working on this little issue for a while. When you have a tiny memory budget, 256 bytes may result in something not fitting in-line thus needing a subroutine and of course using cycles. I noted this in the paper version of the original 'Hacker's Delight'. The work of Julius Goryavsky on Harley's algorithm.

static char table [64] =
{32,20,19,0,0,18,0,7,10,17,0,0,14,0,06,0,
0,9,0,16,0,0,01,26,0,13,0,0,24,05,0,0,
0,21,0,8,11,0,15,0,0,0,0,2,27,0,25,0,
22,0,12,0,0,3,28,0,23,0,4,29,0,0,30,31}

x = x &∼ (x>>16);
x = x * oxFD7039FF

return table[x >> 26);

I'm a pure assembly-language coder but it strikes me that the VERY clever Count_Trailing_Zeros is related to this using, I suspect, a De Bruijn-like sequence. I was attracted by the 64 byte table. I don't know how much ASM people use but the M0 has it's own zero-page like properties. If an address can be loaded in a single cycle, the code lookup gets quicker. A 32-byte table for the CLZ (or 256 if this doesn't work) and possible other strategies to use the entire bus bandwidth. I'm working with an M0+ with the 1-cycle multiply and a 64-byte cache so code size (and the memory it accesses) is important because I need 2 DMAs working (MicroSD -> RAM and RAM -> DAC. Obviously the smaller the lookup, the more likely I can place it on the Zero-Page and not call a subroutine i.e. keep it in-line.

Many thanks!

• lix2ng wrote:

_qclz should load initial R1 with #24, since the table entry (for residual number of LZs) is added at the end. Correct?

Well spotted. I don't know how this could get past my eyes, but before I make any corrections, I'll make sure that I didn't intend to use 'subs' instead of 'adds' at the end.

The GCC code is absolutely great, but sometimes it will pay to care for details, because the gain can be great. The smaller a snippet of repeated code gets, the more a single clock cycle means.

btw2 if it's made a routine: replacing LDR with ADR to load the table address can save a cycle and a literal (gcc version does this).

You're right, but you should also know that the GNU assembler will automatically use ADR if the table is near.

I've used LDR in order to make the code a little more flexible, since the optimization should be automatic.

(in fact, "LDR=" is nothing but a pseudo-instruction, so it's not even a real LDR instruction).

-But don't take my word for it; verify the code, so you're sure.

• _qclz should load initial R1 with #24, since the table entry (for residual number of LZs) is added at the end. Correct?

EDIT: btw this code is better than the library routine in ARM/embedded gcc (5.4.1), which uses a sliding bit tester and a 16 entry LUT. It's 5 ins per bisection.

btw2 if it's made a routine: replacing LDR with ADR to load the table address can save a cycle and a literal (gcc version does this).

• Thank you.

-It was a real joy writing the above code. I like to keep moving things around, until it becomes possible to save another clock-cycle.

For routines / snippets like this one, I think it really pays to give it some extra care, so that it can be re-used without needing to modify it.

I sometimes miss some of the ARM7/ARM9/ARM11 second-operand features; especially LDR rT,[rB,rI,LSR#16] - and also the ability to subtract the index register from the base register.

It would be real neat to have the LSR on the LDR instruction for fixed-point interpolation.

Subtracting the index register would be particularly useful when writing decompression routines, but also when copying blocks of data manually.

• Deploy State-of-the-Art Deep Learning on Edge Devices in Minutes

Deploying advanced deep learning algorithms on edge devices, especially for computer vision applications like autonomous vehicles and IoT, requires special capabilities.
• New online training course – An Introduction to Armv8-M

We are pleased to announce a new online training topic: An Introduction to Armv8-M. This is the latest addition to our introduction to Arm the Arm architecture series.
• Open standards: Enabling high performance infrastructure systems with CCIX

Designers looking for an open HP cache-coherent chip to chip accelerator interface with an easy migration path from today’s PCIe should consider CCIX for their next SoC.