Some of us need to find out how many leading zero-bits there are in a 32-bit word. Such a feature is useful on many occasions, especially when writing a fast divide subroutine.
The Cortex-M3 and later have a CLZ instruction which can count those leading zeroes. But as you probably already know, the Cortex-M0 does not have an instruction for this. Instead you must use either a provided subroutine (usually written in C) or write your own.
Let's try rolling our own, shall we ?
First, let's define a few macros, so that we can make our source-code look nice and tidy.
It is a good idea to use unified syntax; start with this line at the top of your file, in order to make the assembler automatically create IT instructions:
.syntax unified
.macro FUNCTION name .global \name .func \name,\name .type \name,%function .thumb_func
\name\(): .endm .macro ENDFUNC name .pool .size \name,. - \name .endfunc .endm
A simple 'naive' solution would be to count each bit. While this is slow, it's actually very short, so it's worth mentioning.
If you've read any of my earlier documents, you will know that I tend to mention the number of clock cycles that an instruction uses. Those are the numbers in square brackets.
FUNCTION _clz_naive /* in: r0 = 32-bit word to count leading zero bits in. r1=destroyed. */ movs r1,#32 /* [1] assume we have 32 zero bits */ tst r0,r0 /* [1] check if r0 is zero */ beq clz_done\@ /* [1/3] if it's zero, we're done */ clz_loop\@: subs r1,r1,#1 /* [1] decrement leading zero bit count */ lsrs r0,r0,#1 /* [1] shift r0 to the right */ bne clz_loop\@ /* [1/3] if r0 didn't become zero yet, have another go at it */ clz_done\@: movs r0,r1 /* [1] transfer result to r0 */ ENDFUNC _clz_naive /* out: result. Total CPU time spent: between 6 and 162 clock cycles) */
Worst case for the above snippet is 162 clock cycles, not bad, huh? We can do better than that!
If you've written a routine to count leading zeroes, you are probably familiar with a couple of shortcuts. It's common to use logical AND to check if the first 16 bits are clear, if so, we can save a lot of clock-cycles on testing the first 16 bits alone.
In the same way, the next 8 bits are tested, then 4 bits, 2 bits and 1 bit. But let's also remember that we're using an ARM microcontroller here. As you might know, ARM microcontrollers are pretty fond of bit-shifting.
Can we take advantage of this? Let's try writing a macro, which checks if the top N bits are zero:
.macro check_bits_t bits lsrs r2,r1,#\bits /* [1] test next N bits. This updates the Z flag */ beq skip\@ /* [1/3] if all the bits were zero, jump forward */ subs r0,r0,#\bits /* [1] otherwise decrement our counter (the result) */ mov r1,r2 /* [1] and keep the new value */ skip\@: .endm /* (4) this macro will always use 4 clock cycles */
So what we do here, is that we take the word containing our bits to test, shift it to the right N times. Let's assume that's 16 times, so a 32-bit word like this ...
00101001010101111010110101001111 (0x2957ad4f)
...would look like this after shifting it...
00000000000000000010100101010111 (0x00002957)
...that means we're testing if all of the high 16 bits are zero. It seems we weren't that lucky this time, though.
But we've still saved a lot of time. We now know that we have less than 16 leading zeroes, correct ? So we subtract 16 from our counter.
Since the high 16 bits did not contain all zeroes, we'll also keep the shifted value, so we can count those bits instead of the low 16 bits. Had the high 16 bits been zero, we would want to get the original value with the low 16 bits intact.
We can now use this macro a couple of times, in order to count all our bits. Let's have a go at it.
First of all, the above macro wants r1 to contain the word we want to count bits in. It also wants r0 to be the counter. As we're subtracting from the counter, we'll start it at 32, because there are 32 bits in a 32-bit word, right ?
.macro _clz rD,rS /* count leading zeroes in rS, result is in rD. Destroys r1 and r2 */ movs r1,\rS /* [1] transfer word to count to r1 */ movs r0,#32 /* [1] we have 32 bits in a 32-bit word, remember ? */ check_bits_t 16 /* [4] test first 16 bits */ check_bits_t 8 /* [4] test bits 31..24 or 15..8 */ check_bits_t 4 /* [4] test bits 31..28, 23..20, 15..12 or 7..4 */ check_bits_t 2 /* [4] test bits 31..30, 27..26, 23..22, 19..18, 15..14, 11..10, 7..6 or 3..2 */ check_bits_t 1 /* [4] test bits 31, 29, 27, 25, 23, 21, 19, 17, 15, 13, 11, 9, 7, 5, 3 or 1 */ subs \rD,r0,r1 /* [1] subtract the value of the last bit from the result */ .endm /* (26) this macro always uses 26 clock cycles */
We're so lucky this time, that we can supply the macro with two registers, a destination register and a source register, so we can use it like this:
_clz r4,r7
The leading zeroes in r7 will be counted and the result will be placed in r4.
You can also create a subroutine, if you need to use the functionality from C:
FUNCTION _clz _clz /* [26] count the bits in r0 */ bx lr /* [3] return (29 clock cycles) */ ENDFUNC _qclz
The calling instruction will cost 4 or 5 extra clock cycles, depending on how it's done. C-compilers usually pick the more expensive solution. Alright, we're now able to count 32 bits in less than 32 clock cycles; that's quicker than some implementations of the MUL instruction.
But, we can do better than that!
Let's combine the above with a table-lookup. This will cost you!
The price is 256 bytes for a fast solution; and today we want it fast, otherwise we would be satisfied with the above macro, right? Well, the 16-byte table would be useful if you're running out of space and still need to count bits often.
Let's have a look at the lookup-table solution. We'll start out counting the first 16 bits, then 8 bits as we did above.
.macro _qclz /* this macro will destroy r1 and r2 */ movs r1,#32 /* [1] we have 32 bits in a 32-bit word as usual */ lsrs r2,r0,#16 /* [1] test high 16 bits */ beq skip16\@ /* [1/3] if zero, jump forward */ subs r1,r1,#16 /* [1] otherwise decrement our counter (result) */ mov r0,r2 /* [1] and keep new value */ skip16\@: lsrs r2,r0,#8 /* [1] test bits 31..23 or 15..8 */ beq skip8\@ /* [1/3] if zero, jump forward */ subs r1,r1,#8 /* [1] otherwise decrement our counter (result) */ mov r0,r2 /* [1] and keep new value */ skip8\@: ldr r2,=_qclz_table /* [2] point to our look-up table */ ldrb r0,[r2,r0] /* [2] convert byte to a count between 0 and 8 */ adds r0,r0,r1 /* [1] add the number of bits we've already counted (8, 16 or 24) */ .endm /* (14) this macro always uses 14 clock cycles */
This time, I didn't make the macro so convenient that it takes parameters. The reason is that r1 is destroyed, so if supplying r1 as parameter, we would not get the desired result. It is possible, though, to write the macro, so it tests if we're using r1, and then it could swap the use of r1 and r2; but it would ruin the clarity of this document.
If you want to make the macro check which register is used, you can use the '.ifc' directive. It goes something like this:
.ifc "\rS","r1" /* (if the parameter is r1) */ ... .else /* (the parameter is some other register) */ ... .endif
The _qclz macro uses only 14 clock cycles; personally, I find that quite acceptable when comparing to the original 'worst case' 162 cycles.
OK, now all, that we need is a table (a desk won't do). Let's create one. We'll start by creating a macro that makes the task easier...
.macro rep_byte byte,count .rept \count .byte \byte .endr .endm
The above macro simply repeats the specified byte <count> times.
Let's generate the table then; since we're already working with macros, we'll continue, so you can put them in a .i include file, that they may be re-used in different code...
.macro _qclz_table _qclz_table: rep_byte 8,1 /* table entry 0 */ rep_byte 7,1 /* table entry 1 */ rep_byte 6,2 /* table entry 2 */ rep_byte 5,4 /* table entry 3 */ rep_byte 4,8 /* table entry 4 */ rep_byte 3,16 /* table entry 5 */ rep_byte 2,32 /* table entry 6 */ rep_byte 1,64 /* table entry 7 */ rep_byte 0,128 /* table entry 8 */ .endm
In order to use the qclz, you can instantiate the macro as a subroutine, or you can use it directly.
Let's imagine that you need to call it from C, so you can do the following:
FUNCTION _qclz _qclz /* [14] count the bits in r0 */ bx lr /* [3] return (17 clock cycles) */ .pool /* we'll need the literal pool close to the function */ _qclz_table /* we'll insert the table here */ .align /* let the assembler insert any necessary padding */ ENDFUNC _qclz
The literal pool might not be necessary, depending on your assembler, because the ldr r2,=_qclz_table can be optimized into an adr instruction.
If using the _qclz subroutine, a call to it using the bl instruction would cost 4 extra clock cycles, so that would be a total of 21 clock cycles.
You can easily create a Count Leading Ones by inverting all 32 bits in the input word.
.macro _qclo /* this macro will destroy r1 and r2 */ mvns r0,r0 /* [1] flip all bits in r0 */ _qclz /* [14] count the leading zero bits (which were actually ones) */ .endm
Remember you can also make a complete duplicate of the _qclz macro and modify it, in order to save the last clock cycle if needed.
You can make a check_bits_l, which uses lsls instead of lsrs, so you can make a Count Trailing Zeroes (_ctz) / Count Trailing Ones (_cto) macro. You can make a _qctz or _qcto implementation that has a lsls instead of the lsrs. You would also need to change the table. Saving space is nice, but I recommend writing 4 optimized macros, that you have ready always.
Some of you may want the full 'library' containing a 256-byte look-up table, a 16-byte look-up table and 3 sets of 4 macros. The reason I am not providing such a full library as an attachment here, is that noone will learn from just including an existing file. It's much more fun to be a part of it yourself and understand each detail of what's going on.
_qclz should load initial R1 with #24, since the table entry (for residual number of LZs) is added at the end. Correct?
EDIT: btw this code is better than the library routine in ARM/embedded gcc (5.4.1), which uses a sliding bit tester and a 16 entry LUT. It's 5 ins per bisection.
btw2 if it's made a routine: replacing LDR with ADR to load the table address can save a cycle and a literal (gcc version does this).
Thank you.
-It was a real joy writing the above code. I like to keep moving things around, until it becomes possible to save another clock-cycle.
For routines / snippets like this one, I think it really pays to give it some extra care, so that it can be re-used without needing to modify it.
I sometimes miss some of the ARM7/ARM9/ARM11 second-operand features; especially LDR rT,[rB,rI,LSR#16] - and also the ability to subtract the index register from the base register.
It would be real neat to have the LSR on the LDR instruction for fixed-point interpolation.
Subtracting the index register would be particularly useful when writing decompression routines, but also when copying blocks of data manually.
Nice stuff! Brings back lots of memories hand-coding DSP for ARM9 and ARM11 cores - that log N divide and conquer is a generally useful technique for all sorts of operations along these lines so a good one to have in the toolbox.