Some of us need to find out how many leading zero-bits there are in a 32-bit word. Such a feature is useful on many occasions, especially when writing a fast divide subroutine.
The Cortex-M3 and later have a CLZ instruction which can count those leading zeroes. But as you probably already know, the Cortex-M0 does not have an instruction for this. Instead you must use either a provided subroutine (usually written in C) or write your own.
Let's try rolling our own, shall we ?
First, let's define a few macros, so that we can make our source-code look nice and tidy.
It is a good idea to use unified syntax; start with this line at the top of your file, in order to make the assembler automatically create IT instructions:
.syntax unified
.macro FUNCTION name .global \name .func \name,\name .type \name,%function .thumb_func
\name\(): .endm .macro ENDFUNC name .pool .size \name,. - \name .endfunc .endm
A simple 'naive' solution would be to count each bit. While this is slow, it's actually very short, so it's worth mentioning.
If you've read any of my earlier documents, you will know that I tend to mention the number of clock cycles that an instruction uses. Those are the numbers in square brackets.
FUNCTION _clz_naive /* in: r0 = 32-bit word to count leading zero bits in. r1=destroyed. */ movs r1,#32 /* [1] assume we have 32 zero bits */ tst r0,r0 /* [1] check if r0 is zero */ beq clz_done\@ /* [1/3] if it's zero, we're done */ clz_loop\@: subs r1,r1,#1 /* [1] decrement leading zero bit count */ lsrs r0,r0,#1 /* [1] shift r0 to the right */ bne clz_loop\@ /* [1/3] if r0 didn't become zero yet, have another go at it */ clz_done\@: movs r0,r1 /* [1] transfer result to r0 */ ENDFUNC _clz_naive /* out: result. Total CPU time spent: between 6 and 162 clock cycles) */
Worst case for the above snippet is 162 clock cycles, not bad, huh? We can do better than that!
If you've written a routine to count leading zeroes, you are probably familiar with a couple of shortcuts. It's common to use logical AND to check if the first 16 bits are clear, if so, we can save a lot of clock-cycles on testing the first 16 bits alone.
In the same way, the next 8 bits are tested, then 4 bits, 2 bits and 1 bit. But let's also remember that we're using an ARM microcontroller here. As you might know, ARM microcontrollers are pretty fond of bit-shifting.
Can we take advantage of this? Let's try writing a macro, which checks if the top N bits are zero:
.macro check_bits_t bits lsrs r2,r1,#\bits /* [1] test next N bits. This updates the Z flag */ beq skip\@ /* [1/3] if all the bits were zero, jump forward */ subs r0,r0,#\bits /* [1] otherwise decrement our counter (the result) */ mov r1,r2 /* [1] and keep the new value */ skip\@: .endm /* (4) this macro will always use 4 clock cycles */
So what we do here, is that we take the word containing our bits to test, shift it to the right N times. Let's assume that's 16 times, so a 32-bit word like this ...
00101001010101111010110101001111 (0x2957ad4f)
...would look like this after shifting it...
00000000000000000010100101010111 (0x00002957)
...that means we're testing if all of the high 16 bits are zero. It seems we weren't that lucky this time, though.
But we've still saved a lot of time. We now know that we have less than 16 leading zeroes, correct ? So we subtract 16 from our counter.
Since the high 16 bits did not contain all zeroes, we'll also keep the shifted value, so we can count those bits instead of the low 16 bits. Had the high 16 bits been zero, we would want to get the original value with the low 16 bits intact.
We can now use this macro a couple of times, in order to count all our bits. Let's have a go at it.
First of all, the above macro wants r1 to contain the word we want to count bits in. It also wants r0 to be the counter. As we're subtracting from the counter, we'll start it at 32, because there are 32 bits in a 32-bit word, right ?
.macro _clz rD,rS /* count leading zeroes in rS, result is in rD. Destroys r1 and r2 */ movs r1,\rS /* [1] transfer word to count to r1 */ movs r0,#32 /* [1] we have 32 bits in a 32-bit word, remember ? */ check_bits_t 16 /* [4] test first 16 bits */ check_bits_t 8 /* [4] test bits 31..24 or 15..8 */ check_bits_t 4 /* [4] test bits 31..28, 23..20, 15..12 or 7..4 */ check_bits_t 2 /* [4] test bits 31..30, 27..26, 23..22, 19..18, 15..14, 11..10, 7..6 or 3..2 */ check_bits_t 1 /* [4] test bits 31, 29, 27, 25, 23, 21, 19, 17, 15, 13, 11, 9, 7, 5, 3 or 1 */ subs \rD,r0,r1 /* [1] subtract the value of the last bit from the result */ .endm /* (26) this macro always uses 26 clock cycles */
We're so lucky this time, that we can supply the macro with two registers, a destination register and a source register, so we can use it like this:
_clz r4,r7
The leading zeroes in r7 will be counted and the result will be placed in r4.
You can also create a subroutine, if you need to use the functionality from C:
FUNCTION _clz _clz /* [26] count the bits in r0 */ bx lr /* [3] return (29 clock cycles) */ ENDFUNC _qclz
The calling instruction will cost 4 or 5 extra clock cycles, depending on how it's done. C-compilers usually pick the more expensive solution. Alright, we're now able to count 32 bits in less than 32 clock cycles; that's quicker than some implementations of the MUL instruction.
But, we can do better than that!
Let's combine the above with a table-lookup. This will cost you!
The price is 256 bytes for a fast solution; and today we want it fast, otherwise we would be satisfied with the above macro, right? Well, the 16-byte table would be useful if you're running out of space and still need to count bits often.
Let's have a look at the lookup-table solution. We'll start out counting the first 16 bits, then 8 bits as we did above.
.macro _qclz /* this macro will destroy r1 and r2 */ movs r1,#32 /* [1] we have 32 bits in a 32-bit word as usual */ lsrs r2,r0,#16 /* [1] test high 16 bits */ beq skip16\@ /* [1/3] if zero, jump forward */ subs r1,r1,#16 /* [1] otherwise decrement our counter (result) */ mov r0,r2 /* [1] and keep new value */ skip16\@: lsrs r2,r0,#8 /* [1] test bits 31..23 or 15..8 */ beq skip8\@ /* [1/3] if zero, jump forward */ subs r1,r1,#8 /* [1] otherwise decrement our counter (result) */ mov r0,r2 /* [1] and keep new value */ skip8\@: ldr r2,=_qclz_table /* [2] point to our look-up table */ ldrb r0,[r2,r0] /* [2] convert byte to a count between 0 and 8 */ adds r0,r0,r1 /* [1] add the number of bits we've already counted (8, 16 or 24) */ .endm /* (14) this macro always uses 14 clock cycles */
This time, I didn't make the macro so convenient that it takes parameters. The reason is that r1 is destroyed, so if supplying r1 as parameter, we would not get the desired result. It is possible, though, to write the macro, so it tests if we're using r1, and then it could swap the use of r1 and r2; but it would ruin the clarity of this document.
If you want to make the macro check which register is used, you can use the '.ifc' directive. It goes something like this:
.ifc "\rS","r1" /* (if the parameter is r1) */ ... .else /* (the parameter is some other register) */ ... .endif
The _qclz macro uses only 14 clock cycles; personally, I find that quite acceptable when comparing to the original 'worst case' 162 cycles.
OK, now all, that we need is a table (a desk won't do). Let's create one. We'll start by creating a macro that makes the task easier...
.macro rep_byte byte,count .rept \count .byte \byte .endr .endm
The above macro simply repeats the specified byte <count> times.
Let's generate the table then; since we're already working with macros, we'll continue, so you can put them in a .i include file, that they may be re-used in different code...
.macro _qclz_table _qclz_table: rep_byte 8,1 /* table entry 0 */ rep_byte 7,1 /* table entry 1 */ rep_byte 6,2 /* table entry 2 */ rep_byte 5,4 /* table entry 3 */ rep_byte 4,8 /* table entry 4 */ rep_byte 3,16 /* table entry 5 */ rep_byte 2,32 /* table entry 6 */ rep_byte 1,64 /* table entry 7 */ rep_byte 0,128 /* table entry 8 */ .endm
In order to use the qclz, you can instantiate the macro as a subroutine, or you can use it directly.
Let's imagine that you need to call it from C, so you can do the following:
FUNCTION _qclz _qclz /* [14] count the bits in r0 */ bx lr /* [3] return (17 clock cycles) */ .pool /* we'll need the literal pool close to the function */ _qclz_table /* we'll insert the table here */ .align /* let the assembler insert any necessary padding */ ENDFUNC _qclz
The literal pool might not be necessary, depending on your assembler, because the ldr r2,=_qclz_table can be optimized into an adr instruction.
If using the _qclz subroutine, a call to it using the bl instruction would cost 4 extra clock cycles, so that would be a total of 21 clock cycles.
You can easily create a Count Leading Ones by inverting all 32 bits in the input word.
.macro _qclo /* this macro will destroy r1 and r2 */ mvns r0,r0 /* [1] flip all bits in r0 */ _qclz /* [14] count the leading zero bits (which were actually ones) */ .endm
Remember you can also make a complete duplicate of the _qclz macro and modify it, in order to save the last clock cycle if needed.
You can make a check_bits_l, which uses lsls instead of lsrs, so you can make a Count Trailing Zeroes (_ctz) / Count Trailing Ones (_cto) macro. You can make a _qctz or _qcto implementation that has a lsls instead of the lsrs. You would also need to change the table. Saving space is nice, but I recommend writing 4 optimized macros, that you have ready always.
Some of you may want the full 'library' containing a 256-byte look-up table, a 16-byte look-up table and 3 sets of 4 macros. The reason I am not providing such a full library as an attachment here, is that noone will learn from just including an existing file. It's much more fun to be a part of it yourself and understand each detail of what's going on.
Multiplying value seems to be wrong. That would be 0xFD7049FFUL
I have a platform-specific optimization. The bottom of memory of the M0 has the vectors but then a space. If you can place the start of the 256 byte lookup table at an address below $100, it can be set up using a MOVS Rd,#<imm> instruction. Placing the lookup table somewhere else also means that if you use CLZ a lot, you don't end up needing to place it in multiple literal-pools so it saves a cycle and however many duplicates you would need.
For anyone having problems, I can confirm that what Ling LI says is correct. r1 must be initially loaded with #24, not #32.
Unfortunately I can not edit the post, so I cannot correct the article myself.
I have been working on this little issue for a while. When you have a tiny memory budget, 256 bytes may result in something not fitting in-line thus needing a subroutine and of course using cycles. I noted this in the paper version of the original 'Hacker's Delight'. The work of Julius Goryavsky on Harley's algorithm. static char table [64] = {32,20,19,0,0,18,0,7,10,17,0,0,14,0,06,0,0,9,0,16,0,0,01,26,0,13,0,0,24,05,0,0,0,21,0,8,11,0,15,0,0,0,0,2,27,0,25,0,22,0,12,0,0,3,28,0,23,0,4,29,0,0,30,31}x = x &∼ (x>>16);x = x * oxFD7039FF
return table[x >> 26);
I'm a pure assembly-language coder but it strikes me that the VERY clever Count_Trailing_Zeros is related to this using, I suspect, a De Bruijn-like sequence. I was attracted by the 64 byte table. I don't know how much ASM people use but the M0 has it's own zero-page like properties. If an address can be loaded in a single cycle, the code lookup gets quicker. A 32-byte table for the CLZ (or 256 if this doesn't work) and possible other strategies to use the entire bus bandwidth. I'm working with an M0+ with the 1-cycle multiply and a 64-byte cache so code size (and the memory it accesses) is important because I need 2 DMAs working (MicroSD -> RAM and RAM -> DAC. Obviously the smaller the lookup, the more likely I can place it on the Zero-Page and not call a subroutine i.e. keep it in-line.Many thanks!
lix2ng wrote:_qclz should load initial R1 with #24, since the table entry (for residual number of LZs) is added at the end. Correct?
lix2ng wrote:
_qclz should load initial R1 with #24, since the table entry (for residual number of LZs) is added at the end. Correct?
Well spotted. I don't know how this could get past my eyes, but before I make any corrections, I'll make sure that I didn't intend to use 'subs' instead of 'adds' at the end.
The GCC code is absolutely great, but sometimes it will pay to care for details, because the gain can be great. The smaller a snippet of repeated code gets, the more a single clock cycle means.
btw2 if it's made a routine: replacing LDR with ADR to load the table address can save a cycle and a literal (gcc version does this).
You're right, but you should also know that the GNU assembler will automatically use ADR if the table is near.
I've used LDR in order to make the code a little more flexible, since the optimization should be automatic.
(in fact, "LDR=" is nothing but a pseudo-instruction, so it's not even a real LDR instruction).
-But don't take my word for it; verify the code, so you're sure.