Looking for alternates for this instruction.
It is worth noting that the CLZ implementation as shown is never called with, and does not support being called with, the value zero within CMSIS.
The Cortex-M0 supports register based shifts, which does permit reasonably fast and deterministic implementations without requiring look-up tables, for example:
uint32_t clz(uint32_t data) { uint32_t count, shift, value; count = 31; // MOVS Rc,#31 #ifdef CLZ_SUPPORT_ZERO if(data == 0) // CMP Rd,#0 { // BNE %1 count = 32; // MOVS Rc,#32 } // 1: #endif for(shift=16;shift;shift>>=1) // MOVS Rs,#16 { // 2: <--+ value = data >> shift; // MOV Rv,Rd | // LSRS Rv,Rs | if(value) { // BEQ %3 | data = value; // MOV Rd,Rv x5 count = count - shift; // SUBS Rc,Rs | } // 3: | } // LSRS Rs,#1 | // BNE %2 ----+ return count; // MOV Rd,Rc } // BX lr
Support for correctly returning a result of 32 for the value zero can be enabled by defining CLZ_SUPPORT_ZERO.
The best choice of implementation will depend on your particular constraints and anticipated data-set. Dependent on input value, on a Cortex-M0+, the above code should take between 41 and 47 cycles or 38 and 44 cycles, with or without zero support respectively, and consume between 22 and 28 bytes; the CMSIS code should take between 10 and 137 cycles and require 24 bytes; Miro's suggestion should always take around 17 cycles, but consume around 308 bytes.
hth
Simon.
Hello,
In CMSIS, Arm_Math.h, there is defined an alternative which you may use. There are some #defines around it, so if you're having trouble linking it, make sure you have the correct definitions, or just copy and paste __CLZ into your code somewhere.
static __INLINE uint32_t __CLZ( q31_t data) { uint32_t count = 0; uint32_t mask = 0x80000000; while((data & mask) == 0) { count += 1u; mask = mask >> 1u; } return (count); }
Cheers,
Dan
If you are interested in a better performance that beats the pants out of the brute force solution proposed above, you might use the following implementation:
static __INLINE uint32_t __CLZ(uint32_t x) { extern uint8_t const log2Lkup[256]; if (x >= 0x00010000U) { if (x >= 0x01000000U) { return 8U - log2Lkup[x >> 24]; } else { return 16U - log2Lkup[x >> 16]; } } else { if (x >= 0x00000100U) { return 24U - log2Lkup[x >> 8]; } else { return 32U - log2Lkup[x]; } } }
The function would need the log2 (binary logarithm) lookup table defined in a .c file:
uint8_t const log2Lkup[256] = { 0U, 1U, 2U, 2U, 3U, 3U, 3U, 3U, 4U, 4U, 4U, 4U, 4U, 4U, 4U, 4U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 5U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 6U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 7U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U, 8U };
The __CLZ() implementation proposed above does not have loops and is deterministic, meaning that it takes the same number of instructions for all arguments x. It is better than most other algorithms you can find online, including the methods from the "Hacker's Delight" and Anderson's bit twiddling hacks (interested folks can google for these).
--Miro
If the goal is to reduce the code size, for the price of just one more comparison in the critical path of the code, the size of the log2Lkup lookup table can be reduced from 256 to just 16 bytes.
So here is the deterministic code that takes about 94 bytes of code space, including the lookup table:
static inline uint32_t CLZ(uint32_t x) { static uint8_t const log2Lkup[] = { 0U, 1U, 2U, 2U, 3U, 3U, 3U, 3U, 4U, 4U, 4U, 4U, 4U, 4U, 4U, 4U }; uint32_t n; if (x >= (1U << 16)) { if (x >= (1U << 24)) { if (x >= (1 << 28)) { n = 28U; } else { n = 24U; } } else { if (x >= (1U << 20)) { n = 20U; } else { n = 16U; } } } else { if (x >= (1U << 8)) { if (x >= (1U << 12)) { n = 12U; } else { n = 8U; } } else { if (x >= (1U << 4)) { n = 4U; } else { n = 0U; } } } return 32U - n - log2Lkup[x >> n]; }
Here is the disassembly of the CLZ function for the Cortex-M0 core generated with IAR EWARM (high level of optimization):
volatile uint32_t y = CLZ(x); 0x272: 0x9800 LDR R0, [SP] if (x >= (1U << 16)) { 0x274: 0x2280 MOVS R2, #128 ; 0x80 0x276: 0x0252 LSLS R2, R2, #9 0x278: 0x4290 CMP R0, R2 0x27a: 0xd310 BCC.N ??main_0 ; 0x29e if (x >= (1U << 24)) { 0x27c: 0x0211 LSLS R1, R2, #8 0x27e: 0x4288 CMP R0, R1 0x280: 0xd306 BCC.N ??main_1 ; 0x290 if (x >= (1 << 28)) { 0x282: 0x0109 LSLS R1, R1, #4 0x284: 0x4288 CMP R0, R1 0x286: 0xd301 BCC.N ??main_2 ; 0x28c n = 28U; 0x288: 0x211c MOVS R1, #28 ; 0x1c 0x28a: 0xe014 B.N ??main_3 ; 0x2b6 n = 24U; ??main_2: 0x28c: 0x2118 MOVS R1, #24 ; 0x18 0x28e: 0xe012 B.N ??main_3 ; 0x2b6 if (x >= (1U << 20)) { ??main_1: 0x290: 0x0909 LSRS R1, R1, #4 0x292: 0x4288 CMP R0, R1 0x294: 0xd301 BCC.N ??main_4 ; 0x29a n = 20U; 0x296: 0x2114 MOVS R1, #20 ; 0x14 0x298: 0xe00d B.N ??main_3 ; 0x2b6 n = 16U; ??main_4: 0x29a: 0x2110 MOVS R1, #16 ; 0x10 0x29c: 0xe00b B.N ??main_3 ; 0x2b6 if (x >= (1U << 8)) { ??main_0: 0x29e: 0x28ff CMP R0, #255 ; 0xff 0x2a0: 0xd904 BLS.N ??main_5 ; 0x2ac if (x >= (1U << 12)) { 0x2a2: 0x0912 LSRS R2, R2, #4 0x2a4: 0x4290 CMP R0, R2 0x2a6: 0xd206 BCS.N ??main_3 ; 0x2b6 n = 8U; 0x2a8: 0x2108 MOVS R1, #8 0x2aa: 0xe004 B.N ??main_3 ; 0x2b6 if (x >= (1U << 4)) { ??main_5: 0x2ac: 0x2810 CMP R0, #16 ; 0x10 0x2ae: 0xd301 BCC.N ??main_6 ; 0x2b4 n = 4U; 0x2b0: 0x2104 MOVS R1, #4 0x2b2: 0xe000 B.N ??main_3 ; 0x2b6 n = 0U; ??main_6: 0x2b4: 0x2100 MOVS R1, #0 ??main_3: 0x2b6: 0x2220 MOVS R2, #32 ; 0x20 0x2b8: 0x1a52 SUBS R2, R2, R1 0x2ba: 0x40c8 LSRS R0, R0, R1 0x2bc: 0xa129 ADR.N R1, ??log2Lkup ; 0x364 0x2be: 0x5c08 LDRB R0, [R1, R0] 0x2c0: 0x1a10 SUBS R0, R2, R0 0x2c2: 0x9000 STR R0, [SP]
By my count, this code executes always in just 15 instructions. The number of clock cycles will be of course slightly bigger, especially for branch instructions that break the instruction pipeline.
For an assembly language subroutine (which only takes 14 clock cycles), see A fairly quick Count Leading Zeroes for Cortex-M0.