If you're using Cortex-M0+, I bet you miss some instructions from the more capable Thumb-2 instruction set. At least occasionally you have searched for routines from various places: books like Hacker's Delight, and web pages like Bit Twiddling Hacks. I did, and I have collected several more useful ones and implemented them in assembler. I'd like to share it on this page for anyone who are interested (attached). Please note though: if you're not after extreme optimizations and the compiler has the function, the builtins are more convenient.
Many enhanced Thumb-2 instructions are for speed and efficiency. They can be replaced with 2 or 3 Thumb-1 instructions at the cost of cycles, (sometimes) bytes and increased register pressure. These are OK since the costs are insignificant and the solutions are simple. Compilers are doing the job well and we don't often notice. Some other instructions need more efforts than a short, straight sequence, and if done wrong they can be slow. I'm not usually an assembler guy but I think they deserve some proper assembler treatment -- our "opponent" is doing it in a single cycle yet we push many registers before doing the real job?
Another reason of the worthiness is because M0+ is a great architecture: efficient interrupt handling, 1-cycle 32-bit wide multiplier, branch-happy short pipeline, ultralow power, and the low prices. It will continue to serve the low end well and the upcoming V8-M.base would largely keep the same Thumb-1 set (regarding userland features) save the inclusion of DIV. So it's late 2016 but it's not too late.
This mini library is shared under the MIT license. You're welcome to use it freely and of course, I can not take any responsibilities.
Here is the manifest:
All routines do not use the stack. The total size is just ~300 bytes, and they are put into subsections so the linker can easily discard unused ones.
I choose the 32 by 16 divisions mostly because 1, small divisions are quite frequent and 2, there is a fast divider (about 29~244 cycles) for 32-bit, and a micro but slower one (about 345~500 cycles). So I want to compliment them with a different approach. When I need 32-bit division I would use one of the other two (please don't ask where they are. I'm not supposed to mention other guys' cycles am I ).
The CTZ routine is quite interesting, listed below. Yeah the code is in inline assembler for easier tooling, make and the C preprocessor stuff etc.
_naked _sect(".text.thumb1_ctz") int thumb1_ctz(uint32_t x) { __asm( "cmp r0, #0" br "beq 1f" br ".thumb_func" br ".global thumb1_ctznz" br "thumb1_ctznz:" br "rsbs r1, r0, #0" br // (1) "ands r0, r1" br // (1) "ldr r1, =0x077cb531" br // (2) de Bruijn sequence. "muls r0, r1" br // (2) "adr r1, .L_ctzlut" br "lsrs r0, #27" br // (3) "ldrb r0, [r1, r0]" br // (3) "bx lr" br "1:" br "movs r0, #32" br "bx lr" br ".ltorg" br ".L_ctzlut:" br ".byte 0, 1, 28, 2, 29, 14, 24, 3" br ".byte 30, 22, 20, 15, 25, 17, 4, 8" br ".byte 31, 27, 13, 23, 21, 19, 16, 7" br ".byte 26, 12, 18, 6, 11, 5, 10, 9" br ); }
(1) is x&(-x), it isolates the least significant bit 1 from a word. For example 0x12340 gives 0x40.
(2) multiply this bit with a magic constant, 0x077cb531, and
(3) take the top five bits off the product, use it as the index to load a byte from a table. Job done.
So the magic is really in the constant. Bithacks mentioned it's a "de Bruijn sequence", but what does it mean? We take a look at its binary representation, and since the bit after step (1) is a power of 2, multiplying it with the number is a shift: *2^i is shifting left i positions. The following are how the top 5 bits change when shifted 0 to 31 steps:
0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1
......
0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0
0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0
0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0
0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0
The series is: 0, 1, 3, 7, 14, 29, 27, 23, 15, 31, 30, 28, 25, 18, 5, 11, 22, 13, 26, 21, 10, 20, 9, 19, 6, 12, 24, 17, 2, 4, 8, 16. It is the full 0~31 without duplication! This is the magic that makes it work.
@Mr. Spock Yes there is an error in thumb1_smulh(), a LSRS was mistakenly written as ASRS when converting from unsigned thumb1_umulh() to the signed version. Aside from your (0x7ffffffff, -1) case, (-1, -1) also catches this bug.
Due to recent change of the community user account I can't edit this post anymore, nor can I add attachment in this reply. Here's a one-line patch to fix this problem:
diff -u thumb1-20161129/thumb1.c thumb1-20170223/thumb1.c
--- thumb1-20161129/thumb1.c 2016-11-29 11:40:40.000000000 +0800
+++ thumb1-20170223/thumb1.c 2017-02-23 14:16:10.287823700 +0800
@@ -176,7 +176,7 @@
"muls r2, r0" br
"asrs r1, #16" br
"muls r0, r1" br
- "asrs r3, #16" br
+ "lsrs r3, #16" br
"muls r1, r4" br
"adds r3, r2" br
"mov r4, r12" br
Thank you for this bug.
Please check function: thumb1_smulh()
The function return : 0xFFFFFFFE when multiplying (2147483647) by (-1).
The correct result should be 0xFFFFFFFF
(2147483647 * -1 =0xFFFFFFFF,80000001
Thank you