Architectures and Processors blog
Thumb-1 assembler routines implementing some useful Thumb-2 instructions

Ling LI
November 28, 2016
6 minute read time.

If you're using the Cortex-M0+, I bet you miss some instructions from the more capable Thumb-2 instruction set. At least occasionally you have searched for replacement routines in various places: books like Hacker's Delight, and web pages like Bit Twiddling Hacks. I did, and I have collected several of the more useful ones and implemented them in assembler. I'd like to share them on this page for anyone who is interested (attached). Please note though: if you're not after extreme optimization and the compiler provides the function, the builtins are more convenient.

Many of the enhanced Thumb-2 instructions exist for speed and efficiency. They can be replaced with 2 or 3 Thumb-1 instructions at the cost of cycles, (sometimes) bytes, and increased register pressure. That is fine, since the costs are insignificant and the solutions are simple; compilers do this job well and we rarely notice. Some other instructions need more than a short, straight-line sequence, and if done wrong they can be slow. I'm not usually an assembler guy, but I think these deserve proper assembler treatment -- our "opponent" does it in a single cycle, yet we push many registers before even starting the real job?
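As an illustration of the simple cases, here is a minimal C sketch of what the compiler does for Thumb-2's single-instruction UBFX (unsigned bit-field extract) when only Thumb-1 is available: the classic two-shift sequence. The helper name `ubfx_emulated` is mine, not from the attached library.

```c
#include <stdint.h>

/* Illustrative sketch: Thumb-2 has UBFX, a single-instruction unsigned
 * bit-field extract.  On Thumb-1 the compiler emits two shifts instead:
 * LSLS to discard the bits above the field, then LSRS to align it.
 * Requires 0 < width and lsb + width < 33 (and < 32 to avoid a
 * shift-by-32, which is undefined in C). */
uint32_t ubfx_emulated(uint32_t x, unsigned lsb, unsigned width)
{
    /* lsls r0, #(32 - lsb - width) ; lsrs r0, #(32 - width) */
    return (x << (32u - lsb - width)) >> (32u - width);
}
```

Two instructions instead of one: a cycle and two bytes, but no extra registers.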

Another reason this is worthwhile is that the M0+ is a great architecture: efficient interrupt handling, a 1-cycle 32-bit wide multiplier, a branch-happy short pipeline, ultra-low power, and low prices. It will continue to serve the low end well, and the upcoming Armv8-M Baseline largely keeps the same Thumb-1 set (as far as userland features go), save for the inclusion of DIV. So it's late 2016, but it's not too late.

This mini library is shared under the MIT license. You're welcome to use it freely, but of course I cannot take any responsibility.

Here is the manifest:

Function       | Thumb-2 equivalent   | Size (code+rodata)  | Cycles   | Note
---------------|----------------------|---------------------|----------|-----
thumb1_clz     | CLZ                  | 52 bytes            | 16~19    | Count Leading Zeros, from Jens Bauer's CLZ article.
thumb1_ctz     | RBIT; CLZ            | 60 bytes            | 13 or 11 | Count Trailing Zeros, from Bit Twiddling Hacks.
thumb1_rbit    | RBIT                 | 52 bytes            | 24       | Reverse Bits, from Hacker's Delight.
thumb1_umulh   | UMULL, high part     | 40 bytes            | 21       | High 32-bit result of unsigned 32x32->64, source ditto.
thumb1_smulh   | SMULL, high part     | 40 bytes            | 21       | Signed version of the above, source ditto.
thumb1_udiv16  | UDIV, smaller range  | (shared with below) | 122~138  | Unsigned 32/16->16 division, straightforward shift-subtract algorithm.
thumb1_sdiv16  | SDIV, smaller range  | 54 bytes            | 130~148  | The above with sign handling.

None of the routines use the stack. The total size is just ~300 bytes, and they are placed in subsections so the linker can easily discard unused ones.
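To show what thumb1_umulh is up against, here is a minimal C sketch of its job (the high word of an unsigned 32x32->64 multiply) using only 32-bit arithmetic, via four 16x16 partial products -- the same idea the assembler has to express with MULS. This is an illustrative sketch under that assumption, not the attached code.

```c
#include <stdint.h>

/* High 32 bits of an unsigned 32x32->64 multiply using only 32-bit
 * arithmetic: four 16x16 partial products, summed with explicit
 * carry handling in the middle column. */
uint32_t umulh_c(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffffu, ah = a >> 16;
    uint32_t bl = b & 0xffffu, bh = b >> 16;

    uint32_t ll = al * bl;          /* contributes to bits  0..31 */
    uint32_t lh = al * bh;          /* contributes to bits 16..47 */
    uint32_t hl = ah * bl;          /* contributes to bits 16..47 */
    uint32_t hh = ah * bh;          /* contributes to bits 32..63 */

    uint32_t mid = lh + (ll >> 16); /* cannot overflow 32 bits */
    mid += hl;                      /* this one can: detect the carry */
    uint32_t carry = (mid < hl) ? 1u : 0u;

    return hh + (mid >> 16) + (carry << 16);
}
```

The carry detection (`mid < hl` after the add) is exactly the kind of extra work that makes the hand-written assembler version worthwhile.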

I chose the 32-by-16 divisions mostly because (1) small divisions are quite frequent, and (2) there is already a fast 32-bit divider (about 29~244 cycles) and a tiny but slower one (about 345~500 cycles), so I wanted to complement them with a different approach. When I need 32-bit division I would use one of those two (please don't ask where they are; I'm not supposed to mention other people's cycle counts, am I?).
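The shift-subtract algorithm behind thumb1_udiv16 can be sketched in a few lines of C (an illustrative sketch; the attached routine is hand-written assembler, and the helper name `udiv16_c` is mine):

```c
#include <stdint.h>

/* Restoring shift-subtract division: unsigned 32/16 -> 16-bit quotient.
 * Preconditions: 1 <= d <= 0xFFFF and n < (d << 16), so the quotient
 * fits in 16 bits. */
uint32_t udiv16_c(uint32_t n, uint32_t d)
{
    uint32_t q = 0;
    for (int i = 15; i >= 0; i--) {
        if (n >= (d << i)) {   /* does the shifted divisor fit? */
            n -= d << i;       /* subtract it ...                */
            q |= 1u << i;      /* ... and set the quotient bit   */
        }
    }
    return q;                  /* the remainder is left in n     */
}
```

One compare-subtract step per quotient bit: 16 iterations instead of the 32 a full 32-bit division would need, which is where the cycle savings come from.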

The CTZ routine is quite interesting; it is listed below. The code is written as inline assembler for easier tooling: make, the C preprocessor, and so on.

/* _naked, _sect and br are helper macros from the attached thumb1.c
 * (br presumably expands to "\n\t"). */
_naked _sect(".text.thumb1_ctz")
int thumb1_ctz(uint32_t x)
{
    __asm(
        "cmp    r0, #0" br
        "beq    1f" br
        ".thumb_func" br
        ".global thumb1_ctznz" br
        "thumb1_ctznz:" br
        "rsbs   r1, r0, #0" br  // (1)
        "ands   r0, r1" br      // (1)
        "ldr    r1, =0x077cb531" br // (2) de Bruijn sequence.
        "muls   r0, r1" br      // (2)
        "adr    r1, .L_ctzlut" br
        "lsrs   r0, #27" br     // (3)
        "ldrb   r0, [r1, r0]" br // (3)
        "bx     lr" br
        "1:" br
        "movs   r0, #32" br
        "bx     lr" br
        ".ltorg" br
        ".L_ctzlut:" br
        ".byte 0, 1, 28, 2, 29, 14, 24, 3" br
        ".byte 30, 22, 20, 15, 25, 17, 4, 8" br
        ".byte 31, 27, 13, 23, 21, 19, 16, 7" br
        ".byte 26, 12, 18, 6, 11, 5, 10, 9" br
    );
}
 

(1) is x & (-x): it isolates the least significant set bit of a word. For example, 0x12340 gives 0x40.

(2) multiplies this bit by a magic constant, 0x077cb531, and

(3) takes the top five bits of the product and uses them as the index to load a byte from a table. Job done.
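The three steps can be mirrored in portable C for testing on a host machine (a sketch; the lookup table is the same as .L_ctzlut above, and the function name `ctz_c` is mine):

```c
#include <stdint.h>

/* Same table as .L_ctzlut in the assembler listing. */
static const uint8_t ctzlut[32] = {
     0,  1, 28,  2, 29, 14, 24,  3,
    30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7,
    26, 12, 18,  6, 11,  5, 10,  9,
};

int ctz_c(uint32_t x)
{
    if (x == 0)
        return 32;
    x &= (uint32_t)-x;                       /* (1) isolate lowest set bit */
    return ctzlut[(x * 0x077cb531u) >> 27];  /* (2) hash, (3) look up      */
}
```

The multiplication wraps modulo 2^32, which is exactly what the 32-bit MULS does in the assembler version.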

So the magic is really in the constant. Bit Twiddling Hacks calls it a "de Bruijn sequence", but what does that mean? Take a look at its binary representation. Since the value after step (1) is a power of two, multiplying by it is a shift: multiplying by 2^i shifts left by i positions. Here is how the top 5 bits change as the constant is shifted 0 to 31 steps:

The constant 0x077cb531 in binary (bit 31 first):

0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1

Each left shift appends a 0 on the right and slides the top-5-bit window one position further along this pattern:

shift  0: 00000 -> 0
shift  1: 00001 -> 1
shift  2: 00011 -> 3
......
shift 30: 01000 -> 8
shift 31: 10000 -> 16

The series is: 0, 1, 3, 7, 14, 29, 27, 23, 15, 31, 30, 28, 25, 18, 5, 11, 22, 13, 26, 21, 10, 20, 9, 19, 6, 12, 24, 17, 2, 4, 8, 16. It is the full 0~31 without duplication! This is the magic that makes it work.
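The "full 0~31 without duplication" claim is easy to verify mechanically with a few lines of C (a sketch; the function name `debruijn_windows_distinct` is mine):

```c
#include <stdint.h>

/* Check the de Bruijn property: the top five bits of 0x077cb531 << i
 * take every value 0..31 exactly once as i runs over 0..31.
 * Returns 1 if the property holds, 0 otherwise. */
int debruijn_windows_distinct(void)
{
    uint32_t seen = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t w = (0x077cb531u << i) >> 27;  /* 5-bit window at shift i */
        seen |= 1u << w;                        /* mark this value as seen */
    }
    return seen == 0xFFFFFFFFu;                 /* all 32 values occurred? */
}
```

Since 32 shifts produce 32 distinct 5-bit values, the windows are a permutation of 0~31, and the table simply inverts that permutation.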

 
thumb1-20161129.zip
Comment from Ling LI, over 8 years ago:

@Mr. Spock Yes, there is an error in thumb1_smulh(): an LSRS was mistakenly written as ASRS when converting the unsigned thumb1_umulh() to the signed version. Aside from your (0x7fffffff, -1) case, (-1, -1) also catches this bug.

Due to a recent change of the community user account I can't edit this post anymore, nor can I add an attachment in this reply. Here's a one-line patch to fix the problem:

diff -u thumb1-20161129/thumb1.c thumb1-20170223/thumb1.c
--- thumb1-20161129/thumb1.c    2016-11-29 11:40:40.000000000 +0800
+++ thumb1-20170223/thumb1.c    2017-02-23 14:16:10.287823700 +0800
@@ -176,7 +176,7 @@
         "muls   r2, r0" br
         "asrs   r1, #16" br
         "muls   r0, r1" br
-        "asrs   r3, #16" br
+        "lsrs   r3, #16" br
         "muls   r1, r4" br
         "adds   r3, r2" br
         "mov    r4, r12" br

Thank you for reporting this bug.
