Thumb-1 assembler routines implementing some useful Thumb-2 instructions

November 28, 2016

6 minute read time.

If you're using Cortex-M0+, I bet you miss some instructions from the more capable Thumb-2 instruction set. At least occasionally you have searched for routines from various places: books like Hacker's Delight, and web pages like Bit Twiddling Hacks. I did, and I have collected several more useful ones and implemented them in assembler. I'd like to share it on this page for anyone who are interested (attached). Please note though: if you're not after extreme optimizations and the compiler has the function, the builtins are more convenient.

Many enhanced Thumb-2 instructions are for speed and efficiency. They can be replaced with 2 or 3 Thumb-1 instructions at the cost of cycles, (sometimes) bytes and increased register pressure. These are OK since the costs are insignificant and the solutions are simple. Compilers are doing the job well and we don't often notice. Some other instructions need more efforts than a short, straight sequence, and if done wrong they can be slow. I'm not usually an assembler guy but I think they deserve some proper assembler treatment -- our "opponent" is doing it in a single cycle yet we push many registers before doing the real job?

Another reason of the worthiness is because M0+ is a great architecture: efficient interrupt handling, 1-cycle 32-bit wide multiplier, branch-happy short pipeline, ultralow power, and the low prices. It will continue to serve the low end well and the upcoming V8-M.base would largely keep the same Thumb-1 set (regarding userland features) save the inclusion of DIV. So it's late 2016 but it's not too late.

This mini library is shared under the MIT license. You're welcome to use it freely and of course, I can not take any responsibilities.

Here is the manifest:

Function	Thumb-2 equivalent	Size (code+rodata)	Cycles	Note
thumb1_clz	CLZ	52 bytes	16~19	Count Leading Zeros, from Jens Bauer's CLZ article.
thumb1_ctz	RBIT; CLZ	60 bytes	13 or 11	Count Trailing Zeros, from Bit Twiddling Hacks.
thumb1_rbit	RBIT	52 bytes	24	Reverse Bits, from Hacker's Delight.
thumb1_umulh	UMULL, high part	40 bytes	21	High 32-bit result of unsigned 32x32->64, source ditto.
thumb1_smulh	SMULL, high part	40 bytes	21	Signed version of above, source ditto.
thumb1_udiv16	UDIV, smaller range	(shared with below)	122~138	Unsigned 32/16->16 division, straightforward shift-subtract algorithm.
thumb1_sdiv16	SDIV, smaller range	54 bytes	130~148	Above with sign handling.

All routines do not use the stack. The total size is just ~300 bytes, and they are put into subsections so the linker can easily discard unused ones.

I choose the 32 by 16 divisions mostly because 1, small divisions are quite frequent and 2, there is a fast divider (about 29~244 cycles) for 32-bit, and a micro but slower one (about 345~500 cycles). So I want to compliment them with a different approach. When I need 32-bit division I would use one of the other two (please don't ask where they are. I'm not supposed to mention other guys' cycles am I ).

The CTZ routine is quite interesting, listed below. Yeah the code is in inline assembler for easier tooling, make and the C preprocessor stuff etc.

_naked _sect(".text.thumb1_ctz")
int thumb1_ctz(uint32_t x)
{
    __asm(
        "cmp    r0, #0" br
        "beq    1f" br
        ".thumb_func" br
        ".global thumb1_ctznz" br
        "thumb1_ctznz:" br
        "rsbs   r1, r0, #0" br  // (1)
        "ands   r0, r1" br      // (1)
        "ldr    r1, =0x077cb531" br // (2) de Bruijn sequence.
        "muls   r0, r1" br      // (2)
        "adr    r1, .L_ctzlut" br
        "lsrs   r0, #27" br     // (3)
        "ldrb   r0, [r1, r0]" br // (3)
        "bx     lr" br
        "1:" br
        "movs   r0, #32" br
        "bx     lr" br
        ".ltorg" br
        ".L_ctzlut:" br
        ".byte 0, 1, 28, 2, 29, 14, 24, 3" br
        ".byte 30, 22, 20, 15, 25, 17, 4, 8" br
        ".byte 31, 27, 13, 23, 21, 19, 16, 7" br
        ".byte 26, 12, 18, 6, 11, 5, 10, 9" br
    );
}

(1) is x&(-x), it isolates the least significant bit 1 from a word. For example 0x12340 gives 0x40.

(2) multiply this bit with a magic constant, 0x077cb531, and

(3) take the top five bits off the product, use it as the index to load a byte from a table. Job done.

So the magic is really in the constant. Bithacks mentioned it's a "de Bruijn sequence", but what does it mean? We take a look at its binary representation, and since the bit after step (1) is a power of 2, multiplying it with the number is a shift: *2^i is shifting left i positions. The following are how the top 5 bits change when shifted 0 to 31 steps:

0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1

......

0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1

0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0

0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0

0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0

0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0

The series is: 0, 1, 3, 7, 14, 29, 27, 23, 15, 31, 30, 28, 25, 18, 5, 11, 22, 13, 26, 21, 10, 20, 9, 19, 6, 12, 24, 17, 2, 4, 8, 16. It is the full 0~31 without duplication! This is the magic that makes it work.

thumb1-20161129.zip

2 comments
0 members are here

Architectures and Processors blog

Introducing GICv5: Scalable and secure interrupt management for Arm

Christoffer Dall

Introducing Arm GICv5: a scalable, hypervisor-free interrupt controller for modern multi-core systems with improved virtualization and real-time support.
- April 28, 2025
Getting started with AARCHMRS Features.json using Python

Joh

A high-level introduction to the Arm Architecture Machine Readable Specification (AARCHMRS) Features.json with some examples to interpret and start to work with the available data using Python.
- April 8, 2025
Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC

Samer El-Haj-Mahmoud

Arm and 9elements Cyber Security have brought a prototype of OpenBMC to the Arm Neoverse Compute Subsystem (CSS) to advancing server manageability.
- January 28, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

Thumb-1 assembler routines implementing some useful Thumb-2 instructions

Introducing GICv5: Scalable and secure interrupt management for Arm

Getting started with AARCHMRS Features.json using Python

Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC