Architectures and Processors blog
Thumb-1 assembler routines implementing some useful Thumb-2 instructions

Ling LI
November 28, 2016
6 minute read time.

If you're using the Cortex-M0+, I bet you miss some instructions from the more capable Thumb-2 instruction set. At least occasionally you have searched for replacement routines in various places: books like Hacker's Delight, and web pages like Bit Twiddling Hacks. I did, and I have collected several of the more useful ones and implemented them in assembler. I'd like to share them on this page for anyone who is interested (attached). Please note though: if you're not after extreme optimization and the compiler provides the function, the builtins are more convenient.

Many of the enhanced Thumb-2 instructions exist for speed and efficiency. They can be replaced with 2 or 3 Thumb-1 instructions at the cost of cycles, (sometimes) bytes, and increased register pressure. That is fine, since the costs are insignificant and the solutions are simple; compilers do this job well and we rarely notice. Some other instructions need more than a short, straight-line sequence, and if done wrong they can be slow. I'm not usually an assembler guy, but I think these deserve proper assembler treatment -- our "opponent" does it in a single cycle, yet we push many registers before even starting the real job?
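As an illustration of the simple cases, here is a minimal C sketch of what the compiler does for Thumb-2's single-instruction UBFX (unsigned bit-field extract) when only Thumb-1 is available: the classic two-shift sequence. The helper name `ubfx_emulated` is mine, not from the attached library.

```c
#include <stdint.h>

/* Illustrative sketch: Thumb-2 has UBFX, a single-instruction unsigned
 * bit-field extract.  On Thumb-1 the compiler emits two shifts instead:
 * LSLS to discard the bits above the field, then LSRS to align it.
 * Requires 0 < width and lsb + width < 33 (and < 32 to avoid a
 * shift-by-32, which is undefined in C). */
uint32_t ubfx_emulated(uint32_t x, unsigned lsb, unsigned width)
{
    /* lsls r0, #(32 - lsb - width) ; lsrs r0, #(32 - width) */
    return (x << (32u - lsb - width)) >> (32u - width);
}
```

Two instructions instead of one: a cycle and two bytes, but no extra registers.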

Another reason this is worthwhile is that the M0+ is a great architecture: efficient interrupt handling, a 1-cycle 32-bit wide multiplier, a branch-happy short pipeline, ultra-low power, and low prices. It will continue to serve the low end well, and the upcoming Armv8-M Baseline largely keeps the same Thumb-1 set (as far as userland features go), save for the inclusion of DIV. So it's late 2016, but it's not too late.

This mini library is shared under the MIT license. You're welcome to use it freely, but of course I cannot take any responsibility.

Here is the manifest:

Function       | Thumb-2 equivalent   | Size (code+rodata)  | Cycles   | Note
---------------|----------------------|---------------------|----------|-----
thumb1_clz     | CLZ                  | 52 bytes            | 16~19    | Count Leading Zeros, from Jens Bauer's CLZ article.
thumb1_ctz     | RBIT; CLZ            | 60 bytes            | 13 or 11 | Count Trailing Zeros, from Bit Twiddling Hacks.
thumb1_rbit    | RBIT                 | 52 bytes            | 24       | Reverse Bits, from Hacker's Delight.
thumb1_umulh   | UMULL, high part     | 40 bytes            | 21       | High 32-bit result of unsigned 32x32->64, source ditto.
thumb1_smulh   | SMULL, high part     | 40 bytes            | 21       | Signed version of the above, source ditto.
thumb1_udiv16  | UDIV, smaller range  | (shared with below) | 122~138  | Unsigned 32/16->16 division, straightforward shift-subtract algorithm.
thumb1_sdiv16  | SDIV, smaller range  | 54 bytes            | 130~148  | The above with sign handling.

None of the routines use the stack. The total size is just ~300 bytes, and they are placed in subsections so the linker can easily discard unused ones.
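To show what thumb1_umulh is up against, here is a minimal C sketch of its job (the high word of an unsigned 32x32->64 multiply) using only 32-bit arithmetic, via four 16x16 partial products -- the same idea the assembler has to express with MULS. This is an illustrative sketch under that assumption, not the attached code.

```c
#include <stdint.h>

/* High 32 bits of an unsigned 32x32->64 multiply using only 32-bit
 * arithmetic: four 16x16 partial products, summed with explicit
 * carry handling in the middle column. */
uint32_t umulh_c(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffffu, ah = a >> 16;
    uint32_t bl = b & 0xffffu, bh = b >> 16;

    uint32_t ll = al * bl;          /* contributes to bits  0..31 */
    uint32_t lh = al * bh;          /* contributes to bits 16..47 */
    uint32_t hl = ah * bl;          /* contributes to bits 16..47 */
    uint32_t hh = ah * bh;          /* contributes to bits 32..63 */

    uint32_t mid = lh + (ll >> 16); /* cannot overflow 32 bits */
    mid += hl;                      /* this one can: detect the carry */
    uint32_t carry = (mid < hl) ? 1u : 0u;

    return hh + (mid >> 16) + (carry << 16);
}
```

The carry detection (`mid < hl` after the add) is exactly the kind of extra work that makes the hand-written assembler version worthwhile.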

I chose the 32-by-16 divisions mostly because (1) small divisions are quite frequent, and (2) there is already a fast 32-bit divider (about 29~244 cycles) and a tiny but slower one (about 345~500 cycles), so I wanted to complement them with a different approach. When I need 32-bit division I would use one of those two (please don't ask where they are; I'm not supposed to mention other people's cycle counts, am I?).
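The shift-subtract algorithm behind thumb1_udiv16 can be sketched in a few lines of C (an illustrative sketch; the attached routine is hand-written assembler, and the helper name `udiv16_c` is mine):

```c
#include <stdint.h>

/* Restoring shift-subtract division: unsigned 32/16 -> 16-bit quotient.
 * Preconditions: 1 <= d <= 0xFFFF and n < (d << 16), so the quotient
 * fits in 16 bits. */
uint32_t udiv16_c(uint32_t n, uint32_t d)
{
    uint32_t q = 0;
    for (int i = 15; i >= 0; i--) {
        if (n >= (d << i)) {   /* does the shifted divisor fit? */
            n -= d << i;       /* subtract it ...                */
            q |= 1u << i;      /* ... and set the quotient bit   */
        }
    }
    return q;                  /* the remainder is left in n     */
}
```

One compare-subtract step per quotient bit: 16 iterations instead of the 32 a full 32-bit division would need, which is where the cycle savings come from.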

The CTZ routine is quite interesting; it is listed below. The code is written as inline assembler for easier tooling: make, the C preprocessor, and so on.

/* _naked, _sect and br are helper macros from the attached thumb1.c
 * (br presumably expands to "\n\t"). */
_naked _sect(".text.thumb1_ctz")
int thumb1_ctz(uint32_t x)
{
    __asm(
        "cmp    r0, #0" br
        "beq    1f" br
        ".thumb_func" br
        ".global thumb1_ctznz" br
        "thumb1_ctznz:" br
        "rsbs   r1, r0, #0" br  // (1)
        "ands   r0, r1" br      // (1)
        "ldr    r1, =0x077cb531" br // (2) de Bruijn sequence.
        "muls   r0, r1" br      // (2)
        "adr    r1, .L_ctzlut" br
        "lsrs   r0, #27" br     // (3)
        "ldrb   r0, [r1, r0]" br // (3)
        "bx     lr" br
        "1:" br
        "movs   r0, #32" br
        "bx     lr" br
        ".ltorg" br
        ".L_ctzlut:" br
        ".byte 0, 1, 28, 2, 29, 14, 24, 3" br
        ".byte 30, 22, 20, 15, 25, 17, 4, 8" br
        ".byte 31, 27, 13, 23, 21, 19, 16, 7" br
        ".byte 26, 12, 18, 6, 11, 5, 10, 9" br
    );
}
 

(1) is x & (-x): it isolates the least significant set bit of a word. For example, 0x12340 gives 0x40.

(2) multiplies this bit by a magic constant, 0x077cb531, and

(3) takes the top five bits of the product and uses them as the index to load a byte from a table. Job done.
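The three steps can be mirrored in portable C for testing on a host machine (a sketch; the lookup table is the same as .L_ctzlut above, and the function name `ctz_c` is mine):

```c
#include <stdint.h>

/* Same table as .L_ctzlut in the assembler listing. */
static const uint8_t ctzlut[32] = {
     0,  1, 28,  2, 29, 14, 24,  3,
    30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7,
    26, 12, 18,  6, 11,  5, 10,  9,
};

int ctz_c(uint32_t x)
{
    if (x == 0)
        return 32;
    x &= (uint32_t)-x;                       /* (1) isolate lowest set bit */
    return ctzlut[(x * 0x077cb531u) >> 27];  /* (2) hash, (3) look up      */
}
```

The multiplication wraps modulo 2^32, which is exactly what the 32-bit MULS does in the assembler version.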

So the magic is really in the constant. Bit Twiddling Hacks calls it a "de Bruijn sequence", but what does that mean? Take a look at its binary representation. Since the value after step (1) is a power of two, multiplying by it is a shift: multiplying by 2^i shifts left by i positions. Here is how the top 5 bits change as the constant is shifted 0 to 31 steps:

The constant 0x077cb531 in binary (bit 31 first):

0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1

Each left shift appends a 0 on the right and slides the top-5-bit window one position further along this pattern:

shift  0: 00000 -> 0
shift  1: 00001 -> 1
shift  2: 00011 -> 3
......
shift 30: 01000 -> 8
shift 31: 10000 -> 16

The series is: 0, 1, 3, 7, 14, 29, 27, 23, 15, 31, 30, 28, 25, 18, 5, 11, 22, 13, 26, 21, 10, 20, 9, 19, 6, 12, 24, 17, 2, 4, 8, 16. It is the full 0~31 without duplication! This is the magic that makes it work.
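The "full 0~31 without duplication" claim is easy to verify mechanically with a few lines of C (a sketch; the function name `debruijn_windows_distinct` is mine):

```c
#include <stdint.h>

/* Check the de Bruijn property: the top five bits of 0x077cb531 << i
 * take every value 0..31 exactly once as i runs over 0..31.
 * Returns 1 if the property holds, 0 otherwise. */
int debruijn_windows_distinct(void)
{
    uint32_t seen = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t w = (0x077cb531u << i) >> 27;  /* 5-bit window at shift i */
        seen |= 1u << w;                        /* mark this value as seen */
    }
    return seen == 0xFFFFFFFFu;                 /* all 32 values occurred? */
}
```

Since 32 shifts produce 32 distinct 5-bit values, the windows are a permutation of 0~31, and the table simply inverts that permutation.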

 
thumb1-20161129.zip
Comment from Ling LI, over 8 years ago:

@Mr. Spock Yes, there is an error in thumb1_smulh(): an LSRS was mistakenly written as ASRS when converting the unsigned thumb1_umulh() to the signed version. Aside from your (0x7fffffff, -1) case, (-1, -1) also catches this bug.

Due to a recent change of the community user account I can't edit this post anymore, nor can I add an attachment in this reply. Here's a one-line patch to fix the problem:

diff -u thumb1-20161129/thumb1.c thumb1-20170223/thumb1.c
--- thumb1-20161129/thumb1.c    2016-11-29 11:40:40.000000000 +0800
+++ thumb1-20170223/thumb1.c    2017-02-23 14:16:10.287823700 +0800
@@ -176,7 +176,7 @@
         "muls   r2, r0" br
         "asrs   r1, #16" br
         "muls   r0, r1" br
-        "asrs   r3, #16" br
+        "lsrs   r3, #16" br
         "muls   r1, r4" br
         "adds   r3, r2" br
         "mov    r4, r12" br

Thank you for reporting this bug.
