Arm Cortex-M0 assembly programming tips and tricks

Jens Bauer
August 22, 2016
10 minute read time.

The snippets in this document are no real secrets; your C compiler probably uses them already.

But if you're writing assembly code, it's good to have some snippets ready when you need them.

Substitution for conditional instruction execution

The Cortex-M0 and Cortex-M0+ only have conditional execution of branch instructions.

But sometimes you need code that takes just as many clock cycles when a condition is true as when it is false.

You can achieve this by synchronizing the branches (e.g. by filling in nop instructions), but there might be another and sometimes better way.

The numbers in square brackets are clock-cycles.

Here is an example of some generic code that wraps a counter if it reaches a limit:

                    cmp                 r0,r1                       /* [1] check counter against limit */

                    blo                 in_range                    /* [3/1] jump forward if in range */

                    movs                r0,#0                       /* [1] wrap to 0 */

                    nop                                             /* [1] synchronize number of clock cycles used */

in_range:

The above snippet will take 4 clock cycles, no matter whether the branch is taken or not.

We can change the above code, so it uses no branch instructions at all:

                    cmp                 r0,r1                       /* [1] check counter against limit */

                    sbcs                r3,r3,r3                    /* [1] if r0 is less than r1, then r3 is -1, otherwise r3 is 0. */

                    ands                r0,r0,r3                    /* [1] if limit reached, wrap counter */

The above snippet takes 3 clock cycles instead of 4, so we save one clock cycle. Note that it overwrites r3.
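
To see why the branchless version works, here is a minimal C model of the three instructions. It is only a sketch: the function name wrap_counter and the explicit carry variable are illustrative assumptions, and a C compiler is free to generate different instructions from it.

```c
#include <stdint.h>

/* C model of: cmp r0,r1 / sbcs r3,r3,r3 / ands r0,r0,r3
   (wrap_counter is a hypothetical name, used for illustration only) */
uint32_t wrap_counter(uint32_t counter, uint32_t limit)
{
    /* cmp r0,r1: C = 1 when counter >= limit (no borrow), otherwise C = 0 */
    uint32_t carry = (counter >= limit) ? 1u : 0u;

    /* sbcs r3,r3,r3: r3 - r3 - (1 - C), so 0xFFFFFFFF when C = 0, else 0 */
    uint32_t mask = 0u - (1u - carry);

    /* ands r0,r0,r3: keep the counter while in range, clear it at the limit */
    return counter & mask;
}
```

The mask is all ones while the counter is below the limit, so the AND leaves the counter untouched; once the limit is reached the mask is zero and the counter wraps to 0.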

Checking if two values match each other could be done this way:

                    subs                r0,r0,r1                    /* [1] check if r0 and r1 are equal */

                    subs                r0,r0,#1                    /* [1] subtract 1 to get borrow if we have an exact match */

                    sbcs                r4,r4,r4                    /* [1] r4 is 0 if no match, -1 if we have a match */

The above example will probably be the most useful one. However, that's only the beginning.
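
The equality test can be modelled in C in the same way. This is a sketch under the assumption that the carry flag is tracked in an ordinary variable; equal_mask is a hypothetical name, not something from a library.

```c
#include <stdint.h>

/* C model of: subs r0,r0,r1 / subs r0,r0,#1 / sbcs r4,r4,r4
   (equal_mask is a hypothetical name, used for illustration only) */
uint32_t equal_mask(uint32_t a, uint32_t b)
{
    /* subs r0,r0,r1: the difference is 0 only when a == b */
    uint32_t diff = a - b;

    /* subs r0,r0,#1: borrows (C = 0) only when diff was 0 */
    uint32_t carry = (diff >= 1u) ? 1u : 0u;

    /* sbcs r4,r4,r4: 0xFFFFFFFF on a match, 0 otherwise */
    return 0u - (1u - carry);
}
```

A mask like this can then select between two values without a branch, for example (x & mask) | (y & ~mask).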

Often, you want to check if a value is between a start and start+length. Here is an example of how you could do that:

                    subs                r0,r0,r1                    /* [1] subtract low limit from value */

                    cmp                 r0,r2                       /* [1] check if value is outside length */

                    sbcs                r1,r1,r1                    /* [1] r1 is -1 if inside the limit, 0 if outside */

But you might want to check if your value is between two absolute values. This will add one instruction and cost an extra clock cycle:

                    subs                r2,r2,r1                    /* [1] make r2 relative to r1 */

                    subs                r0,r0,r1                    /* [1] subtract low limit from value */

                    cmp                 r0,r2                       /* [1] check if value is outside length */

                    sbcs                r1,r1,r1                    /* [1] r1 = -1 if inside both limits, 0 if outside */
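
A small C model of the four-instruction range check may help when working out which polarity the mask has. It is a sketch only; in_range_mask is a hypothetical name, and the range is treated as half-open (low limit included, high limit excluded), which is what the unsigned compare gives.

```c
#include <stdint.h>

/* C model of: subs r2,r2,r1 / subs r0,r0,r1 / cmp r0,r2 / sbcs r1,r1,r1
   (in_range_mask is a hypothetical name, used for illustration only) */
uint32_t in_range_mask(uint32_t value, uint32_t low, uint32_t high)
{
    uint32_t length = high - low;   /* subs r2,r2,r1: make the high limit relative */
    uint32_t offset = value - low;  /* subs r0,r0,r1: wraps to a huge number if value < low */

    /* cmp r0,r2: C = 1 when offset >= length (outside), C = 0 when inside */
    uint32_t carry = (offset >= length) ? 1u : 0u;

    /* sbcs r1,r1,r1: 0xFFFFFFFF when inside, 0 when outside */
    return 0u - (1u - carry);
}
```

Because a value below the low limit wraps around to a very large unsigned offset, a single unsigned compare covers both ends of the range.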

Sometimes you only need to check if a value is zero. This is quite easy:

                    cmp                 r0,#1                       /* [1] do we get a borrow if we subtract 1? */

                    sbcs                r1,r1,r1                    /* [1] r1 = 0 if r0 is nonzero, -1 if r0 is zero */

Make the snippets fit your needs by swapping the first and second arguments on the cmp instruction, etc.

If you already have a zero in one of your registers, you can use adcs to get an 'inverse' value which is either 0 or 1.
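
As a sketch of the adcs variant, the C model below assumes a register that is already known to hold zero and follows the zero test (cmp r0,#1) with an add-with-carry. The function name nonzero_flag and the variable zero_reg are illustrative assumptions.

```c
#include <stdint.h>

/* C model of: cmp r0,#1 followed by adcs on a register that holds 0
   (nonzero_flag and zero_reg are hypothetical names) */
uint32_t nonzero_flag(uint32_t value)
{
    /* cmp r0,#1: C = 1 when value >= 1 (no borrow), C = 0 when value is 0 */
    uint32_t carry = (value >= 1u) ? 1u : 0u;

    /* assumption: this register is already 0 before the adcs */
    uint32_t zero_reg = 0u;

    /* adcs adds the register to itself plus the carry: 0 + 0 + C */
    return zero_reg + zero_reg + carry;
}
```

Where sbcs produces 0 or -1, this produces 0 or 1 with the opposite sense: 1 for a nonzero input and 0 for zero.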

It does not end here. Make multiple blocks of the above, shift out the resulting value to the carry flag, combine with AND, OR and EOR, then bake at 45 degrees Celsius for 13 minutes. You may need to use adds, cmn and adcs in some cases. If you know that one of the registers is zero, you can benefit very much from that.

Load / Store shortcuts

On the Cortex-M0, you do not have the sophisticated post-increment/post-decrement addressing on ldr instructions.

Suppose you really, really need an LDR instruction that reads a 32-bit value and advances the source pointer to the next 32-bit value in the same operation, either because you're running out of space or need to do very many operations in very few clock cycles. The answer is simple, but you may not have thought about it:

                    ldmia               r2!,{r1}                    /* [2] read r1 from address r2 and advance r2. */

The above operation takes exactly the same number of clock cycles as the indexed LDR instruction.

Performance in trivial array data processing can also be improved.

The trivial way to do it is to read a value (byte, halfword or word), then increment the address pointer, process the value, decrement a counter or compare the address to an end-pointer and finally branch back.

On Cortex-M0, you will need to split the value-reading and address pointer increment in two (on Cortex-M3, you can use a post-update for incrementing the address in the load or store instruction).

The code I came up with modifies the length counter and the addresses involved.

The length counter is transformed into a negative offset. The address pointers will be modified to point to the end of the buffer instead of pointing to the beginning:

process:            lsls                r1,r1,#1                    /* [1] transform halfword count into a byte length */

                    adds                r0,r0,r1                    /* [1] point source address to the end of the array */

                    rsbs                r1,r1,#0                    /* [1] convert the length to a negative index */

loop:               ldrh                r3,[r0,r1]                  /* [2] read a value from the array */

                    /* (process value here) */

                    adds                r1,r1,#2                    /* [1] advance index, update condition codes */

                    bne                 loop                        /* [3/1] ... keep going until index wraps to 0 */

As you see above, there are only 3 instructions inside the loop: fetch value, update index and the branch.

Normally on Cortex-M0, you would use 4 instructions, which would require one extra clock cycle per iteration.

So a block copy can quickly be reduced to spending 4/5 of the CPU-time it used to spend (that's a 20% reduction).

The above loop can be used for a copy operation by just adding a strh r3,[r2,r1] inside the loop and adds r2,r2,r1 below the lsls instruction.

(For a real-world example, see the lz4 decompressor; there's a link at the bottom of this page.)
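
The same negative-index idea can be written in C, which may make the pointer arithmetic easier to follow. This is a sketch, not the assembly loop itself: copy_halfwords is a hypothetical name, and unlike the assembly version (which tests the index only after the first pass), the while loop here also copes with a count of zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 'count' halfwords using one shared negative index
   (copy_halfwords is a hypothetical name, used for illustration only) */
void copy_halfwords(uint16_t *dst, const uint16_t *src, size_t count)
{
    /* point both addresses at the end of their buffers */
    const uint16_t *src_end = src + count;
    uint16_t *dst_end = dst + count;

    /* convert the length to a negative index */
    ptrdiff_t index = -(ptrdiff_t)count;

    /* the index counts up towards 0, so reaching 0 is the only loop test */
    while (index != 0) {
        dst_end[index] = src_end[index];
        index++;
    }
}
```

Only the index changes inside the loop; both end pointers stay fixed, which is what removes the separate pointer increments.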

Synchronization

The code in this section is intended to be used from within, for instance, a timer interrupt.

Imagine that you have a timer interrupt that changes a pin automatically, and that you need to read or write GPIO pin values at a certain number of clock cycles.

You then take the timer's counter value and wait until it holds a certain value. Unfortunately, a branch in a loop will not synchronize exactly, but if you add a few calculations, it becomes possible.

If you ever need to synchronize code to, for instance, a timer counter that runs at the same speed as the CPU clock, the following might come in handy:

                    subs                r0,r0,#8                    /* [1] adjust clock cycle count to compensate for used clock cycles in this code */

                    lsls                r1,r0,#1                    /* [1] get clock cycles * 2 */

                    movs                r6,#6                       /* [1] two low bits shifted up by 1 to make it even */

                    ands                r1,r1,r6                    /* [1] mask clock cycle count */

                    eors                r1,r1,r6                    /* [1] invert clock cycle count */

                    add                 pc,pc,r1                    /* [3] jump forward 0, 2, 4 or 6 bytes, skipping 0 to 3 of the subs instructions below */

                    nop                                             /* [0] never executed */

                    subs                r0,r0,#1                    /* [1] synchronize and decrement counter */

                    subs                r0,r0,#1                    /* [1] synchronize and decrement */

                    subs                r0,r0,#1                    /* [1] synchronize and decrement */

delayLoop:

                    subs                r0,r0,#4                    /* [1] decrement counter */

                    bhs                 delayLoop                   /* [3/1] go round loop until fully synchronized */

Unfortunately, the above code takes a minimum of 8 clock cycles, which must be taken into account.

If you load r0 with a value, remember to include those clock cycles in the adjustment as well. You could fold the subtraction into the value you load into r0.
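
The computed jump is the least obvious part, so here is a small C model of just the offset calculation. It is a sketch with hypothetical names (jump_offset, single_steps); the argument stands for the value in r0 after the initial subs r0,r0,#8, and it assumes that pc reads as the address of the first subs r0,r0,#1 when the add executes.

```c
#include <stdint.h>

/* Byte offset that is added to pc.
   Mirrors: lsls r1,r0,#1 / movs r6,#6 / ands r1,r1,r6 / eors r1,r1,r6 */
uint32_t jump_offset(uint32_t cycles)
{
    uint32_t r1 = cycles << 1;  /* each Thumb instruction here is 2 bytes */
    r1 &= 6u;                   /* keep the two low bits of the cycle count */
    r1 ^= 6u;                   /* invert: more leftover cycles means a smaller skip */
    return r1;                  /* 0, 2, 4 or 6 */
}

/* Number of single-cycle "subs r0,r0,#1" instructions that still execute */
uint32_t single_steps(uint32_t cycles)
{
    return 3u - jump_offset(cycles) / 2u;
}
```

In this model the number of single steps equals the cycle count modulo 4, so the count that enters delayLoop is always a multiple of 4.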

Small snippets

Find closest container value:

                    subs                r1,r0,#1                    /* [1] subtract 1 (we now have 0,1,2,3 instead of 1,2,3,4) */

                    lsrs                r2,r1,#1                    /* [1] get a duplicate and shift bit 1 into bit 0. r2 is now 0 or 1 */

                    orrs                r1,r1,r2                    /* [1] mix the two values, we now have 0, 1 or 3 */

                    adds                r1,r1,#1                    /* [1] increment result, we now have 1, 2 or 4 */

This 'rounds up to the nearest power of two', but only for the values 1 to 4.

It is useful when you have a unit (for example a 24-bit pixel) and need to find the smallest container for it (byte, halfword or word), for instance when you are trying to find a buffer size for a line of pixels.

It can also be used for other kinds of auto-alignments.
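
The four instructions translate directly to C, which makes it easy to check the result for every valid input. This is a sketch; container_bytes is a hypothetical name, and the function is only meaningful for inputs 1 to 4.

```c
#include <stdint.h>

/* C model of: subs r1,r0,#1 / lsrs r2,r1,#1 / orrs r1,r1,r2 / adds r1,r1,#1
   (container_bytes is a hypothetical name; valid for inputs 1 to 4 only) */
uint32_t container_bytes(uint32_t unit_bytes)
{
    uint32_t r1 = unit_bytes - 1u;  /* 0, 1, 2 or 3 */
    uint32_t r2 = r1 >> 1;          /* copy bit 1 down into bit 0: 0 or 1 */
    r1 |= r2;                       /* 0, 1 or 3 */
    return r1 + 1u;                 /* 1, 2 or 4 */
}
```

A 3-byte (24-bit) pixel therefore maps to a 4-byte container, while 1, 2 and 4 bytes map to themselves.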

Other optimizations

Think outside the box. Work backwards. Write code that is not just 'traditional'.

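Here we have a problem: the immediate value is too large for the three-operand adds instruction, which only accepts immediates from 0 to 7 (the 8-bit immediate form requires the destination and the first operand to be the same register, and your assembler may or may not select it for you):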

                    mov                 r0,r12                      /* [1] get base address */

                    adds                r0,r0,#120                  /* [1] add offset */

...It can be fixed this way:

                    movs                r0,#120                     /* [1] get offset */

                    add                 r0,r0,r12                   /* [1] add base address */

IanB from the forum at LPCware contributed these:

AND immediate:

                    movs                r0,#0b01000000              /* [1] load the mask */

                    ands                r0,r0,r2                    /* [1] get the isolated bit(s) */

Bit-testing (note: GCC does this already, which was why I didn't mention it in the original document, but Ian is right, I should mention it):

                    lsrs                r0,r0,#7                    /* [1] test bit 6 (it's shifted to the carry bit) */

                    sbcs                r0,r0,r0                    /* [1] expand the bit: r0 = 0 if bit 6 was set, -1 if clear (or you could use bcc/bcs) */

Alternatively, you could shift left:

                    lsls                r3,r0,#1                    /* [1] test bit 31 (it's shifted to the carry bit) */

                    sbcs                r3,r3,r3                    /* [1] expand the bit: r3 = 0 if bit 31 was set, -1 if clear (or you could use bcc/bcs) */
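
The polarity of the expanded bit is easy to get backwards, so here is a C model of the shift-right variant. It is a sketch; bit6_clear_mask is a hypothetical name chosen to reflect that the mask is all ones when the bit is clear.

```c
#include <stdint.h>

/* C model of: lsrs r0,r0,#7 / sbcs r0,r0,r0
   (bit6_clear_mask is a hypothetical name, used for illustration only) */
uint32_t bit6_clear_mask(uint32_t value)
{
    /* lsrs r0,r0,#7: the last bit shifted out, bit 6, lands in the carry flag */
    uint32_t carry = (value >> 6) & 1u;

    /* sbcs r0,r0,r0: 0 when the bit was set, 0xFFFFFFFF when it was clear */
    return 0u - (1u - carry);
}
```

If you need the opposite polarity, inverting the result with mvns costs one more cycle.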

Use UXTB and UXTH for isolating the low 8 bits and low 16 bits of a word:

                    uxtb                r0,r0                       /* [1] synonym for AND r0,r0,#0x000000ff */

                    uxth                r1,r1                       /* [1] synonym for AND r1,r1,#0x0000ffff */

If you're about to run out of registers

Move base-addresses to high registers, then load the offsets into the low registers and add the base addresses.

Example:

                    lsls                r0,r1,#4                    /* [1] multiply by structure size */

                    add                 r0,r0,r12                   /* [1] add array base address */

                    ldm                 r0,{r0-r3}                  /* [5] get flags, data-pointer, I/O address and register value */

                    str                 r3,[r2]                     /* [2] output new value on GPIO pins */

Notice that the LDM instruction has no write-back (no '!' after the base register) in this case; this is because r0 is in the list of destination registers.

Making better use of the high registers

I've been asked about how to use the high registers (r8..r12) more efficiently.

To find out, let's check the Arm Information Center and see which instructions support the high registers.

According to the documentation, I've found the following instructions support the high registers:

  • add rD,rA,rB
  • cmp rA,rB
  • mov rD,rA
  • bx  rA
  • blx rA
  • msr sprD,rA
  • mrs rD,sprA

There are no high-register instructions that support immediate data.

It is also important to remember that LR can be pushed onto the stack.

The above list suggests that you can first of all save values in the high registers for inner loops, in order to avoid loading and storing in memory.

It also suggests that you use the CMP instruction to compare addresses or counters.

As the previous example shows, you can use ADD for having a base address in a high register and then add the base address onto the index-register.

You can use the high registers for storing jump destinations (e.g. pre-calculated addresses) or subroutine addresses, so you do not have to use a low register for that.

You can use the high registers to save values of special purpose registers temporarily, so you can restore them later.

Example:

                    ldr                 r0,=source_address          /* [2] point to source starting address */

                    ldr                 r1,=source_end              /* [2] point to source ending address */

                    mov                 r8,r1                       /* [1] transfer to high register */

                    ldr                 r1,=destination_address     /* [2] point to destination address */

copy_l:             ldmia               r0!,{r2}                    /* [2] read a 32-bit word, advance source pointer */

                    stmia               r1!,{r2}                    /* [2] store the 32-bit word, advance destination pointer */

                    cmp                 r0,r8                       /* [1] did we copy everything yet? */

                    blo                 copy_l                      /* [3/1] if not, keep copying */

Imagine that in the above example we're restricted to using only r0, r1 and r2, because r3...r7 hold important values that are needed right after the copying is done.

Related articles

  • Writing your own startup code for Cortex-M
  • A fairly quick Count Leading Zeroes for Cortex-M0
  • Sean Dunlevy
    Sean Dunlevy over 6 years ago in reply to Jens Bauer

    I've only just noted that the CMP is jolly useful for 64-bit maths. For adding a pair of low-registers to a pair of hi-registers, the CMP conveniently sets up the C flag thus:

    add r9,r1
    cmp r1,r9
    adcs r0,r0,#00
    add r8,r0

  • Jens Bauer
    Jens Bauer over 6 years ago in reply to infiniteneslives

    Sorry for the late reply (I no longer get notifications when people write comments and I rarely log in here since the changes).

    As you found out, yes, MOV can use high registers on Cortex-M0.

    -That is an error in the documentation (I requested a correction a few years ago, but I think it never made it there).

    ldr r10,=0x12345678 can not be executed on a Cortex-M0.

    Please make sure that you in your assembly-source file specify that you're building for a Cortex-M0, when you're testing:

        .cpu    cortex-m0

    -Then you should get an error if you're building using the GNU assembler.

    My only advice regarding this is to ... don't assume the documentation is right; try things out on a real microcontroller and see whether things work the way you'd expect them to. ;)

    Remember: Expect that no 32-bit instructions will work on a Cortex-M0. If an instruction is using a high register, it's likely it will be 32-bit (to be sure, look in the ARMv7 architecture documentation; this document does seem pretty solid; likely because both assemblers and disassemblers require this document in order to produce correct assembly/disassembly).

    -If you find an error, please report to ARM immediately, so that the correction can make it to the next update of the documentation.

    I know I've already mentioned it, but the only instructions I know of, that allows the use of the high registers are:

    ADD, CMP and MOV.

    -Thus if you want to save the high registers on the stack, it seems you need to MOV them to a low register before pushing them ... and then pop them and MOV them back to the high registers before leaving your subroutine.

  • infiniteneslives
    infiniteneslives over 8 years ago in reply to infiniteneslives

    heheh even the valid examples in the docs list R8 as viable so the docs contradict themselves:

    Restrictions

    In these instructions, Rd, and Rm must only specify R0-R7.

    Example:

    MOV   R8, SP          ; Write value of stack pointer to R8
  • infiniteneslives
    infiniteneslives over 8 years ago

    Late to the party here, but your list of instructions which have access to high registers doesn't agree with ARM docs..

    Specifically MOV doesn't have access to high registers on cortex M0 (listed in restrictions):

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/BABEFHAE.html


    Your list also excludes the fact that LDR, PC-relative has access to high registers too!

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/BABEFHAE.html

    I ran a quick test trying to compile MOV with r8 and it did compile with arm_none_eabi_gcc ran a quick little test passing values into and out of r8 with MOV and it seems to work!  So maybe the docs have changed erroneously since this was posted..?

    Looking at my C disassembly, r8 is used all over the place as NOP (mov r8, r8).  So seems high registers do work just fine despite the fact the docs say MOV is restricted to r0-r7??  Seems pretty lame for high registers to be excluded for MOV instruction...

  • Sean Dunlevy
    Sean Dunlevy over 10 years ago

    Well CAN you speed up code by ensuring that the B instruction is the second 16-bits of the 32-bit read so you don't waste an instruction in the cache - I presume a B empties the pipeline.

    It also has a Nordic Semiconductor Bluetooth chip that ALSO has, you guessed it, a Cortex M0. I'm looking at CELP decompression using the Bluetooth Chip when data comes from the memory-stick (so no bluetooth is being used). It's for audiobooks so Happily, I can use an exhaustive search of the fixed & variable codebooks since encoding doesn't have to be realtime. Well, presuming the compression is done on a decent PC then it will be real-time.

    The 2 projects I have put forward to the BBC are:

    Audiobooks - kids are more likely to listen to a book than sit down and read it. For one thing, it can be done anywhere and headphones help eliminate external distractions and for another, these days, getting kids into books is HARD so this is seen as an easy way in.

    Language Lab - Bluetooth broadcasts speech to whole class & teacher can listen to a single child. It means schools (with their ever tightening budgets) can use a standard classroom AS a language lab.

    I'm also looking to hack the Sandisk memory sticks because they contain a nice ARM7TDMI of which I have several games worth of experience (Gameboy Advance, Gamebody DS). If The CELP decode can be done by the stick itself, it leaves the other CPU(s) free.

    Since the BBC is ordering 1 million+ of these machines, we can ask for certain modifications.

    If anyone would like to comment on the above, I would be most appreciative. Thank you all.
