The snippets in this document are no real secrets; your C compiler probably uses them already.
But if you're writing assembly code, it's good to have some snippets ready when you need them.
The Cortex-M0 and Cortex-M0+ only have conditional execution of branch instructions.
But sometimes you need code that takes just as many clock cycles when a condition is true as when it is false.
You can achieve this by synchronizing the branches (e.g. by filling in nop instructions), but there might be another and sometimes better way.
The numbers in square brackets are clock-cycles.
Here is an example of some generic code that wraps a counter if it reaches a limit:
cmp r0,r1 /* [1] check counter against limit */
blo in_range /* [3/1] jump forward if in range */
movs r0,#0 /* [1] wrap to 0 */
nop /* [1] synchronize number of clock cycles used */
in_range:
The above snippet will take 4 clock cycles, no matter whether the branch is taken or not.
We can change the above code, so it uses no branch instructions at all:
cmp r0,r1 /* [1] check counter against limit */
sbcs r3,r3,r3 /* [1] if r0 is less than r1, then r3 is -1, otherwise r3 is 0 */
ands r0,r0,r3 /* [1] if limit reached, wrap counter */
In the above snippet, we save one clock cycle.
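To check the arithmetic on a desktop machine, here is a minimal C sketch of what the branch-free version computes. The function names borrow_mask and wrap_counter are made up for this illustration, and a C compiler may well emit different instructions than the ones shown above.

```c
#include <stdint.h>

/* Model of "cmp a,b" followed by "sbcs r,r,r":
   all ones (-1) if a < b (the compare produced a borrow), otherwise 0.
   borrow_mask is a hypothetical name, not taken from the article. */
static uint32_t borrow_mask(uint32_t a, uint32_t b)
{
    return (uint32_t)0 - (uint32_t)(a < b);
}

/* Model of the final "ands": keep the counter while it is below the
   limit, force it to 0 once the limit is reached or exceeded. */
static uint32_t wrap_counter(uint32_t counter, uint32_t limit)
{
    return counter & borrow_mask(counter, limit);
}
```

The mask is either all ones or all zeros, so the AND either leaves the counter untouched or clears it, with no branch involved.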
Checking if two values match each other could be done this way:
subs r0,r0,r1 /* [1] check if r0 and r1 are equal */
subs r0,r0,#1 /* [1] subtract 1 to get borrow if we have an exact match */
sbcs r4,r4,r4 /* [1] r4 is 0 if no match, -1 if we have a match */
The above example will probably be the most useful one. However that's only the beginning.
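As a rough illustration of why such a mask is useful, the following C sketch models the three instructions and then uses the mask to pick one of two values without branching. The names equal_mask and select_if_equal are hypothetical and only serve this example.

```c
#include <stdint.h>

/* Model of "subs r0,r0,r1 / subs r0,r0,#1 / sbcs r4,r4,r4":
   -1 if a == b, otherwise 0. */
static uint32_t equal_mask(uint32_t a, uint32_t b)
{
    uint32_t diff = a - b;                    /* 0 only on an exact match */
    uint32_t borrow = (uint32_t)(diff < 1u);  /* subtracting 1 borrows only from 0 */
    return (uint32_t)0 - borrow;              /* expand the borrow to 0 or -1 */
}

/* Branch-free select: returns if_equal when a == b, otherwise returns
   the value passed as otherwise. */
static uint32_t select_if_equal(uint32_t a, uint32_t b,
                                uint32_t if_equal, uint32_t otherwise)
{
    uint32_t mask = equal_mask(a, b);
    return (if_equal & mask) | (otherwise & ~mask);
}
```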
Often, you want to check if a value is between a start and start+length. Here is an example on how you could do that:
subs r0,r0,r1 /* [1] subtract low limit from value */
cmp r0,r2 /* [1] check if value is outside length */
sbcs r1,r1,r1 /* [1] r1 is -1 if inside the limit, 0 if outside */
But you might want to check if your value is between two absolute values. This will add one instruction and cost an extra clock cycle:
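subs r2,r2,r1 /* [1] make r2 relative to r1 */
subs r0,r0,r1 /* [1] subtract low limit from value */
cmp r0,r2 /* [1] check if value is outside the range */
sbcs r1,r1,r1 /* [1] r1 = -1 if inside both limits, 0 if outside */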
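The trick behind both range checks is that an unsigned subtraction wraps around, so one compare covers the lower and the upper bound at once. A small C sketch of that idea follows; the function names are hypothetical, and the second function assumes low <= high.

```c
#include <stdint.h>

/* -1 if start <= value < start + length, otherwise 0.
   If value is below start, the subtraction wraps to a very large
   number, which then fails the "less than length" test as well. */
static uint32_t in_window_mask(uint32_t value, uint32_t start, uint32_t length)
{
    uint32_t relative = value - start;
    return (uint32_t)0 - (uint32_t)(relative < length);
}

/* -1 if low <= value < high, otherwise 0 (assumes low <= high).
   This is the window check with the length computed first. */
static uint32_t in_range_mask(uint32_t value, uint32_t low, uint32_t high)
{
    return in_window_mask(value, low, high - low);
}
```

Note that with cmp followed by sbcs, the all-ones result corresponds to the "borrow" case, which here means the value is inside the range.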
Sometimes you only need to check if a value is zero. This is quite easy:
cmp r0,#1 /* [1] do we get a borrow if we subtract 1? */
sbcs r1,r1,r1 /* [1] r1 = 0 if r0 is nonzero, -1 if r0 is zero */
Make the snippets fit your needs by swapping the first and second arguments on the cmp instruction, etc.
If you already have a zero in one of your registers, you can use adcs to get an 'inverse' value which is either 0 or 1.
It does not end here. Make multiple blocks of the above, shift out the resulting value to the carry flag, combine with AND, OR and EOR, then bake at 45 degrees Celsius for 13 minutes. You may need to use adds, cmn and adcs in some cases. If you know that one of the registers is zero, you can benefit very much from that.
On the Cortex-M0, you might not have the sophisticated post-increment/post-decrement on ldr instructions.
Suppose you really, really need a LDR instruction that reads a 32-bit value and advances the source pointer to the next 32-bit value in the same operation, either because you're running out of space or need to do very many operations in very few clock cycles. The answer is simple, but you may not have thought about it:
ldmia r2!,{r1} /* [2] read r1 from address r2 and advance r2. */
The above operation takes exactly the same number of clock cycles as the indexed LDR instruction.
Performance in trivial array data processing can also be improved.
The trivial way to do it, is to read a value (byte, halfword or word), then increment the address pointer, process the value, decrement a counter or compare the address to an end-pointer and finally branch back.
On Cortex-M0, you will need to split the value-reading and the address pointer increment in two (on Cortex-M3, you can use a post-update for incrementing the address in the load or store instruction).
The code I came up with modifies the length counter and the addresses involved.
The length counter is transformed into a negative offset. The address pointers will be modified to point to the end of the buffer instead of pointing to the beginning:
process: lsls r1,r1,#1 /* [1] transform counter into a length */
adds r0,r0,r1 /* [1] point source address to the end of the array */
rsbs r1,r1,#0 /* [1] convert the length to a negative index */
loop: ldrh r3,[r0,r1] /* [2] read a value from the array */
/* (process value here) */
adds r1,r1,#2 /* [1] advance index, update condition codes */
bne loop /* [3/1] ... keep going until index wraps to 0 */
As you see above, there are only 3 instructions inside the loop: fetch value, update index and the branch.
Normally on Cortex-M0, you would use 4 instructions, which would require one extra clock cycle per iteration.
So a block copy can quickly be reduced to spending 4/5 of the CPU-time it used to spend (that's a 20% reduction).
The above loop can be used for a copy operation by just adding a strh r3,[r2,r1] inside the loop and adds r2,r2,r1 below the lsls instruction.
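The same idea can be written in C to see how the single negative index replaces separate pointer increments and a counter. This is only a sketch: the function names are made up, and like the assembly loop it assumes the element count is nonzero.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy count halfwords with a single index that runs from -count up
   to 0, mirroring the ldrh/strh/adds/bne loop. Like the assembly loop,
   this sketch assumes count is nonzero. */
static void copy_halfwords(uint16_t *dst, const uint16_t *src, size_t count)
{
    const uint16_t *src_end = src + count;  /* source pointer moved to the end */
    uint16_t *dst_end = dst + count;        /* destination pointer moved to the end */
    ptrdiff_t index = -(ptrdiff_t)count;    /* negative index, counts up to 0 */

    do {
        dst_end[index] = src_end[index];    /* load and store via base + index */
        index++;                            /* advance index */
    } while (index != 0);                   /* stop when the index reaches 0 */
}

/* Hypothetical self-check: returns 1 if a 4-element copy matches. */
static int copy_halfwords_demo(void)
{
    static const uint16_t src[4] = { 10, 20, 30, 40 };
    uint16_t dst[4] = { 0, 0, 0, 0 };
    size_t i;

    copy_halfwords(dst, src, 4);
    for (i = 0; i < 4; i++) {
        if (dst[i] != src[i]) {
            return 0;
        }
    }
    return 1;
}
```

Because the index doubles as the loop counter, the end-of-loop test comes for free from the increment, which is where the saved instruction comes from.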
(For real-world example, see the lz4 decompressor; there's a link at the bottom of this page)
The code in this section is intended to be used from within for instance a timer interrupt.
Imagine that you have a timer interrupt change a pin automatically, and at a certain number of clock cycles, you need to read or write GPIO pin values.
You then take the timer's counter value and wait until it holds a certain value. Unfortunately, a branch in a loop will not synchronize exactly, but if you add a few calculations, it becomes possible.
If you ever need to synchronize code to for instance a timer counter, which runs the same speed as the CPU clock, the following might come in handy:
subs r0,r0,#8 /* [1] adjust clock cycle count to compensate for used clock cycles in this code */
lsls r1,r0,#1 /* [1] get clock cycles * 2 */
movs r6,#6 /* [1] two low bits shifted up by 1 to make it even */
ands r1,r1,r6 /* [1] mask clock cycle count */
eors r1,r1,r6 /* [1] invert clock cycle count */
add pc,pc,r1 /* [3] jump forward by the computed offset (0, 2, 4 or 6 bytes) */
nop /* [0] never executed */
subs r0,r0,#1 /* [1] synchronize and decrement counter */
subs r0,r0,#1 /* [1] synchronize and decrement */
delayLoop:
subs r0,r0,#4 /* [1] decrement counter */
bhs delayLoop /* [3/1] go round loop until fully synchronized */
Unfortunately, the above code takes a minimum of 8 clock cycles, which must be taken into account.
Remember if you load r0, to also include those clock-cycles in the adjustment. You could include the subtraction in the value you load into r0.
Find closest container value:
subs r1,r0,#1 /* [1] subtract 1 (we now have 0,1,2,3 instead of 1,2,3,4) */
lsrs r2,r1,#1 /* [1] get a duplicate and shift bit 1 into bit 0. r2 is now 0 or 1 */
orrs r1,r1,r2 /* [1] mix the two values, we now have 0, 1 or 3 */
adds r1,r1,#1 /* [1] increment result, we now have 1, 2 or 4 */
This 'rounds up to the nearest power of two', but only for the values 1 to 4.
It is useful, when you have a unit (that could be a 24-bit pixel), and you need to find out the smallest container word for it (byte, halfword or word).
-For instance if you are trying to find a buffer size for a line of pixels.
It can also be used for other kinds of auto-alignments.
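For reference, here is the same four-step calculation as a small C function, with the intermediate values noted in the comments. The name container_size is hypothetical, and the function is only meant for inputs 1 to 4, as in the assembly version.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest container (1, 2 or 4 bytes) for a unit of 1 to 4 bytes. */
static uint32_t container_size(uint32_t unit_bytes)
{
    uint32_t n;

    assert(unit_bytes >= 1u && unit_bytes <= 4u);
    n = unit_bytes - 1u;  /* 0, 1, 2, 3 */
    n |= n >> 1;          /* 0, 1, 3, 3 */
    return n + 1u;        /* 1, 2, 4, 4 */
}
```

A 24-bit pixel (3 bytes) therefore lands in a 4-byte container.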
Think outside the box. Work backwards. Write code that is not just 'traditional'.
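Here we have a problem when the immediate value is too large for the adds instruction (the three-operand form with separate source and destination registers only accepts 0 to 7; the two-operand form accepts 0 to 255):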
mov r0,r12 /* [1] get base address */
adds r0,r0,#120 /* [1] add offset */
...It can be fixed this way:
movs r0,#120 /* [1] get offset */
add r0,r0,r12 /* [1] add base address */
IanB from the forum at LPCware contributed these:
AND immediate:
movs r0,#0b01000000 /* [1] load the mask */
ands r0,r0,r2 /* [1] get the isolated bit(s) */
Bit-testing (note: GCC does this already, which was why I didn't mention it in the original document, but Ian is right, I should mention it):
lsrs r0,r0,#7 /* [1] test bit 6 (it's shifted to the carry bit) */
sbcs r0,r0,r0 /* [1] expand the bit value (or you could use bcc/bcs) */
Alternatively, you could shift left:
lsls r3,r0,#1 /* [1] test bit 31 (it's shifted to the carry bit) */
sbcs r3,r3,r3 /* [1] expand the bit value (or you could use bcc/bcs) */
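One detail worth spelling out is the polarity of the expanded value: SBC subtracts the inverted carry, so the result is all ones when the tested bit is clear and zero when it is set. The C sketch below models that; the name bit_clear_mask is made up for this example, and bit must be in the range 0 to 31.

```c
#include <stdint.h>

/* Model of shifting a bit into the carry flag and then executing
   "sbcs r,r,r": the result is r - r - (1 - C) = -(1 - C).
   Returns -1 if the tested bit is clear, 0 if it is set. */
static uint32_t bit_clear_mask(uint32_t value, unsigned bit)
{
    uint32_t carry = (value >> bit) & 1u;  /* the bit that lands in C */
    return (uint32_t)0 - (1u - carry);
}
```

If you need the opposite polarity (all ones when the bit is set), invert the result or use the bcc/bcs route instead.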
Use UXTB and UXTH for isolating the low 8 bits and the low 16 bits of a word:
uxtb r0,r0 /* [1] synonym for AND r0,r0,#0x000000ff */
uxth r1,r1 /* [1] synonym for AND r1,r1,#0x0000ffff */
Move base-addresses to high registers, then load the offsets into the low registers and add the base addresses.
Example:
lsls r0,r1,#4 /* [1] multiply by structure size */
add r0,r0,r12 /* [1] add array base address */
ldm r0,{r0-r3} /* [5] get flags, data-pointer, I/O address and register value */
str r3,[r2] /* [2] output new value on GPIO pins */
Notice that the LDM instruction is written without writeback in this case (ldm r0, not ldmia r0!); this is because r0 is in the list of destination registers.
I've been asked about how to use the high registers (r8..r12) more efficiently.
To find out, let's check the Arm Information Center and see which instructions support the high registers.
According to the documentation, the following instructions support the high registers:
add Rd,Rd,Rm
cmp Rn,Rm
mov Rd,Rm
bx Rm
blx Rm
mrs Rd,spec_reg and msr spec_reg,Rn
There are no high-register instructions that support immediate data.
It is also important to remember that LR can be pushed onto the stack.
The above list suggests that you can first of all save values in the high registers for inner loops, in order to avoid loading and storing in memory.
It also suggests that you use the CMP instruction to compare addresses or counters.
As the previous example shows, you can use ADD for having a base address in a high register and then add the base address onto the index-register.
You can use the high registers for storing jump destinations (eg. pre-calculated addresses) or subroutine addresses, so you do not have to use a low register for that.
You can use the high registers to save values of special purpose registers temporarily, so you can restore them later.
ldr r0,=source_address /* [2] point to source starting address */
ldr r1,=source_end /* [2] point to source ending address */
mov r8,r1 /* [1] transfer to high register */
ldr r1,=destination_address /* [2] point to destination address */
copy_l: ldmia r0!,{r2} /* [2] read a 32-bit word, advance source pointer */
stmia r1!,{r2} /* [2] store the 32-bit word, advance destination pointer */
cmp r0,r8 /* [1] did we copy everything yet? */
blo copy_l /* [3/1] if not, keep copying */
Imagine that in the above example we're restricted to using only r0, r1 and r2, because r3...r7 hold important values that are needed right after the copying is done.
Hi muffin and welcome to the community!
As far as I know, the Cortex-M0 reads 32 bits at a time. However, all instruction timings are fixed, so normally you can't really get any performance improvements by swapping two instructions.
Turning scoreboarding off is not possible.
It is, however, possible to gain a little extra from swapping/arranging instructions - you may gain some extra free registers, thus you can sometimes benefit from using LDM or STM in place of LDR / STR and then use shifts to extract byte or halfword values.
Shifts on all Cortex-M are done in a single clock cycle, so generally speaking, you won't be able to "optimize" shifts.
-However, you can of course still load halfwords and bytes from different addresses, just beware that maybe some day there will be a Big Endian microcontroller, so if your code needs to be portable, you might want to make two versions of this load.
(In general I do not recommend loading from a different address in place of shifting, but in rare cases, that's the only option).
Regarding branching, the same applies; the instruction timing is fixed on the Cortex-M0 (but not on the Cortex-M3/M4/M7).
You can, however, write your code so it uses the same timing on the Cortex-M0, and schedule the instructions for the Cortex-M3/M4/M7, so the code will run as fast as possible. Remember that all Cortex-M0 instructions are available on the other Cortex-M cores.
Does the Cortex-M0 use a 16-bit bus or a 32-bit bus, i.e. are instructions read in pairs? The reason I ask is from SH2 experience: placing memory fetches in the right place saved a pipeline stall. In addition, can scoreboarding be switched off so that, after a B or BL to a new PC, the next instruction is still executed? This proved a BIG improvement in the dhrystones/MIP power. I rewrote the GNU SH2 Felide constructors and, on games like Tombraider on the Saturn, used that trick and optimized shifts (an SHL17 in C wasn't 17 SHL instructions, it was a swap and a shift). A wide range of tricks were used, to the extent that in the whole code, not a single NOP was found.
I appreciate this doesn't follow the Cortex standard, but a switch that disables scoreboarding could be added. I'm thinking here specifically of the BBC Microbit, on which I am going to attempt a high-quality CELP codec that DOES extensively search both fixed & variable codebooks (i.e. non-realtime) to give optimal results for the decoder. It's for the inclusion of audiobooks on the system.
Thank you all for verifying the SBC flag & carry discrepancy on the M0 core and clearing things up.
I have submitted an errata report on the relevant documents and it has been logged internally. That doesn't mean I can predict a date for a correction to be issued but the document will be updated at some stage.
Thanks to all who spotted it!
I was hoping markj1000 would since they found it and if they say about the wearing out of the scalp then I'm sure they'll get a suitable apology as well as the thanks.