This is the beginning of a 5-part series of articles on how to write some quick integer and fixed point math in assembly language for the Cortex-M3, Cortex-M4 and Cortex-M7 microcontrollers.
This article will familiarize you with basic 32-bit math operations, such as addition, subtraction, multiplication, division, bitwise AND, bitwise OR, bitwise Exclusive OR and bit-shifting.
It will also introduce you to reading and writing values from and to memory. A few other instructions will be dropped in, without focusing too much on what they do; the comments should provide enough information to give you an idea of what they do, in case you do not know already - otherwise just ask in the comment section.
The following instructions are covered in this document:
For these articles, you will need to be familiar with your assembler in advance.
That means: You should be able to write a stand-alone .s file that can be assembled and linked with, for instance, C files.
If you're already familiar with writing inline-assembly in your C-sources, you can of course do that as well, but for clarity, I've chosen to write the article targeting pure assembly language.
All microcontrollers (well, almost all) have a set of registers. On Cortex-M, there are 16 core registers, and they are all 32 bits wide.
These are:
R0...R12, SP, LR and PC.
SP is also known as R13. SP is an abbreviation for Stack Pointer.
LR is also known as R14. LR is an abbreviation for Link Register.
PC is also known as R15. PC is an abbreviation for Program Counter.
SP and PC are treated in a special way. Some operations are not allowed on these registers, since they would not make sense.
LR is also treated in a special way, however, the only thing different is that LR gets slightly more functionality than the other registers.
Since the registers are 32-bit wide, they can hold values from 0 ... +4294967295 or signed values from -2147483648 .. +2147483647.
Bits are numbered 0 to 31, where bit 0 is the least significant bit and bit 31 is the most significant bit.
For signed values, bit 31 is used to determine if the value is positive (0) or negative (1).
If bit 0 is set, the value is odd; if bit 0 is clear, the value is even.
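To make the ranges and bit numbers above a little more concrete, here is a small sketch in C, used here only as a model of what a 32-bit register holds. The helper names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, named for this illustration only. */

/* Bit 31 set means the value is negative when read as a signed number. */
static int is_negative(uint32_t reg) { return (int)((reg >> 31) & 1u); }

/* Bit 0 set means the value is odd. */
static int is_odd(uint32_t reg) { return (int)(reg & 1u); }

/* Read a 32-bit register pattern as a two's complement signed value. */
static int64_t as_signed(uint32_t reg)
{
    return is_negative(reg) ? (int64_t)reg - 4294967296LL : (int64_t)reg;
}
```

The same bit pattern, for instance 0xFFFFFFFF, reads as 4294967295 when treated as unsigned and as -1 when treated as signed.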
A quick introduction to the assembler syntax:
Each line may contain a single instruction. (some assemblers allow more, but all assemblers allow this syntax).
An assembly line can be split into fields:
[label field] [operation field] [operand field] [comment field]
Although it's possible to use SPACE characters between fields, I recommend using TABs.
TABs make it quicker to move your caret through the source code and most editors allow you to change the visible width of a tab, so if suddenly you need to make your label field wider, you do not have to insert spaces in every line throughout your source-code, in order to make it 'readable'.
Example:
setup: mov r0,#0 /* [1][1] set r0 to zero */
The numbers in the [square brackets] in the comments are the number of clock cycles the instruction would use. I tend to write these in all my documents. If you see a single pair of square brackets, the number of clock cycles is the same on Cortex-M3 and Cortex-M4. If you see two pairs of square brackets, the first one is for Cortex-M3 and the second one is for Cortex-M4. A 'P' in the count stands for the cycles needed to refill the pipeline after a taken branch. If you see a forward slash with numbers on each side of the slash, the instruction is a conditional instruction: the first number is how many clock cycles it takes when the instruction is not executed, the second number is how many clock cycles it takes when the instruction is executed.
In the old days, the comment field started with a semicolon (;) or an asterisk (*).
But some modern assemblers implement multi-instruction lines, where each instruction is separated by a semicolon, so the semicolon can not be used anymore.
And some modern assemblers now allow you to have whitespace in the operand field, which means that you would not be able to use *, because the remainder of the line would then be treated as a comment and thus ignored.
To fix this problem, some assemblers use the pound (#) character for comments, but that would not comply with Arm's (and several other architectures') assembly syntax.
In the GNU assembler's Arm syntax, the AT symbol (@) starts a comment, but to make my life easier when writing these articles (because @ triggers the mention-pop-up), I've chosen to use C-style comments. Those can be used if you are using a C pre-processor with your assembler; the pre-processor will then remove the comments and send the cleaned listing to your assembler.
Remember for later where the store is, so you can get a truck-load of values.
There are two basic instruction types for accessing memory on the Cortex-M series.
Load instructions read values from memory into registers.
Store instructions store values from registers into memory.
The LDR instruction can be used to read the memory contents at an address, which one register is pointing to, into another register.
It's even possible to add either a constant value (this is called an immediate value) or another register, which can optionally be bit-shifted before adding it, in order to form a source-address.
ldr r0,[r1] /* [2] read a 32-bit word from the memory address that r1 is pointing to into r0 */
So if the value of r1 is 0x10000004, and the memory at address 0x10000004 contains 3141592653, then r0 will now contain the value 3141592653, which was read from memory.
In the above case, r0 is the destination register and r1 is the source base register.
To read from address 0x10000008 instead, we could use the following instruction:
ldr r0,[r1,#4] /* [2] read a 32-bit word from the memory address r1+4 into r0 */
Here r1 is still the source base register, but we've also specified a constant index, which is called an "immediate value".
The immediate value is added to the base register, to form the final source address. This operation does not change r1.
The square brackets mean 'read memory contents'.
It is possible to do two operations in one; we can read the memory contents, and update the source base register:
ldr r0,[r1],#4 /* [2] read a 32-bit word from the memory address r1 into r0, then add 4 to r1 */
Notice that the #4 is now on the outside of the square bracket. This means 'after reading memory contents'.
It's also possible to use a register as index:
ldr r0,[r1,r2] /* [2] read a 32-bit word from the memory address r1+r2 into r0 */
Both r1 and r2 are unchanged.
Finally, we can bit-shift the index register, before it is added:
ldr r0,[r1,r2,lsl#2] /* [2] read a 32-bit word from the memory address r1+(r2 << 2) into r0 */
Note: r2 is still unchanged after the operation, the only register that is changed is r0, which will hold the value we've read from memory.
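The addressing modes above can be modelled in a few lines of C. This is only a sketch under my own assumptions: memory is a small array of 32-bit words, an "address" is a byte offset into that array, and the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Read one aligned 32-bit word; 'address' is a byte offset into 'mem'. */
static uint32_t load_word(const uint32_t *mem, uint32_t address)
{
    assert((address & 3u) == 0u);   /* this sketch only handles aligned words */
    return mem[address / 4u];
}

/* ldr r0,[r1] */
static uint32_t ldr_base(const uint32_t *mem, uint32_t r1)
{
    return load_word(mem, r1);
}

/* ldr r0,[r1,#imm] : r1 is not changed */
static uint32_t ldr_offset(const uint32_t *mem, uint32_t r1, uint32_t imm)
{
    return load_word(mem, r1 + imm);
}

/* ldr r0,[r1],#imm : read from r1 first, then add imm to r1 */
static uint32_t ldr_post(const uint32_t *mem, uint32_t *r1, uint32_t imm)
{
    uint32_t value = load_word(mem, *r1);
    *r1 += imm;
    return value;
}

/* ldr r0,[r1,r2,lsl#n] : neither r1 nor r2 is changed */
static uint32_t ldr_reg_shift(const uint32_t *mem, uint32_t r1, uint32_t r2, unsigned n)
{
    assert(n <= 3u);                /* the shift is limited to 0..3 here */
    return load_word(mem, r1 + (r2 << n));
}
```

Only the post-indexed form takes a pointer to r1, because it is the only one of the four that writes the base register back.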
We can use the LDR instruction to put constants into registers, by using a slightly different syntax.
This syntax actually uses the PC register as the base address:
ldr r0,=42 /* [2] load the immediate value 42 into r0 */
This syntax can also be used for loading relocatable addresses, and is what it is used for most of the time.
The STR instruction can be used to write a value from one register into memory, at the address which another register is pointing to.
The syntax for the STR instruction is virtually the same as the LDR instruction.
str r0,[r1] /* [1] write a 32-bit word from r0 into the memory address that r1 is pointing to */
Notice: The source operand is on the left side, and the destination operand is on the right side, when we're dealing with the STR instruction!
Variants of LDR and STR instructions:
LDR, STR: These transfer 32-bit values between memory and registers.
LDRH, STRH: These transfer (unsigned) 16-bit values between memory and registers.
LDRSH: This instruction transfers signed 16-bit values from memory to a register.
LDRB, STRB: These transfer (unsigned) 8-bit values between memory and registers.
LDRSB: This instruction transfers signed 8-bit values from memory to a register.
LDM, LDMIA, LDMDB: These transfer multiple values from memory to registers, syntax:
ldm r0,{r1-r4} /* [5] read four 32-bit values to registers r1, r2, r3 and r4 from the address r0 points to */
ldmia r0!,{r1-r4} /* [5] read four 32-bit values to registers r1 .. r4 from address r0, then add 16 to r0 */
ldmdb r0!,{r1-r4} /* [5] subtract 16 from r0, then read four 32-bit values to registers r1 .. r4 from address r0 */
STM, STMIA, STMDB: These transfer multiple values from registers to memory. The syntax is the same as LDM, LDMIA and LDMDB.
PUSH saves registers on the stack in a way similar to STMDB, but the stack pointer is not specified. Syntax:
push {r0-r11,lr} /* [14] save registers and return-address onto the stack (one cycle plus one per register) */
POP restores registers from the stack in a way similar to LDMIA, but like PUSH, the stack pointer is not specified. Syntax:
pop {r0-r11,pc} /* [14+P] restore saved registers and return to caller */
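Here is a rough C model of how PUSH and POP move the stack pointer, in the "decrement before" and "increment after" style described above. The Stack type, its size and the function names are assumptions made for this sketch, and the stack pointer is kept as a word index instead of a real address.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 8u   /* hypothetical stack size for this sketch */

typedef struct {
    uint32_t words[STACK_WORDS];
    uint32_t sp;         /* index of the last pushed word; STACK_WORDS when empty */
} Stack;

/* Like push {...} (an STMDB on SP): decrement first, then store.
   The lowest-numbered register ends up at the lowest address. */
static void push_regs(Stack *s, const uint32_t *regs, uint32_t count)
{
    assert(count <= s->sp);         /* real hardware does not check this for you */
    s->sp -= count;
    for (uint32_t i = 0; i < count; i++)
        s->words[s->sp + i] = regs[i];
}

/* Like pop {...} (an LDMIA on SP): load first, then increment. */
static void pop_regs(Stack *s, uint32_t *regs, uint32_t count)
{
    assert(s->sp + count <= STACK_WORDS);
    for (uint32_t i = 0; i < count; i++)
        regs[i] = s->words[s->sp + i];
    s->sp += count;
}
```

Popping the same number of registers that were pushed leaves the stack pointer where it started, which is why a saved LR can be popped straight into PC.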
Branching is the same as jumping to a different part of the program. Branching can be unconditional or conditional.
You can use the branch instruction to jump directly to the address that a label represents, or you can use it to jump to an address contained in a register.
b olive /* [3+P][1+P] jump forward to the address that the label 'olive' represents */
... /* (this part is not executed) */
olive: /* This is where the above branch jumps to. Code after this label will be executed. */
To jump to an address, which a register is pointing to, the syntax is slightly different:
bx lr /* [3+P][1+P] jump to the address that the Link Register (r14) is pointing to */
... The above example can be used to return from a subroutine, because the Link Register is automatically loaded with the return address, when using the BL instruction.
As mentioned above, the BL instruction (Branch and Link) jumps to the address that a label represents, and it sets the Link Register to point to the instruction right after the BL itself.
bl subroutine /* [3+P][1+P] jump to the subroutine and save the return-address in LR (r14), the Link Register */
It is also possible to branch to a subroutine, which a register is pointing to. This is done using the BLX instruction:
blx r3 /* [3+P][1+P] jump to the address in r3 and save the return-address in LR (r14), the Link Register */
Important: If your routine modifies the Link Register (aka r14), then it must save the previous value. This can be done by using the PUSH instruction.
As shown above in the PUSH and POP examples, the PUSH instruction saves LR, while the POP instruction restores PC. This means you do not have to restore LR and then branch to it, because the PC is loaded directly from the saved address on the stack.
All math instructions are based upon the syntax:
destination = source1 <operation> source2
... in other words, you should see the instructions like this ...
a = b + c
a = b / c
a = b - c
a = b * c
... etc., where 'a' is the first operand, 'b' is the second and 'c' is the third.
To bit-shift a value to the left, we can use the LSL instruction or the LSLS instruction, which also updates the condition codes.
Let's imagine that r1 contains the value %10111110111011111111101011001110 and we want to bit-shift it 3 times to the left, we can do it this way:
lsl r0,r1,#3 /* [1] operation: r0 = r1 << 3. Condition codes are left unchanged. */
Thus the result of the operation would be %11110111011111111101011001110000
If instead, we need to update the condition codes, we can use LSLS:
lsls r0,r1,#3 /* [1] operation: r0 = r1 << 3. Condition codes are updated. */
bcs carried_away /* [1/3] branch forward if the carry flag was set. */
... /* if the carry flag was cleared, we get here. */
carried_away: /* this is the label where the above bcs instruction can branch to */
If using the above mentioned value, the result of the operation would be %11110111011111111101011001110000 and the (C)arry flag, plus the (N)egative flag would be set, while the (Z)ero flag would be cleared.
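If you would like to check the result and the flags yourself, the following C sketch models LSLS for shift amounts from 1 to 31. The ShiftResult type and the function name are my own inventions for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t value;   /* the shifted result */
    int n, z, c;      /* (N)egative, (Z)ero and (C)arry after the operation */
} ShiftResult;

/* Model of "lsls rd,rm,#shift" for shift amounts 1..31. */
static ShiftResult lsls(uint32_t rm, unsigned shift)
{
    ShiftResult r;
    assert(shift >= 1u && shift <= 31u);
    r.value = rm << shift;
    r.c = (int)((rm >> (32u - shift)) & 1u);   /* the last bit shifted out */
    r.n = (int)(r.value >> 31);                /* bit 31 of the result */
    r.z = (r.value == 0u);
    return r;
}
```

Calling lsls(0xBEEFFACE, 3), which is the binary value used above written in hexadecimal, gives 0xF77FD670 with the carry and negative flags set and the zero flag cleared.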
We can also shift values to the right; this is done by using the LSR instruction or the LSRS instruction.
LSR and LSRS shift the value to the right. They insert zeroes at the left hand side (i.e. bit 31, the most significant bit, will be changed to zero):
lsrs r0,r1,#31 /* [1] operation: r0 = r1 >> 31. Condition codes are updated. */
The above instruction actually extracts the most significant bit of r1, so the value of r0 is now either 0 or 1. This is very useful, when it is necessary to test the most significant bit of a value. If using the previously mentioned value, r0 would now contain %00000000000000000000000000000001. The (C)arry flag would be cleared, because the last bit shifted out is bit 30 of r1, which is 0. The (N)egative flag would be cleared, and the (Z)ero flag would also be cleared.
Thus, as you can probably imagine, LSR and LSRS are often used for unsigned values.
If you need to shift signed values to the right, then we have the ASR and ASRS instructions. Let's try changing the above LSRS to ASRS:
asrs r0,r1,#31 /* [1] operation: r0 = r1 >> 31. Condition codes are updated. */
Now the value of r0 will be %11111111111111111111111111111111. The (C)arry flag will be cleared (the last bit shifted out is bit 30 of r1, which is 0), the (N)egative flag will be set and the (Z)ero flag will be cleared.
It can be very useful to use ASRS to shift the value 31 times to the right, because this keeps the sign only. It can be used directly with other operations; for instance to create a quick ABS function; we will see that later.
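The difference between the logical and the arithmetic right shift can be sketched in C like this. The names are made up for the example, and only shift amounts from 1 to 31 are handled. In both cases the carry receives the last bit shifted out, which for a shift of 31 is bit 30 of the source.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t value;
    int n, z, c;
} ShiftResult;

/* Fill in N and Z from the result, and C from the last bit shifted out. */
static ShiftResult with_flags(uint32_t value, uint32_t carry_out)
{
    ShiftResult r;
    r.value = value;
    r.n = (int)(value >> 31);
    r.z = (value == 0u);
    r.c = (int)(carry_out & 1u);
    return r;
}

/* lsrs rd,rm,#shift : zeroes enter from the left */
static ShiftResult lsrs(uint32_t rm, unsigned shift)
{
    assert(shift >= 1u && shift <= 31u);
    return with_flags(rm >> shift, rm >> (shift - 1u));
}

/* asrs rd,rm,#shift : copies of bit 31 enter from the left */
static ShiftResult asrs(uint32_t rm, unsigned shift)
{
    uint32_t value;
    assert(shift >= 1u && shift <= 31u);
    value = rm >> shift;
    if (rm & 0x80000000u)
        value |= ~(0xFFFFFFFFu >> shift);   /* fill the vacated bits with ones */
    return with_flags(value, rm >> (shift - 1u));
}
```

The sign fill is written out by hand because right-shifting a negative signed value in C is implementation-defined.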
There are four more members of the shift/rotate family. The first two are the ROR and RORS instructions.
These rotate the value to the right. Each bit that was shifted out from bit 0, would be shifted into bit 31, thus no bits are lost.
For RORS, the last bit shifted out will also be placed in the (C)arry flag.
Let's have a look:
rors r0,r1,#14 /* [1] operation: r0 = (r1 >> 14) | (r1 << (32-14)). Condition codes are updated. */
If using the previously mentioned value, r0 would now contain %11101011001110101111101110111111. The (C)arry flag would be set, because the last bit rotated out (bit 13 of r1) is 1. The (N)egative flag would be set and the (Z)ero flag would be cleared.
The last two instructions are the RRX and RRXS instructions. These work like ROR and RORS, except that they can only rotate the value one bit position to the right, and the bit that enters at bit 31 is the old value of the (C)arry flag.
rrxs r0,r1 /* [1] operation: nC = r1 & 1; r0 = (r1 >> 1) | (C << 31); C=nC. Condition codes are updated. */
Using the previously mentioned value, our result would now be: %C1011111011101111111110101100111, where C is the previous value of the (C)arry flag. The (C)arry flag would now be cleared.
The RRX instruction does not update any flags, not even the (C)arry flag. However, it still reads the (C)arry flag, so the result of the above operation will be the same; the (C)arry flag will just stay what it was, and so will the (Z)ero, o(V)erflow and (N)egative flags.
We will have a closer look at how rrx and rrxs can be used in part 4 of these articles.
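Until then, here is a minimal C model of the two rotate operations, in case you want to try them out with your own values. The function names are mine, and the carry flag is passed around as a plain int.

```c
#include <assert.h>
#include <stdint.h>

/* ror rd,rm,#shift for shift amounts 1..31: no bits are lost. */
static uint32_t ror(uint32_t rm, unsigned shift)
{
    assert(shift >= 1u && shift <= 31u);
    return (rm >> shift) | (rm << (32u - shift));
}

/* rors: same result, and the last bit rotated out (now in bit 31) goes to carry. */
static uint32_t rors(uint32_t rm, unsigned shift, int *carry)
{
    uint32_t result = ror(rm, shift);
    *carry = (int)(result >> 31);
    return result;
}

/* rrxs: a 33-bit rotate by one position through the carry flag. */
static uint32_t rrxs(uint32_t rm, int *carry)
{
    uint32_t result = (rm >> 1) | ((uint32_t)(*carry != 0) << 31);
    *carry = (int)(rm & 1u);   /* old bit 0 becomes the new carry */
    return result;
}
```

With the value used in this article (0xBEEFFACE in hexadecimal), rors by 14 gives 0xEB3AFBBF, and rrxs with the carry set beforehand gives 0xDF77FD67 and leaves the carry cleared.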
Some of the shift and rotate instructions are actually MOV and MOVS instructions with a shift-operation on the second operand.
But since the Cortex-M0 does not support a shifted second operand, it implements the shift instructions as stand-alone instructions.
The Cortex-M3 and later are able to execute binary Cortex-M0 instructions, which means that the 16-bit instructions are still specific shift instructions, whereas the wider instructions are actually MOV and MOVS instructions. This normally does not matter when you're writing code, though.
Note: Bit-shifting can be used to quickly multiply or divide by a number which is in the power of two, for instance by 2, 4, 8, 16, 32, etc...
To add two values, we can use the ADD instruction or her sister, the ADDS instruction, which updates the condition codes.
add r0,r1,r2 /* [1] operation: r0 = r1 + r2. Condition codes are left unchanged. */
We can use the adds variant to update the condition codes. This basically means that the result of the operation is compared to zero.
adds r0,r1,r2 /* [1] operation: r0 = r1 + r2. Condition codes are updated. */
bmi r0_negative /* [1/3] branch forward if the result of the operation is negative. */
... /* we'll get here when the result of the operation is zero or positive. */
r0_negative: /* this is the label where the above bmi instruction can branch to. */
The add instruction can do slightly more than just add numbers.
We can use it with bit-shifting as well, for instance for multiplying constant values; here we multiply by 3:
add r0,r0,r0,lsl#1 /* [1] operation: r0 = r0 + (r0 << 1). */
... or by 5 ...
add r0,r0,r0,lsl#2 /* [1] operation: r0 = r0 + (r0 << 2). */
... by 9 ...
add r0,r0,r0,lsl#3 /* [1] operation: r0 = r0 + (r0 << 3). */
Note: On Cortex-M0, it is not possible to use bit-shifting on the second operand due to the limited number of opcodes in the Armv6-M instruction set.
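The three multiplications above are easy to verify with a C model. The function names are invented for this sketch; unsigned 32-bit arithmetic in C wraps around in the same way the register does.

```c
#include <assert.h>
#include <stdint.h>

/* add r0,r0,r0,lsl#1 : r0 + 2*r0 */
static uint32_t times3(uint32_t r0) { return r0 + (r0 << 1); }

/* add r0,r0,r0,lsl#2 : r0 + 4*r0 */
static uint32_t times5(uint32_t r0) { return r0 + (r0 << 2); }

/* add r0,r0,r0,lsl#3 : r0 + 8*r0 */
static uint32_t times9(uint32_t r0) { return r0 + (r0 << 3); }
```

The pattern works for any constant of the form 2^n + 1, since r0 + (r0 << n) equals r0 * (2^n + 1).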
Subtracting can be done using the SUB instruction or his brother, the SUBS instruction.
sub r0,r1,r2 /* [1] operation: r0 = r1 - r2. Condition codes are left unchanged. */
... or updating the condition codes using the variant...
subs r0,r1,r2 /* [1] operation: r0 = r1 - r2. Condition codes are updated. */
bpl r0_positive /* [1/3] branch forward if the result of the operation is positive or zero. */
... /* we'll get here when the result of the operation is negative. */
r0_positive: /* this is the label where the above bpl instruction can branch to. */
Like add and adds, sub and subs also allow bit-shifting on the second operand...
sub r0,r1,r1,lsr#1 /* [1] operation: r0 = r1 - (r1 >> 1). This divides by two and rounds up. */
The above instruction differs from just bitshifting once to the right, because if r1 is 7, then LSR r0,r1,#1 would result in r0 becoming 3. sub r0,r1,r1,lsr#1 would subtract 3 from 7, thus the result would be 4. This means you can divide by two and round up in a single instruction.
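As a quick check of the rounding behaviour, here is the same operation as a C sketch for unsigned values. The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* sub r0,r1,r1,lsr#1 : r1 minus half of r1 (rounded down) is half of r1 rounded up. */
static uint32_t half_round_up(uint32_t r1)
{
    return r1 - (r1 >> 1);
}
```

A nice side effect is that this form cannot overflow, which the more obvious (r1 + 1) >> 1 does when r1 is 4294967295.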
To multiply two registers, you can use the MUL and MULS instructions.
mul r0,r1,r2 /* [1] operation: r0 = r1 * r2. This instruction does not update the condition codes. */
The MULS instruction is identical to MUL, but it also updates the condition codes.
To divide a register by the value of another register, you can use the SDIV and UDIV instructions.
Unlike MUL, SDIV and UDIV do not have variants that update the condition codes.
SDIV is for dividing signed numbers, UDIV is for dividing unsigned numbers:
sdiv r0,r1,r2 /* [2..12] operation: r0 = r1 / r2. This instruction does not update the condition codes. */
Bitwise OR, bitwise AND and bitwise Exclusive OR are all very important parts of any CPU. Instructions such as ADD and SUB are built from AND, OR and XOR.
On the Cortex-M series, we have a little more than the basics, because we have some complementary variants as well; those are called BIC and ORN.
The AND and ANDS instructions
and r0,r1,#0x0f /* [1] operation: r0 = r1 & 15. This instruction does not update the condition codes. */
The above operation will copy the low 4 bits of r1 to r0; the remaining bits of r0 will be zero. Thus the value in r0 can be anything between 0 and 15, both inclusive.
It's very useful to combine the AND operation with shifts. For instance, you may want to rotate a register and keep a selected number of bits. It can be done this way:
movs r0,#0x003fc /* [1] operation: This is a byte mask multiplied by 4. */
ands r0,r0,r1,ror#(8-2) /* [1] operation: r0 = r0 & (r1 ror 6). Under this mask, that selects the same bits as r1 >> 6. */
In the above example, r1 contains 4 bytes, and we want to extract those 4 bytes one at a time, but we also want the final result to be multiplied by 4. Thus we only have to load the mask once, and can issue four AND instructions in a row, as long as each one writes its result to a register other than the one holding the mask.
If we wanted, we could also just extract the value at its position by moving the mask around:
ands r0,r1,r0,ror#(32-8) /* [1] operation: r0 = r1 & (r0 << 8). */
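To see how one mask and four different rotations pick out the four bytes, here is a C sketch of the first variant. The function names are my own, and the rotation amounts are derived from the ror#(8-2) used above for byte 1, on the assumption that the other bytes follow the same 8*n-2 pattern (wrapping around for byte 0).

```c
#include <assert.h>
#include <stdint.h>

static uint32_t ror(uint32_t value, unsigned shift)
{
    assert(shift >= 1u && shift <= 31u);
    return (value >> shift) | (value << (32u - shift));
}

/* Returns byte number 'byte_index' (0 = least significant) of r1, multiplied by 4.
   Models: ands rd,mask,r1,ror#(8*byte_index-2) with mask = 0x3fc. */
static uint32_t byte_times_4(uint32_t r1, unsigned byte_index)
{
    const uint32_t mask = 0x3FCu;              /* 0xff << 2 */
    unsigned shift;
    assert(byte_index <= 3u);
    shift = (8u * byte_index + 30u) % 32u;     /* 30, 6, 14 and 22 */
    return mask & ror(r1, shift);
}
```

A result that is already multiplied by 4 can be used directly as a byte offset into a table of 32-bit words.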
The ORR and ORRS instructions
orr r0,r1,#0x0a /* [1] operation: r0 = r1 | 10. This instruction does not update the condition codes. */
The above operation will set bits 1 and 3 in r0, the other bits will be copied from r1.
We can combine the ORR instruction with bit-shifting to form a quick SGN operation.
movs r3,#1 /* [1] get a one. */
orrs r1,r3,r2,asr#31 /* [1] operation: r1 = r3 | (r2 >> 31), an arithmetic shift, so r1 becomes -1 if r2 is negative, otherwise 1. */
The result of the operation can be used with the MUL instruction, in order to negate a value if it's negative, thus we have an ABS function:
mul r0,r2,r1 /* [1] get the absolute value of r2 to r0. */
This is very useful, when working with distances between two points, or when drawing lines (eg. Bresenham). You can of course re-use the value in r3, so each ABS operation will take only two clock cycles. The standard ABS operation involves branching. One of the optimized ABS operations is patented, thus there are certain instructions you are not allowed to use in your code. (What do the instruction architects think about this, btw?). I hereby declare my version of ABS in the public domain. That means it can not be patented by anyone, and may freely be used in your code.
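For those who want to convince themselves that the ORR/MUL combination really gives the absolute value, here is a C model of the sequence. The function names are invented for this sketch, and registers are modelled as unsigned 32-bit values so the arithmetic wraps like the hardware does.

```c
#include <assert.h>
#include <stdint.h>

/* movs r3,#1 ; orrs r1,r3,r2,asr#31
   Gives 0xFFFFFFFF (-1) when r2 is negative, otherwise 1. */
static uint32_t sign_factor(uint32_t r2)
{
    uint32_t sign_mask = (r2 & 0x80000000u) ? 0xFFFFFFFFu : 0u;   /* r2 asr 31 */
    return 1u | sign_mask;
}

/* mul r0,r2,r1 : multiply the value by its own sign factor. */
static uint32_t abs_via_mul(uint32_t r2)
{
    return r2 * sign_factor(r2);
}
```

Two details are worth knowing: the sign factor for zero is 1, not 0, and the most negative value (0x80000000) stays as it is, because +2147483648 does not fit in a signed 32-bit register.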
The EOR and EORS instructions (Exclusive OR)
eor r0,r1,#0x05 /* [1] operation: r0 = r1 ^ 5. This instruction does not update the condition codes. */
The above operation will copy all bits from r1 to r0, except that it will toggle (i.e. invert) bit 0 and bit 2.
The BIC and BICS instructions (BIt Clear)
bic r0,r1,#0x0f /* [1] operation: r0 = r1 & ~15. This instruction does not update the condition codes. */
The above operation does exactly the opposite of AND. It clears those bits that we specify in the second operand.
Note: Due to the nature of bic, exchanging the two source operands will completely change the result of the operation.
Like with AND, it's also useful to combine the BIC instruction with shifts and rotate, especially when dealing with code that needs to be very rapid.
The ORN and ORNS instructions (OR Not)
orn r0,r1,#0x0f /* [1] operation: r0 = r1 | ~15. This instruction does not update the condition codes. */
This operation does the opposite of the OR instruction... Well, it actually sets those bits we do not specify as 1; in other words, where we specify a zero, the bits will be set.
The CMP instruction can be used for comparing values. It is almost identical to SUBS, except that it does not modify any registers.
cmp r0,r1 /* [1] operation: r0 ? r1. This instruction always updates the condition codes. */
bgt r0_greater_than_r1 /* [1/3+P][1/1+P] if r0 is greater than r1, then branch forward. */
... /* we get here if r0 is less than or equal to r1. */
r0_greater_than_r1: /* this is the label for the bgt branch above. */
There are two good ways to remember how the CMP instruction works.
The first is to think of CMP as a SUBS that throws its result away: CMP behaves exactly the same way SUBS does, except that the result is not saved in any register. This is probably the best method.
The second method is the one I prefer, because I tend to imagine a 'greater than' or 'less than' character between the two operands.
Thus the condition of the branch instruction names the character that would go between the two operands in the operand field. For instance, cmp r0,r1 followed by bgt reads as 'r0 > r1'.
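The "SUBS without a result" view can be sketched in C as follows. The Flags type and the function names are assumptions made for this illustration; the GT and HI conditions are the usual signed and unsigned 'greater than' tests on the flags.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int n, z, c, v; } Flags;

/* cmp rn,rm : compute rn - rm, keep only the flags. */
static Flags cmp(uint32_t rn, uint32_t rm)
{
    Flags f;
    uint32_t diff = rn - rm;
    f.n = (int)(diff >> 31);                        /* result looks negative */
    f.z = (diff == 0u);                             /* operands are equal */
    f.c = (rn >= rm);                               /* no borrow occurred */
    f.v = (int)(((rn ^ rm) & (rn ^ diff)) >> 31);   /* signed overflow */
    return f;
}

/* bgt is taken when: Z clear and N equals V (signed greater than) */
static int cond_gt(Flags f) { return !f.z && (f.n == f.v); }

/* bhi is taken when: C set and Z clear (unsigned higher) */
static int cond_hi(Flags f) { return f.c && !f.z; }
```

Comparing 0xFFFFFFFF with 1 shows why the choice of condition matters: as signed numbers -1 is not greater than 1, but as unsigned numbers 4294967295 is.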
The CMN (CoMpare Negative) instruction is the opposite of CMP, just like ADD is the opposite of SUB.
CMN is almost identical to the ADDS instruction, except that it does not modify any registers.
Only the condition codes are changed.
The TST instruction is almost identical to the ANDS instruction, the only difference is that it does not modify any registers.
tst r0,r1 /* [1] operation: r0 & r1. This instruction always updates the condition codes. */
Only the condition codes are changed. If the result of the operation is zero, the (Z)ero flag will be set, otherwise it will be cleared. If bit 31 of the result is set, the (N)egative flag will be set, otherwise it will be cleared. The o(V)erflow and (C)arry flags are unaffected by this operation, so it's useful when you need to test bits while preserving the carry flag for instance.
The TEQ instruction is almost identical to the EORS instruction; the only difference is that, like TST, it does not modify any registers.
teq r0,#3 /* [1] operation: r0 ^ 3. This instruction always updates the condition codes. */
As with TST, only the Z and N flags are changed. Thus the Z flag is set only if there is an exact match. The N flag is set if bit 31 of the first operand differs from bit 31 of the second operand.
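Here is a small C model of how TST and TEQ set the Z and N flags. The type and function names are made up for the sketch, and only the two flags discussed above are modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int n, z; } FlagsNZ;

/* Derive N and Z from a result that is then thrown away. */
static FlagsNZ flags_of(uint32_t result)
{
    FlagsNZ f;
    f.n = (int)(result >> 31);
    f.z = (result == 0u);
    return f;
}

/* tst rn,op2 : flags of (rn & op2) */
static FlagsNZ tst(uint32_t rn, uint32_t op2) { return flags_of(rn & op2); }

/* teq rn,op2 : flags of (rn ^ op2) */
static FlagsNZ teq(uint32_t rn, uint32_t op2) { return flags_of(rn ^ op2); }
```

So tst answers "do the two operands have any set bits in common?" (Z is set when they do not), while teq answers "are the two operands identical?" (Z is set when they are).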
First of all, here are a few good reasons to use the MLA instruction.
"MLA" is an abbreviation of MuLtiply with Accumulate (or MuLtiply and Add if you wish)
On Cortex-M3, MLA uses 2 clock cycles, so in fact, there's no difference between using MLA and MUL followed by an ADD instruction.
-Or is there ?
I would say otherwise.
First of all, it's obvious that two instructions most likely will use more space than a single instruction.
Assuming that both MUL and ADD were 16 bit, and a MLA was 32-bit, then you have not used any extra space by using MLA.
But there are two more places where you can benefit from using MLA:
There is one thing to consider, though: sometimes you will need to calculate a product once and then add it to more than one register.
In such cases, using a single MUL and two ADD instructions will outperform MLA on the Cortex-M3, but on the Cortex-M4 (and later) MLA will still be faster.
Personally, I would pick MUL+2*ADD on Cortex-M3, because it makes your program run as fast as possible on this platform.
If you are using anything 'array', you could probably benefit from using MLA.
These examples are real-world examples; things that you'll probably use often (even without knowing it).
r0 = 32-bit value to store
r1 = x position in matrix
r2 = y position in matrix
r3 = array base address
r4 = width of matrix in 32-bit words
mla r2,r2,r4,r1 /* [2][1] r2 = y * width + x */
str r0,[r3,r2,lsl#2] /* [1] store the 32-bit word there. */
For instance, you have an X, Y and Z index in an array.
r0 = value to store
r1 = x
r2 = y
r3 = z
r4 = array base address
r5 = width
r6 = height * width
(we do not need the depth)
mla r2,r2,r5,r1 /* [2][1] r2 = y * width + x */
mla r2,r3,r6,r2 /* [2][1] r2 = z * height * width + r2 */
str r0,[r4,r2,lsl#2] /* [1] store the 32-bit word there. */
5 clock cycles on Cortex-M3 and only 3 on Cortex-M4. Pretty neat, ay ?
Most other architectures need to spend ages calculating the index, and the code size can be huge; especially if we're talking about 8-bit multiplication. Such an example wouldn't fit this document, so I've chosen to exclude it.
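The two index calculations can be written as a C sketch like this, with one mla() call per MLA instruction. The function names are invented for the example; in C the array indexing takes care of the multiplication by 4 that lsl#2 does in the STR instruction.

```c
#include <assert.h>
#include <stdint.h>

/* mla rd,rn,rm,ra : rd = ra + rn * rm */
static uint32_t mla(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra + rn * rm;
}

/* 2D: index = y * width + x */
static void store_2d(uint32_t *base, uint32_t width,
                     uint32_t x, uint32_t y, uint32_t value)
{
    base[mla(y, width, x)] = value;
}

/* 3D: index = z * (height * width) + y * width + x */
static void store_3d(uint32_t *base, uint32_t width, uint32_t height,
                     uint32_t x, uint32_t y, uint32_t z, uint32_t value)
{
    uint32_t index = mla(y, width, x);         /* first MLA */
    index = mla(z, height * width, index);     /* second MLA */
    base[index] = value;
}
```

In the assembly version, height * width is kept ready in a register (r6), so it is not recalculated for every access.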
"MLS" is an abbreviation of MuLtiply with Subtract (or you could call it MuLtiply and Subtract)
The MLS instruction is similar to MLA, but it subtracts the product from the third operand.
This makes it suitable for finding the remainder of a division (this is also called a modulo or modulus operation).
Getting the remainder of a division:
udiv rT,rA,rB /* rT = rA / rB (example: 1 = 17 / 10) */
mls rT,rT,rB,rA /* rT = rA - rT * rB (example: 7 = 17 - (1 * 10)) */
udiv r0,r3,r2 /* [2..12] r0 = r3 / r2. */
mls r1,r0,r2,r3 /* [2] r1 = r3 - r0 * r2 */
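The UDIV/MLS pair corresponds to the following C sketch. The function names are my own, and division by zero is simply not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* mls rd,rn,rm,ra : rd = ra - rn * rm */
static uint32_t mls(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra - rn * rm;
}

/* udiv followed by mls: returns a modulo b */
static uint32_t remainder_u32(uint32_t a, uint32_t b)
{
    uint32_t quotient;
    assert(b != 0u);              /* this sketch does not model division by zero */
    quotient = a / b;             /* udiv: rounds towards zero */
    return mls(quotient, b, a);   /* a - quotient * b */
}
```

The second form shown above, which writes to r0 and r1, gives you both the quotient and the remainder for the price of one extra instruction.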
This first part only deals with the foundation instructions.
These are necessary, in order to pave the way for the next part.
I think the next part might be a little more challenging.