A branch, quite simply, is a break in the sequential flow of instructions that the processor is executing. Some other architectures call them jumps, but they're essentially the same thing. The following is a trivial and hopefully familiar example of a branch:
    entry_point:
        mov   r0, #0      @ Set r0 to 0.
        b     target      @ Jump forward to 'target'.
        mov   r0, #1      @ Set r0 to 1.
    target:
        ...               @ At this point, r0 holds the value 0.
        ...               @ The second mov instruction did not execute.
There are several variants of branches in the Arm and Thumb instruction sets. Several of these are shared with many other CPU architectures, but a few are specific to Arm. Each variant is explained in detail below:
A relative branch is one where the target address is calculated based on the value of the current pc (program counter). Given the example above, an assembler would work out that the target label is eight bytes ahead of the b target instruction (in Arm code) and then generate a relative branch which means 'jump forward by eight bytes'. Relative branches are essential for position-independent code, which is expected to run correctly at any location in memory. The most common relative branches on Arm are single instructions and tend to be the most efficient branches available, though they have limited range.
An absolute branch will always jump to the specified address, regardless of the current pc. Absolute branches are used when the address of the target is provided as a function pointer, for example. However, because an absolute branch requires a full 32-bit target address, absolute branches usually require a load or some other constant-loading mechanism in addition to the branch instruction itself.
In many cases, the programmer (or compiler) may not actually care whether a branch is relative or absolute, and might just use whichever is most efficient on a case-by-case basis.
Because the Arm instruction set is fixed-width at 32 bits (and Thumb has either 16 or 32 bits), it is not possible to encode a full 32-bit branch offset in a single instruction. Relative branches can be encoded using a limited-range offset from the current pc. In assembly code, this is usually written as a branch to a label (as in the example above). The assembler will work out the required offset.
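As a worked illustration of the offset calculation, the earlier example can be annotated with addresses. This is only a sketch in GNU assembler syntax, assuming Arm state and a hypothetical load address called base. An Arm-state instruction reads pc as its own address plus eight, so although the label is eight bytes ahead of the branch, the offset actually stored in the instruction is zero.

```asm
    .syntax unified
    .arm
entry_point:
    mov   r0, #0      @ at base + 0
    b     target      @ at base + 4; pc reads as base + 12 here
    mov   r0, #1      @ at base + 8 (skipped)
target:               @ at base + 12
    bx    lr          @ encoded offset = target - pc = (base + 12) - (base + 12) = 0
```

The Arm-state b instruction stores the offset as a signed 24-bit count of words, which gives a range of roughly ±32MB around the branch.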
The range available varies between Arm and Thumb (and in a few cases also between instruction variants) but is usually very large and quite sufficient for most branches within a program. By combining additional instructions and literal pool loads, it is also possible to construct arbitrary full-range branches in case the single-instruction range is not sufficient. All practical absolute branches are necessarily full-range, since a 32-bit target address needs to be loaded.
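One way to build such a full-range absolute branch is to load the target address from a literal pool and then branch to the register. The following is a minimal sketch in GNU assembler syntax; far_target is a hypothetical symbol defined elsewhere, and ip (r12) is assumed to be free for use as a scratch register.

```asm
    .syntax unified
    .arm
jump_far:
    ldr   ip, =far_target   @ the assembler places the 32-bit address in a literal pool
    bx    ip                @ absolute branch to the address held in ip
    .ltorg                  @ emit the literal pool here, after the branch
```

The cost compared with a plain b is one extra load and one word of data, which is why single-instruction relative branches are preferred whenever the target is in range.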
Almost every modern programming language has some concept of functions. Any given function can (in general) be called from any part of a program, so processor architectures need some way to store the address of the caller. On Arm processors, this return address is stored in lr (the link register). Branch instructions with an l suffix -- like bl and blx -- work just like a standard b or bx branch, but also store a return address in lr.
If a function does not modify lr, then the return sequence can (and should) be a simple "bx lr". Otherwise, the lr can be pushed onto the stack at function entry. From here, the best return sequence is usually to pop directly into pc, though a number of other options are possible depending on the situation.
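The two return sequences can be sketched as follows, in GNU assembler syntax. The function names add_one and add_two are hypothetical, and the sketch assumes the usual procedure call standard, in which r0 carries the argument and result and the stack is kept 8-byte aligned at calls.

```asm
    .syntax unified
    .arm

@ Leaf function: never modifies lr, so it returns with a plain bx lr.
add_one:
    add   r0, r0, #1
    bx    lr

@ Non-leaf function: bl overwrites lr, so lr is saved on entry.
add_two:
    push  {r4, lr}          @ r4 is saved only to keep the stack 8-byte aligned
    bl    add_one           @ lr now holds the address of the next instruction
    bl    add_one
    pop   {r4, pc}          @ restore r4 and return by loading the saved lr into pc
```

Popping directly into pc saves an instruction compared with popping into lr and then executing bx lr.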
Programs on Arm processors can use either the Arm or Thumb instruction set, or both. Whilst Arm and Thumb instructions cannot be directly interleaved, it is possible to switch (or interwork) between Arm and Thumb states at run-time. This interworking is most notably achieved using special branch instructions with an x suffix, like bx and blx. Several other branch mechanisms are also capable of interworking. For example, the return sequence which writes to the pc using pop (or any other memory access) can interwork, and will always return in the appropriate state.
Branch instructions fall into three classes: Instructions that never change state (like "b label"), instructions that always change state (like "blx label"), and instructions that automatically change state based on the target address (like "bx register").
Address-based interworking uses the lowest bit of the address to determine the instruction set at the target. If the lowest bit is 1, the branch will switch to Thumb state. If the lowest bit is 0, the branch will switch to Arm state. Note that the lowest bit is never actually used as part of the address as all instructions are either 4-byte aligned (as in Arm) or 2-byte aligned (as in Thumb).
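Address-based interworking might look like the following sketch, written in GNU assembler syntax for Armv5T or later. The names arm_caller and thumb_callee are hypothetical. The sketch relies on the GNU toolchain's .thumb_func directive, which marks the label as a Thumb function so that its address is produced with bit 0 set.

```asm
    .syntax unified

    .arm
arm_caller:
    push  {r4, lr}
    ldr   r4, =thumb_callee @ address of a Thumb function, so bit 0 is 1
    blx   r4                @ bit 0 = 1: switch to Thumb state; lr = return address
    pop   {r4, pc}          @ return to whoever called arm_caller
    .ltorg

    .thumb
    .thumb_func
thumb_callee:
    movs  r0, #42
    bx    lr                @ lr came from Arm code, so bit 0 = 0: back to Arm state
```

Neither function needs to know the other's instruction set in advance. The state change is carried entirely by bit 0 of the address in r4 and of the return address in lr.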
The following table lists the branch instructions commonly used on Arm processors:
It is also possible to write into the pc using arithmetic instructions, but this is useful only in specific cases [1], and use of the normal branch instructions is advisable where possible.
Most of the interworking branches were added on Armv5T. The only way to interwork on Armv4T was to use the bx instruction. Armv4T interworking branch sequences are often much less efficient than the Armv5T versions, so it's best to use Armv5T branches unless you really need Armv4T compatibility.
To encode more complex branches than those listed above, a combination of instructions must be used. In cases like this, where the target address must be calculated in advance of the branch instruction, the normal methods for loading and calculating values are used. Arithmetic might be used for long-range relative branches, for example, and a constant pool load might be used for an absolute branch.
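A long-range relative branch built from arithmetic could be sketched as below, in GNU assembler syntax. This is an illustration under stated assumptions rather than a recommended sequence: far_target is a hypothetical symbol, it is assumed to be Arm code, and ip (r12) is assumed to be free as a scratch register. Because only the distance to the target is stored, the sequence remains position-independent.

```asm
    .syntax unified
    .arm
    ldr   ip, 1f                 @ load the precomputed offset from the word below
    add   pc, pc, ip             @ pc reads as this instruction's address + 8
1:  .word far_target - (. + 4)   @ '.' is this word's address; the add reads pc as '.' + 4
```

The data word sits directly after the add, which is safe here because the add always branches away before the word could be executed.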
Finally, there are a few branches available specifically in the Thumb-2 instruction set that are designed for specific use-cases. These are not available to the Arm instruction set (or to the original Thumb instruction set), and so I will give them only a brief mention, but if you're writing Thumb-2 code they can be very useful. For further details, refer to the Armv7-A/R Architecture Reference Manual.
For each special-purpose branch, I will also give a roughly equivalent Arm implementation. The Arm implementations have different limitations (such as branch range) and have other side effects (such as requiring a scratch register). Nevertheless, they should serve to clarify the behaviour of the Thumb-2 instructions.
cbnz
cbz
The cbnz (compare, branch on non-zero) and cbz (compare, branch on zero) instructions are useful for very short-range forward branches, such as loop terminations, that would otherwise require two or more instructions. The two-instruction version is still available, of course, and may be useful if more range is required, or if a more complicated comparison is required.
    Arm Implementation        Thumb-2 Implementation

    cmp   rA, #0              cbz   rA, label
    beq   label

    cmp   rA, #0              cbnz  rA, label
    bne   label
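A typical use is leaving a loop as soon as a loaded value turns out to be zero. The following sketch, in GNU assembler syntax, uses a hypothetical string_length routine that takes a pointer to a NUL-terminated string in r0 and returns its length in r0. Note that cbz and cbnz can only branch forwards and only accept the low registers r0-r7.

```asm
    .syntax unified
    .thumb
    .thumb_func
string_length:
    movs  r1, #0            @ r1 = running index
1:
    ldrb  r2, [r0, r1]      @ load the next byte
    cbz   r2, 2f            @ zero byte: leave the loop (short forward branch)
    adds  r1, r1, #1
    b     1b                @ backward branch: cbz and cbnz cannot do this
2:
    mov   r0, r1            @ return the length
    bx    lr
```

Unlike the cmp and beq pair, cbz leaves the condition flags untouched.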
tbb
tbh
The tbb (table branch byte) and tbh (table branch halfword) instructions are useful for the implementation of jump tables. One argument register is a base pointer to a table, and the second argument is an index into the table. The value loaded from the table is then doubled and added to the pc.
    Arm Implementation                Thumb-2 Implementation

    ldrb  ip, [rA, rB]                tbb   [rA, rB]
    add   pc, pc, ip, lsl #1

    ldrh  ip, [rA, rB, lsl #1]        tbh   [rA, rB, lsl #1]
    add   pc, pc, ip, lsl #1
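A small jump table using tbb might look like the sketch below, in GNU assembler syntax. It uses pc as the base register, so the table sits immediately after the tbb instruction. The dispatch routine and its case labels are hypothetical, and the index in r0 is assumed to have been range-checked already. Each table entry holds half the distance from the table to its case, because the loaded value is doubled before being added to the pc.

```asm
    .syntax unified
    .thumb
    .thumb_func
dispatch:
    tbb   [pc, r0]              @ pc reads as the address of 'table' here
table:
    .byte (case0 - table) / 2
    .byte (case1 - table) / 2
    .byte (case2 - table) / 2
    .p2align 1                  @ pad the 3-byte table back to halfword alignment
case0:
    movs  r0, #10
    bx    lr
case1:
    movs  r0, #20
    bx    lr
case2:
    movs  r0, #30
    bx    lr
```

With byte-sized entries the cases must lie within 510 bytes of the table. tbh uses halfword entries and so reaches much further.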
[1] A typical example of where arithmetic-based branches are useful is in the implementation of jump tables, but they are occasionally useful in other cases.
[2] Bit 0 of the address indicates the instruction set of the target. If 1, the target is Thumb. If 0, the target is Arm.
If you want to jump back from assembly to a calling function in C, but to the next instruction, we do a ldr PC,lr+4 for Arm. But if we want to jump back to the next instruction and it's Thumb, it could be lr+2 or lr+4, as Thumb-2 instructions can be 2 or 4 bytes. So how do we know whether the compiler of the C function used a 2- or 4-byte instruction, and therefore whether to add 2 or 4 to the lr?
Sorry, I didn't see this when it was posted!
By "ldr PC,lr+4" I think you mean "mov pc, lr + 4". However, the +4 isn't necessary; the call sequence sets up lr so that it points to the next instruction. The callee could do a simple "mov pc, lr" to return, but it would be much better to use "bx lr" for this.
Regarding the +2 vs +4 offset: The original Arm instruction set reads PC+8 because the original processors had a three-stage pipeline, so the PC value read by an instruction was two instructions (or eight bytes) ahead. The original Thumb instruction set only had two-byte instructions, so it read PC+4. When four-byte Thumb instructions were introduced, this behaviour was preserved, so Thumb always reads PC+4 irrespective of the size of the instruction used to read it.
In your specific case, you have a calling function written in C, which will almost certainly use "bl" or "blx" to call your assembly function. Your assembly function doesn't have to know exactly how this was done, because "bl" and "blx" will both set LR to the next instruction. Your assembly function should just use "bx lr" to return.