Size matters...

November 27, 2013

4 minute read time.

Size matters

How many instructions does it take to increment a number? One? Two? It is not as simple a question as you might imagine. And, believe me, this is one case where size matters.

On an ARM processor, any ARM processor, you can increment a 32-bit number in a single instruction, usually taking one cycle. It really is that simple. This is because ARM processors are 32-bit devices, with 32-bit registers, a 32-bit ALU and 32-bit internal data paths. They are good at doing 32-bit operations efficiently - because that is what they are designed to do.

Look at the C routine below and its corresponding assembly code. You can see that the increment operation translates to a single instruction.

C Code

int increment(int a)
{
    a = a + 1;
    return a;
}

ARM Assembly Code

increment
    add r0, r0, #1
    bx lr

But not every processor is an ARM processor. An 8051, for instance, has a natural data size of 8 bits. It has 8-bit registers and its ALU carries out 8-bit operations. So, what might it take to increment a 32-bit variable on an 8051? Try this:

C Code

long increment(long a)
{
    a = a + 1;
    return a;
}

8051 Assembly Code

; a assigned to R4:R5:R6:R7

MOV A, R7
ADD A, #01h
MOV R7, A
CLR A
ADDC A, R6
MOV R6, A
CLR A
ADDC A, R5
MOV R5, A
CLR A
ADDC A, R4
MOV R4, A
RET
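The carry chain in the listing above can be modelled in C. This is only a sketch of the idea, not compiler output: the function names `increment_bytewise` and `bytes_to_u32` are made up for illustration, and the "+1" is modelled as an initial carry into the low byte rather than as a separate immediate add.

```c
#include <stdint.h>

/* Increment a 32-bit value one byte at a time, the way an 8-bit ALU must.
 * b[0] is the most significant byte and b[3] the least significant,
 * mirroring the R4:R5:R6:R7 assignment in the 8051 listing. */
void increment_bytewise(uint8_t b[4])
{
    unsigned carry = 1;                 /* the "+1", fed in as a carry */
    for (int i = 3; i >= 0; i--) {
        unsigned sum = b[i] + carry;    /* one 8-bit addition per byte */
        b[i] = (uint8_t)sum;            /* keep only the low 8 bits */
        carry = sum >> 8;               /* carry out feeds the next byte */
    }
}

/* Hypothetical helper: reassemble the four bytes into one 32-bit value. */
uint32_t bytes_to_u32(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

Starting from the bytes 00 00 FF FF, the low two bytes wrap to zero and the carry ripples into the third byte, which is exactly the work the ADDC instructions do.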

It is clear that 32-bit addition on an 8051 is much harder and much more time-consuming than a simple 8-bit addition. Since the ALU can only handle 8 bits at a time, four separate additions are required to propagate any carry across the four bytes of the result. On an ARM processor, the reverse is true: working on an 8-bit variable costs more than working on a 32-bit one. Here is an example of incrementing an 8-bit variable.

C Code

int increment(unsigned char a)
{
    a = a + 1;
    return a;
}

ARM Assembly Code

increment
    ADD r0, r0, #1
    AND r0, r0, #0xFF
    BX lr

Although it may not seem much of an overhead, the compiler has to insert an extra instruction (the AND above) to remove unwanted overflow and restrict the 32-bit result to fit in the declared 8-bit variable. The same would be true when using a 16-bit variable.
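The effect of that masking step can be seen from C. The sketch below assumes an 8-bit `unsigned char`; the function names `increment_u8` and `increment_u8_explicit` are invented for illustration. The second function writes out by hand what the ADD and AND pair does.

```c
/* The addition happens at int width; the assignment back to the
 * unsigned char parameter narrows the result to 8 bits again. */
int increment_u8(unsigned char a)
{
    a = a + 1;
    return a;
}

/* The same behaviour spelled out: add at word size, then mask to
 * the low 8 bits, mirroring ADD followed by AND #0xFF. */
int increment_u8_explicit(unsigned int a)
{
    return (int)((a + 1u) & 0xFFu);
}
```

Both functions wrap from 255 to 0, which is the behaviour the extra instruction exists to preserve.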

So, when moving from other, “smaller” architectures to ARM, a change in mindset is necessary. It is no longer the right decision to choose the smallest possible container for a variable. Instead, 32-bit variables should be the default, as they are the most efficient for arithmetic.
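A loop counter is a common place where this choice shows up. The two functions below are a hypothetical illustration (the names are not from any library): they compute the same sum, but the first declares an 8-bit counter and the second a word-sized one. Whether the narrow version actually costs extra instructions depends on the compiler and optimisation level, so treat this as a sketch of the habit rather than a measured result.

```c
#include <stdint.h>

/* Narrow counter: every i++ must behave as if it wrapped at 8 bits,
 * so the compiler may need to re-narrow i after each increment. */
uint32_t sum_narrow_counter(const uint8_t *data, uint8_t count)
{
    uint32_t total = 0;
    for (uint8_t i = 0; i < count; i++) {
        total += data[i];
    }
    return total;
}

/* Word-sized counter: the increment maps directly onto a 32-bit
 * register, with no narrowing required. */
uint32_t sum_word_counter(const uint8_t *data, uint32_t count)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < count; i++) {
        total += data[i];
    }
    return total;
}
```

The data itself stays as bytes in both versions; only the variable used for arithmetic changes size.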

Store small, process large

But, in many applications, storage space is at a premium. You may therefore still want to choose the smallest viable size for a particular variable so that it takes up the least possible space in memory, and that can be an efficient choice on ARM too. You should nevertheless process items at the natural size of the core, i.e. 32-bit words. ARM processors have byte- and halfword-sized load and store instructions, which make it very easy to do the conversion as you transfer values into and out of registers. Here is an example of incrementing an 8-bit variable held in memory.

C Code

unsigned char a;

void increment_a(void)
{
    a = a + 1;
}

ARM Assembly Code

increment_a
    LDR r0, =&a
    LDRB r1, [r0]
    ADD r1, r1, #1
    STRB r1, [r0]
    BX lr

(Yes, I know that the first statement isn’t legal assembler but you can see what it means!)

Here, the LDRB and STRB instructions automatically zero-extend the 8-bit value when loading it and truncate it when storing it. This takes care of the size adjustment and it is almost free – there may be an additional cycle of latency on the load instruction on some cores. Of course, if you want to do some more complex processing on an 8-bit variable, then it might be necessary to copy it into a word-sized local variable after loading it. It can then be processed at the natural word size and then only truncated again when finally written back to memory.
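The "store small, process large" pattern might look like this in C. This is a minimal sketch under stated assumptions: `scale_samples` is a made-up function, `den` is assumed to be non-zero, and `num` is assumed small enough that `255 * num` fits in 32 bits. Each sample lives in memory as a byte, is widened once on load, processed in a word-sized local, and narrowed once on store.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale each 8-bit sample by num/den, saturating at 255.
 * Assumes den != 0 and that 255 * num does not overflow 32 bits. */
void scale_samples(uint8_t *samples, size_t n, uint32_t num, uint32_t den)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t v = samples[i];    /* byte load, zero-extended to a word */
        v = (v * num) / den;        /* all intermediate work at word size */
        if (v > 255u) {
            v = 255u;               /* saturate before narrowing */
        }
        samples[i] = (uint8_t)v;    /* byte store, narrowed exactly once */
    }
}
```

Scaling the samples 10, 100 and 200 by 3/2 gives 15, 150 and 255: the last value reaches 300 in the word-sized local and is clamped before it is written back as a byte.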

So remember, small isn’t always beautiful!


Sean Ellis over 11 years ago

Another thing to note is that by using an ADDS instruction, the ARM code sequence sets the flags correctly for the 32-bit value as a whole. The longer 8051 sequence will also do so (assuming I'm remembering my 8051 assembly correctly), but the shorter sequence using INC instructions does not, as the INC instruction does not affect the carry or overflow flags.
Chris Shore over 11 years ago

Again, thanks for the comment. I think the 8051 is often the obvious comparator as it is very widely used and many engineers have used it a lot and understand it well.
You are right to point out the much longer code for incrementing a 32-bit value. This exposes one of the major advantages of the ARM microcontroller cores in that they are designed as native 32-bit machines which process 32-bit values very well. Processors, like 8051, with a smaller natural word size are much less efficient by comparison.
Chris Shore over 11 years ago

Thanks for the comment. I think I am in the clear as I am writing from the point of view of the ARM compiler which assumes char to be unsigned by default. Still, to be absolutely clear, I have edited the document and made the declaration explicitly unsigned.
42Bastian over 11 years ago

Your assembler code for int increment(char a); is wrong if you assume char to be signed (as is the following example).
So instead of the AND you need to sign-extend r0.
42Bastian over 11 years ago
I always wonder why always 8051 is used to compare against ARM. Anyway the 8051 code for the increment functions is really poor.
Check this:
increment:
inc r7
jnz exit
inc r6
jnz exit
inc r5
jnz exit
inc r4
exit:
ret
