This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Code alignment significantly affecting performance?

I am running Thumb-2 instruction code on M-7 processor, STM32F750N8.

I am seeing non-negligible performance number variation depending on whether I insert one NOP right before a tight loop (and changing the address of each instruction by halfword). Other parts of the code are not touched at all.

I am not sure why this is happening. Does Thumb-2 instruction (especially, branches) run faster or slower depending on whether they are halfword- or word-aligned? Or can there be any other explanation?


Below is my very simple code, which simply loops around and does nothing.

0x80001d6: bf00 nop

0x80001d8: 3c01 subs r4, #1

0x80001da: d1fc bne.n 80001d6

This very simple 3 line code, when I loop for 10000000 times, takes about 100ms.
However, when I add nop at the beginning (so that the addresses move by 2 bytes to 0x80001d8, 0x80001da, 0x80001dc), the execution time is significantly reduced to 75ms.

I have tried disabling the I-cache and D-cache, and turned off the flash prefetcher and ST's flash accelerator, but a similar phenomenon was still there.
Is there any possible explanation for this? What I thought was:

1. Is halfword-aligned instructions or halfword-aligned branch slower?

2. Can this somehow be related to dual-issue?
3. Can this be because I am crossing some sort of a page/bank boundary?

4. Can this be vendor-specific or is this something about the ARM architecture?

I have searched a lot, but have not seen any relevant info.

Any help will be appreciated.

Thank you,