I am running Thumb-2 code on a Cortex-M7 processor (STM32F750N8).
I am seeing a non-negligible variation in performance depending on whether I insert a single NOP right before a tight loop (which shifts the address of each instruction by one halfword). No other part of the code is touched.
I am not sure why this is happening. Do Thumb-2 instructions (especially branches) run faster or slower depending on whether they are halfword- or word-aligned? Or is there some other explanation?
Below is my very simple code, which just loops and does nothing:
0x80001d6: bf00 nop
0x80001d8: 3c01 subs r4, #1
0x80001da: d1fc bne.n 80001d6
This very simple three-instruction loop takes about 100 ms to run 10000000 iterations. However, when I add a nop at the beginning (so that the addresses move by 2 bytes to 0x80001d8, 0x80001da, 0x80001dc), the execution time drops significantly, to 75 ms.
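To put these numbers in per-iteration terms, here is a minimal C sketch of the arithmetic. The helper names are made up for illustration, and any core clock passed to the second helper is an assumption (the actual clock setting is not stated here), so substitute the real value.

```c
#include <stdio.h>

/* Hypothetical helper: average time per loop iteration, in nanoseconds. */
static double ns_per_iteration(double total_ms, double iterations)
{
    return total_ms * 1e6 / iterations; /* 1 ms = 1e6 ns */
}

/* Hypothetical helper: average core cycles per iteration.
 * clock_mhz is an ASSUMED core clock, not a value taken from the post. */
static double cycles_per_iteration(double total_ms, double iterations,
                                   double clock_mhz)
{
    /* ns * (cycles per ns) = ns * MHz / 1000 */
    return ns_per_iteration(total_ms, iterations) * clock_mhz / 1000.0;
}
```

With the figures above, `ns_per_iteration(100.0, 1e7)` is 10 ns for the original placement and `ns_per_iteration(75.0, 1e7)` is 7.5 ns for the shifted one. If the core happened to run at 216 MHz (an assumption), that would be roughly 2.16 versus 1.62 cycles per iteration.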
I have tried disabling the I-cache and D-cache, and turning off the flash prefetcher and ST's flash accelerator, but a similar effect was still there.
Is there any possible explanation for this? What I thought was:
1. Are halfword-aligned instructions or halfword-aligned branches slower?
2. Can this somehow be related to dual-issue?
3. Can this be because I am crossing some sort of page/bank boundary?
4. Is this vendor-specific, or is it something about the ARM architecture?
I have searched a lot, but have not found any relevant info.
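The boundary idea in item 3 can at least be checked with plain address arithmetic. The sketch below counts how many aligned blocks the 6-byte loop body touches for a given block size. The function names are hypothetical, and the block size to try (for example 8 bytes, if instructions were fetched 64 bits at a time) is an assumption, not something established here.

```c
#include <stdint.h>

/* Index of the aligned block of block_bytes bytes that contains addr. */
static uint32_t block_index(uint32_t addr, uint32_t block_bytes)
{
    return addr / block_bytes;
}

/* Number of aligned blocks touched by the byte range [first, last].
 * block_bytes is an ASSUMED fetch or line granularity. */
static uint32_t blocks_spanned(uint32_t first, uint32_t last,
                               uint32_t block_bytes)
{
    return block_index(last, block_bytes)
         - block_index(first, block_bytes) + 1u;
}
```

The loop is three 16-bit instructions, so it occupies 0x080001d6..0x080001db in the original build and 0x080001d8..0x080001dd after the extra nop. With an assumed 8-byte block, `blocks_spanned(0x080001d6u, 0x080001dbu, 8u)` is 2 and `blocks_spanned(0x080001d8u, 0x080001ddu, 8u)` is 1. With 16- or 32-byte blocks both placements fit in a single block, so only an 8-byte granularity would separate the two cases.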
Any help will be appreciated.
Thank you,
No, because the cache sits just between the system (RAM/Flash) and the prefetcher. I would not expect a very large impact, though.
The "Flash accelerator" is a special kind of cache that is optimized for the specific Flash and is ST's IP. The prefetcher is part of the core.