I am running Thumb-2 code on a Cortex-M7 processor, the STM32F750N8.
I am seeing a non-negligible performance variation depending on whether I insert a single NOP right before a tight loop (which shifts the address of every instruction in the loop by one halfword). No other part of the code is touched.
I am not sure why this is happening. Do Thumb-2 instructions (especially branches) run faster or slower depending on whether they are halfword- or word-aligned? Or can there be some other explanation?
Below is my very simple code, which simply loops around and does nothing:
0x80001d6: bf00 nop
0x80001d8: 3c01 subs r4, #1
0x80001da: d1fc bne.n 80001d6
When I run this very simple three-instruction loop 10,000,000 times, it takes about 100 ms. However, when I add a NOP before the loop (so that the addresses shift by 2 bytes to 0x80001d8, 0x80001da, 0x80001dc), the execution time drops significantly, to about 75 ms.
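To make the comparison repeatable, the loop's alignment can be pinned explicitly rather than relying on wherever the linker happens to place it. A minimal sketch, assuming GCC or Clang with -mcpu=cortex-m7 -mthumb; spin() is an illustrative name, not code from the post:

#include <stdint.h>

/* Same 3-instruction body as the listing above: nop / subs / bne.         */
/* .p2align 3 places the loop entry on an 8-byte (64-bit) boundary;        */
/* uncommenting the extra nop shifts the entry by one halfword instead.    */
static void __attribute__((noinline)) spin(uint32_t n)
{
    __asm volatile(
        ".p2align 3         \n"
        /* "nop             \n"    <- uncomment to de-align the loop entry */
        "1:  nop            \n"
        "    subs %0, #1    \n"
        "    bne  1b        \n"
        : "+r"(n)
        :
        : "cc");
}

Calling spin(10000000) with and without the commented-out nop should reproduce both timings if the alignment of the loop entry really is the cause.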
I have tried disabling the I-cache and D-cache, and I turned off the flash prefetcher and ST's flash accelerator, but a similar phenomenon was still there (see the register-level sketch after the list below for how these are typically toggled). Is there any possible explanation for this? What I thought of was:
1. Are halfword-aligned instructions, or a halfword-aligned branch, slower?
2. Can this somehow be related to dual-issue?
3. Can this be because I am crossing some sort of page/bank boundary?
4. Can this be vendor-specific, or is it something about the ARM architecture?
I have searched a lot, but have not found any relevant info.
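(For reference, the cache and accelerator toggles mentioned above typically look like the following. This is a hedged sketch assuming CMSIS core_cm7.h and the ST stm32f7xx device header, not necessarily the exact code used here.)

#include "stm32f7xx.h"

/* Turn off everything between the core and the flash that could cache or  */
/* prefetch instructions, so only the core's own fetch behaviour remains.  */
static void disable_caches_and_accelerators(void)
{
    SCB_DisableICache();                     /* CMSIS: core I-cache off     */
    SCB_DisableDCache();                     /* CMSIS: core D-cache off     */
    FLASH->ACR &= ~(FLASH_ACR_PRFTEN         /* flash prefetch off          */
                  | FLASH_ACR_ARTEN);        /* ST's ART accelerator off    */
    __DSB();                                 /* ensure the changes take     */
    __ISB();                                 /* effect before fetching on   */
}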
Any help will be appreciated.
Thank you,
When the loop starts at ..1d8 it is on a 64-bit boundary, so the whole 6-byte loop fits in a single 64-bit fetch and can be prefetched at once. When it starts at ..1d6, the loop straddles two 64-bit fetch lines, so each iteration likely pays for an extra fetch.
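To make the arithmetic concrete, using the addresses from the question and assuming 8-byte (64-bit) fetch lines:

loop entry 0x80001d6 (slow case):
    fetch line 0x80001d0..0x80001d7 : ...               nop  @ 0x80001d6
    fetch line 0x80001d8..0x80001df : subs @ 0x80001d8   bne @ 0x80001da

loop entry 0x80001d8 (fast case):
    fetch line 0x80001d8..0x80001df : nop @ 0x80001d8   subs @ 0x80001da   bne @ 0x80001dc

In the slow case every iteration spans two fetch lines; in the fast case the whole loop sits in one.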
Thank you for your answer! The performance variability also occurs when the I-cache is turned on. If the I-cache is on, isn't the prefetcher irrelevant? I thought the main benefit of a prefetcher is to bring code from, e.g., flash into the I-cache. Or could a similar effect exist in the instruction fetch unit of the execution pipeline (rather than the prefetcher)?
No. Please read the above quote again.
I read it again, and I read the relevant parts of the Cortex-M7 documentation, but I cannot tell which part of my question you are saying "No" to. My problem still occurs when the I-cache is enabled. With the I-cache on, wouldn't the three instructions sit inside the I-cache, making the prefetcher behavior irrelevant?
No, because the cache sits between the system (RAM/flash) and the prefetcher, so the prefetcher is still in the fetch path even when the code hits in the I-cache; though I would not expect a very large impact from it. The "flash accelerator" is a special kind of cache that is optimized for the specific flash and is ST's IP. The prefetcher is part of the core.
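One way to take the clock configuration and wall-clock measurement out of the discussion is to count core cycles per iteration directly with the DWT cycle counter. A sketch, assuming the CMSIS register definitions and the hypothetical spin() loop from the earlier sketch; on some Cortex-M7 parts the DWT lock access register may also need to be written with the unlock key before the counter can be enabled:

#include "stm32f7xx.h"

/* Returns the number of core cycles spent in spin(n). */
static uint32_t cycles_for(uint32_t n)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the DWT block */
    DWT->CYCCNT = 0;                                 /* reset cycle counter  */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start cycle counter  */

    spin(n);                                         /* loop under test      */

    return DWT->CYCCNT;                              /* cycles elapsed       */
}

Dividing the result by n gives cycles per iteration for each alignment, which is easier to reason about than milliseconds.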