I am running Thumb-2 code on a Cortex-M7 processor, an STM32F750N8.
I am seeing a non-negligible performance variation depending on whether I insert a single NOP right before a tight loop (which shifts the address of every subsequent instruction by one halfword). No other part of the code is touched at all.
I am not sure why this is happening. Do Thumb-2 instructions (branches in particular) run faster or slower depending on whether they are halfword- or word-aligned? Or is there some other explanation?
Below is my very simple code, which just loops around and does nothing:

0x80001d6: bf00    nop
0x80001d8: 3c01    subs r4, #1
0x80001da: d1fc    bne.n 80001d6
This simple three-instruction loop, iterated 10,000,000 times, takes about 100 ms. However, when I add a NOP at the beginning (so that the addresses shift by 2 bytes, to 0x80001d8, 0x80001da, 0x80001dc), the execution time drops significantly, to about 75 ms.
I have tried disabling the I-cache and D-cache, and turning off the flash prefetcher and ST's flash accelerator, but a similar phenomenon remains. Is there any possible explanation for this? What I thought was:
1. Are halfword-aligned instructions, or a halfword-aligned branch, slower?
2. Can this somehow be related to dual-issue?
3. Can this be because I am crossing some sort of page/bank boundary?
4. Can this be vendor-specific, or is this something about the ARM architecture?

I have searched a lot, but have not found any relevant information.
Any help will be appreciated.
Thank you,
From the Cortex-M7 manual:
1.2.2 Prefetch Unit

The Prefetch Unit (PFU) provides:
• 64-bit instruction fetch bandwidth.
• 4x64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.
• A Branch Target Address Cache (BTAC) for single-cycle turn-around of branch predictor state and target address.
• A static branch predictor when no BTAC is specified.
• Forwarding of flags for early resolution of direct branches in the decoder and first execution stages of the processor pipeline.
So, to me, this might be the reason.
Thank you for the pointer! However, I am not sure how that relates to the performance variation when the instruction addresses are offset by a halfword. If you have insight into what might actually be happening, can you help me understand?
When the loop starts at ..1d8 it is on a 64-bit boundary, so the whole loop can be prefetched at once.
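To illustrate: given the 64-bit fetch bandwidth from the PFU description quoted above, one can count how many 8-byte fetch lines the 6-byte loop body touches in each placement (a sketch; the 8-byte granularity is assumed from that description):

```python
# Count 8-byte fetch lines touched by the 6-byte loop body.
# Assumes 64-bit (8-byte) fetch granularity, per the Cortex-M7 PFU quote.

FETCH_LINE = 8
LOOP_BYTES = 6  # nop + subs + bne.n, each a 2-byte Thumb encoding

def fetch_lines(start, size, line=FETCH_LINE):
    first = start // line
    last = (start + size - 1) // line
    return last - first + 1

print(fetch_lines(0x80001D6, LOOP_BYTES))  # starts mid-line -> 2 lines
print(fetch_lines(0x80001D8, LOOP_BYTES))  # starts on the boundary -> 1 line
```

The original placement straddles two fetch lines, so each iteration needs two fetches to cover the loop; the shifted placement fits entirely in one, which is consistent with the measured speedup.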
Thank you for your answer! The performance variability also occurs when the I-cache is turned on. If the I-cache is on, isn't the prefetcher irrelevant? I thought the main benefit of a prefetcher was to bring code from, e.g., flash into the I-cache. Or could a similar effect exist in the instruction fetch unit of the execution pipeline (not the prefetcher)?
No. Please read the above quote again.
I read it again, and read the relevant parts of the Cortex-M7 manual, but I cannot tell which part of my question you are saying "No" to. My problem still occurs when the I-cache is enabled. With the I-cache on, wouldn't the three instructions sit inside the I-cache, making the prefetcher's behavior irrelevant?
No, because the cache sits between the system (RAM/flash) and the prefetcher, though I would not expect a very large impact. The "Flash accelerator" is a special kind of cache that is optimized for the specific flash part and is ST's IP; the prefetcher is part of the core.