Code alignment significantly affecting performance?

kmaeng over 4 years ago

I am running Thumb-2 code on a Cortex-M7 processor (STM32F750N8).

I am seeing a non-negligible variation in performance depending on whether I insert one NOP right before a tight loop (which shifts the address of each instruction by a halfword). No other part of the code is touched.

I am not sure why this is happening. Do Thumb-2 instructions (especially branches) run faster or slower depending on whether they are halfword- or word-aligned? Or is there some other explanation?

Below is my very simple code, which simply loops around and does nothing.

0x80001d6: bf00 nop

0x80001d8: 3c01 subs r4, #1

0x80001da: d1fc bne.n 80001d6

When I run this three-instruction loop 10,000,000 times, it takes about 100 ms.
However, when I add a nop before the loop (so that the addresses move by 2 bytes to 0x80001d8, 0x80001da, 0x80001dc), the execution time drops significantly, to about 75 ms.
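
As a rough sanity check on these numbers, the per-iteration cost can be worked out directly. The sketch below is only an illustration run on a host PC, not target code. The 216 MHz core clock is an assumption (the clock is not stated anywhere in this post), so the cycle figures are indicative at best; the function names are made up for the example.

```python
# Per-iteration cost of the loop, from the timings quoted in the post.
ITERATIONS = 10_000_000

# ASSUMPTION: the post does not state the core clock; 216 MHz is used
# here purely for illustration.
ASSUMED_CLOCK_MHZ = 216


def ns_per_iteration(total_ms, iterations=ITERATIONS):
    """Average time per loop iteration in nanoseconds."""
    return total_ms * 1_000_000 / iterations


def cycles_per_iteration(total_ms, clock_mhz=ASSUMED_CLOCK_MHZ):
    """Average core cycles per iteration at the assumed clock."""
    return ns_per_iteration(total_ms) * clock_mhz / 1000


if __name__ == "__main__":
    for label, ms in (("loop at 0x80001d6", 100), ("loop at 0x80001d8", 75)):
        print(f"{label}: {ns_per_iteration(ms):.1f} ns/iter, "
              f"{cycles_per_iteration(ms):.2f} cycles/iter (assumed clock)")
```

Whatever the real clock is, the ratio is what matters: the first placement takes about a third more time per iteration than the second (100 ms versus 75 ms).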

I have tried disabling the I-cache and D-cache, and turning off the flash prefetcher and ST's flash accelerator, but the same effect remained.
Is there any possible explanation for this? My thoughts so far:

1. Are halfword-aligned instructions or halfword-aligned branches slower?
2. Can this somehow be related to dual-issue?
3. Can this be because I am crossing some sort of page/bank boundary?
4. Can this be vendor-specific, or is it something about the ARM architecture?

I have searched a lot, but have not seen any relevant info.

Any help will be appreciated.

Thank you,

42Bastian Schick over 4 years ago

From the Cortex-M7 manual:

1.2.2 Prefetch Unit
The Prefetch Unit (PFU) provides:
• 64-bit instruction fetch bandwidth.
• 4x64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.
• A Branch Target Address Cache (BTAC) for single-cycle turn-around of branch predictor state and target address.
• A static branch predictor when no BTAC is specified.
• Forwarding of flags for early resolution of direct branches in the decoder and first execution stages of the processor pipeline.

So, to me, this might be the reason.

kmaeng over 4 years ago in reply to 42Bastian Schick

Thank you for the pointer! However, I am not sure how that relates to the performance variation I see when the instruction addresses are offset by a halfword. If you have any insight into what might actually be happening, could you help me understand?

42Bastian Schick over 4 years ago in reply to kmaeng

When the loop starts at ...1d8, it is on a 64-bit boundary, so the whole loop can be pre-fetched at once.
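
To make the alignment argument concrete, here is a small host-side sketch (Python, not target code) that counts how many 64-bit fetch blocks the loop body touches at each placement. It assumes the PFU fetches naturally aligned 8-byte blocks, which is one reading of the "64-bit instruction fetch bandwidth" line quoted from the manual; the function name is made up for illustration.

```python
FETCH_BYTES = 8  # ASSUMPTION: fetches are naturally aligned 64-bit blocks


def fetch_blocks_touched(start_addr, length_bytes, fetch_bytes=FETCH_BYTES):
    """Number of aligned fetch blocks covered by
    [start_addr, start_addr + length_bytes)."""
    first = start_addr // fetch_bytes
    last = (start_addr + length_bytes - 1) // fetch_bytes
    return last - first + 1


if __name__ == "__main__":
    LOOP_BYTES = 6  # three 16-bit Thumb instructions: nop, subs, bne.n
    for start in (0x80001D6, 0x80001D8):
        print(hex(start), fetch_blocks_touched(start, LOOP_BYTES))
```

Under that assumption, the loop starting at 0x80001d6 spans two blocks while the one starting at 0x80001d8 fits in one. That matches the direction of the timing difference, but it does not by itself prove this is the cause.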

kmaeng over 4 years ago in reply to 42Bastian Schick

Thank you for your answer! The performance variability also occurs when the I-cache is turned on. If the I-cache is on, isn't the prefetcher irrelevant? I thought the main benefit of a prefetcher was to bring code from, e.g., Flash into the I-cache.
Or I wonder whether a similar problem could exist in the instruction fetch unit of the execution pipeline (not the prefetcher).

42Bastian Schick over 4 years ago in reply to kmaeng

No. Please read the quote above again.

kmaeng over 4 years ago in reply to 42Bastian Schick

I read it again, along with the relevant parts of the Cortex-M7 documentation, but I cannot tell which part of my question you are saying "No" to. My problem still occurs with the I-cache enabled. With the I-cache on, wouldn't the three instructions sit in the I-cache, making the prefetcher's behavior irrelevant?

42Bastian Schick over 4 years ago in reply to kmaeng

No, because the cache sits between the system (RAM/Flash) and the prefetcher. Though I would not expect a very large impact.
The "Flash accelerator" is a special kind of cache that is optimized for the specific Flash and is ST's IP. The prefetcher is part of the core.