Code alignment significantly affecting performance?

I am running Thumb-2 code on a Cortex-M7 processor (STM32F750N8).

I am seeing a non-negligible variation in performance depending on whether I insert a single NOP right before a tight loop, which shifts the address of every instruction in the loop by one halfword. No other part of the code is touched.

I am not sure why this is happening. Do Thumb-2 instructions (especially branches) run faster or slower depending on whether they are halfword- or word-aligned? Or is there another explanation?


Below is my very simple code, which just loops and does nothing.

0x80001d6: bf00 nop
0x80001d8: 3c01 subs r4, #1
0x80001da: d1fc bne.n 80001d6
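
To double-check which instructions are actually inside the loop, the branch target can be recomputed from the encoding `d1fc`. The sketch below assumes the 16-bit Thumb conditional branch format (`1101 cond imm8`, offset in halfwords, relative to the branch address plus 4). The function name `thumb_bcond_target` is hypothetical and is not from the original post.

```c
#include <stdint.h>

/* Target of a 16-bit Thumb conditional branch (1101 cond imm8).
 * insn_addr is the address of the branch instruction itself; the PC value
 * used for the offset is insn_addr + 4 in Thumb state.
 * Hypothetical helper, for illustration only. */
static uint32_t thumb_bcond_target(uint32_t insn_addr, uint16_t encoding)
{
    int32_t imm8 = (int32_t)(encoding & 0xFFu);
    if (imm8 & 0x80) {
        imm8 -= 0x100;              /* sign-extend the 8-bit immediate */
    }
    int32_t offset = imm8 * 2;      /* immediate counts halfwords */
    return insn_addr + 4u + (uint32_t)offset;
}
```

For the listing above, `thumb_bcond_target(0x080001dau, 0xd1fcu)` gives `0x080001d6`, so the NOP is part of the loop body and each iteration covers 6 bytes. Because the branch is PC-relative, the same encoding at `0x080001dc` targets `0x080001d8` in the shifted variant.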

This very simple three-instruction loop takes about 100 ms when run 10,000,000 times.
However, when I add a NOP before it (so that the addresses move by 2 bytes to 0x80001d8, 0x80001da, 0x80001dc), the execution time drops significantly, to 75 ms.
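
To put the two measurements on a per-iteration basis, here is a small arithmetic sketch. The helper names are hypothetical, and the core clock is left as a parameter because the post does not state it.

```c
#include <stdint.h>

/* Average time per loop iteration in nanoseconds. */
static double ns_per_iteration(double total_ms, uint64_t iterations)
{
    return total_ms * 1e6 / (double)iterations;
}

/* Average core cycles per iteration. core_hz is an assumption supplied by
 * the caller; the original post does not give the clock frequency. */
static double cycles_per_iteration(double total_ms, uint64_t iterations,
                                   double core_hz)
{
    return total_ms * 1e-3 * core_hz / (double)iterations;
}
```

With the numbers above this works out to about 10 ns per iteration in the slow case and about 7.5 ns in the fast case. Converting that to cycles needs the actual core clock.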

I have tried disabling the I-cache and D-cache and turning off the flash prefetcher and ST's flash accelerator, but the effect is still there.
Is there any possible explanation for this? My thoughts so far:

1. Are halfword-aligned instructions or halfword-aligned branches slower?
2. Can this somehow be related to dual-issue?
3. Can this be because I am crossing some sort of page/bank boundary?
4. Is this vendor-specific, or is it something about the ARM architecture?

I have searched a lot, but have not found any relevant information.

Any help will be appreciated.

Thank you,

  • From the Cortex-M7 manual:


    1.2.2 Prefetch Unit
    The Prefetch Unit (PFU) provides:
    • 64-bit instruction fetch bandwidth.
    • 4x64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.
    • A Branch Target Address Cache (BTAC) for single-cycle turn-around of branch predictor
    state and target address.
    • A static branch predictor when no BTAC is specified.
    • Forwarding of flags for early resolution of direct branches in the decoder and first
    execution stages of the processor pipeline.

    So, to me, this might be the reason.
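
    One way to make this concrete is to count how many 64-bit fetch windows the loop body touches at each address. The sketch below assumes instruction fetches are aligned to 8-byte boundaries, which the quoted text does not state explicitly; `fetch_windows` is a hypothetical helper name. It only shows the address arithmetic and does not prove that this is the cause of the timing difference.

```c
#include <stdint.h>

/* 64-bit fetch width, taken from the PFU description quoted above. */
#define FETCH_BYTES 8u

/* Number of 8-byte-aligned windows covered by a code region.
 * Assumes fetches are aligned to FETCH_BYTES (an assumption, not a
 * statement from the manual excerpt). size_bytes must be non-zero. */
static uint32_t fetch_windows(uint32_t start, uint32_t size_bytes)
{
    uint32_t first = start / FETCH_BYTES;
    uint32_t last = (start + size_bytes - 1u) / FETCH_BYTES;
    return last - first + 1u;
}
```

    For the 6-byte loop from the question, `fetch_windows(0x080001d6u, 6u)` is 2 (the loop straddles the boundary at 0x80001d8), while `fetch_windows(0x080001d8u, 6u)` is 1. So the faster variant is the one where the whole loop fits in a single 64-bit window.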
