Cortex-M3/M4 Benchmark Anomaly: Why is Code in Flash faster than Code in SRAM?

d d 1 month ago

Hi everyone, I've been running some performance benchmarks on Cortex-M3 / M4 and noticed some unexpected results. Would love to hear your thoughts.

**Hardware Setup:**
- Cortex-M3 @ 108MHz, ICache enabled
- Cortex-M4 @ 360MHz, ICache and DCache enabled
- Both using internal SRAM

---

**[Test 1] SHA256**

Cortex-M3:
- Code Flash + Data Flash: 117755 ms
- Code Flash + Data SRAM: 27689 ms
- Code SRAM + Data SRAM: 27843 ms

Cortex-M4:
- Code Flash + Data Flash: 4166 ms
- Code Flash + Data SRAM: 4139 ms
- Code SRAM + Data SRAM: 7875 ms

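→ On M3, moving Data to SRAM dramatically improved performance (about 4.3x), and Code placement made almost no difference. But on M4, Data placement barely matters, and placing Code in SRAM is actually slower than Flash (about 1.9x). Why the opposite behavior?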

---

**[Test 2] Intensive STR/LDR (heavy memory access)**

Cortex-M3:
- Code Flash + Data Flash: 101028 ms
- Code Flash + Data SRAM: 1330 ms
- Code SRAM + Data SRAM: 1988 ms

Cortex-M4:
- Code Flash + Data Flash: 923 ms
- Code Flash + Data SRAM: 786 ms
- Code SRAM + Data SRAM: 1334 ms

→ Moving Data to SRAM gives a massive improvement on M3 (about 76x) and a modest one on M4 (about 15%). But for both M3 and M4, Code in Flash is faster than Code in SRAM. Why?

---

**[Test 3] ALU only (pure computation, no memory access)**

Cortex-M3:
- Code in Flash: 9967 ms
- Code in SRAM: 22149 ms

Cortex-M4:
- Code in Flash: 3603 ms
- Code in SRAM: 5339 ms

→ This is the most confusing result. These are pure ALU operations with no data memory access, and ICache is enabled on both cores, yet Code in Flash is faster than Code in SRAM on both (about 2.2x on M3, about 1.5x on M4). Shouldn't SRAM execution be faster?

My hypothesis is that ICache only caches instructions fetched from Flash, and that when Code is placed in SRAM the ICache is completely bypassed, so every instruction fetch goes over the System Bus and incurs extra latency. But I'm not sure this is correct.

---

**Questions:**
1. Does ICache only work for Flash? Is it completely bypassed when Code runs from SRAM?
2. In the STR/LDR test, why is Code in Flash faster than Code in SRAM?
3. In the M4 SHA256 test, Code in SRAM is slower than Flash — the opposite of what I'd expect. What causes this?
4. Are these results related to AHB Bus latency?
5. On M4, why does Code placement make such a huge difference, while Data placement (Flash vs. SRAM) has relatively little impact?
6. Are these phenomena more related to IBus, DBus, and SBus (ICode, DCode, and System bus) behavior?
7. Based on the results, it seems placing everything in SRAM does NOT always give the best performance. Is this a correct conclusion?

Thanks in advance!