Question about the performance overhead introduced by the MTE mechanism

Hello Community,

We are evaluating the performance overhead introduced by MTE on the Pixel 8 Pro. We selected the SPEC2017 C++ rate benchmarks and pinned them to the Cortex-A715 cores. The StickyTags paper (ieeexplore.ieee.org/.../stamp.jsp) points out that frequent memory tagging (the STG instruction) is a major performance bottleneck of existing MTE-based solutions. However, our evaluation results are more nuanced and confusing. Specifically, we enable MTE tag memory for the stack and the heap separately (via mprotect(..., PROT_MTE...)) and evaluate the different MTE tag-check modes: ignore mode, async mode, sync mode, and tag checking disabled. Notably, the user-mode process does not execute a single STG instruction, i.e., there is no frequent memory tagging at all. We ran each benchmark at least 3 times and report the average as the final result.

The results for the stack are shown below (stack-mte-ignore-tcf means MTE tag memory is enabled for the stack and tag check faults are ignored; enable-tco means tag checking is additionally disabled via "msr tco, #1"):

Benchmark(SPEC2017 rate C++) stack-mte-ignore-tcf stack-mte-ignore-tcf-enable-tco stack-mte-async-tcf stack-mte-async-tcf-enable-tco stack-mte-sync-tcf stack-mte-sync-tcf-enable-tco
520.omnetpp_r -1.32% 0.13% -0.13% 0.00% 3.30% 3.04%
523.xalancbmk_r 2.06% 2.35% 5.59% 2.65% 3.24% 5.00%
531.deepsjeng_r -2.97% 12.10% -2.51% 11.42% -1.83% 10.73%
541.leela_r 1.46% 2.71% 1.46% 2.51% 1.88% 3.34%
508.namd_r -0.37% 0.00% -0.37% 0.00% -0.37% 0.00%
510.parest_r 6.82% 2.35% 6.61% 2.35% 6.61% 2.35%
511.povray_r 1.35% 3.37% 1.12% 3.82% 2.70% 3.60%
526.blender_r 0.00% 14.20% 0.63% 13.56% 0.63% 13.88%

The percentages are the performance overhead relative to a baseline run without MTE tag memory enabled (negative values mean the run was slightly faster than the baseline).

Q1: What causes the increased overhead when TCO is enabled on 531.deepsjeng_r and 526.blender_r? For instance, stack-mte-ignore-tcf on 526.blender_r shows 0% overhead, yet stack-mte-ignore-tcf-enable-tco shows 14.20%. According to the Arm® Architecture Reference Manual for A-profile architecture, setting TCO disables tag checks, so we did not expect it to make anything slower.

Q2: 510.parest_r shows the opposite behavior from 531.deepsjeng_r/526.blender_r: enabling TCO effectively reduces its overhead. For instance, stack-mte-ignore-tcf on 510.parest_r is 6.82%, while stack-mte-ignore-tcf-enable-tco is 2.35%. Why does enabling TCO have divergent effects on different benchmarks, slowing down 531.deepsjeng_r and 526.blender_r but speeding up 510.parest_r?


The results for the heap are shown below (heap-mte-ignore-tcf means MTE tag memory is enabled for the heap and tag check faults are ignored; enable-tco means tag checking is additionally disabled via "msr tco, #1"):

Benchmark(SPEC2017 rate C++) heap_mte_ignore_tcf heap_mte_ignore_tcf_enable_tco heap_mte_async_tcf heap_mte_async_tcf_enable_tco heap_mte_sync_tcf heap_mte_sync_tcf_enable_tco
520.omnetpp_r 20.26% -7.45% 21.42% 0.70% 26.19% -9.31%
523.xalancbmk_r 29.95% 5.49% 24.18% 5.22% 37.09% 0.82%
531.deepsjeng_r 9.13% 1.14% 9.13% 0.46% 8.45% 0.00%
541.leela_r 0.42% 1.04% 0.21% 0.84% 0.63% 0.84%
508.namd_r 33.21% 0.37% 35.45% 0.37% 38.81% 0.37%
510.parest_r 12.18% 1.07% 11.97% 1.07% 11.75% 1.07%
511.povray_r 0.68% 5.41% 0.45% 5.63% 3.38% 6.53%
526.blender_r 7.91% 0.32% 7.91% 0.95% 8.23% 0.32%

When MTE tag memory is enabled for the heap, most benchmarks slow down noticeably; the worst is 508.namd_r, with over 30% overhead. For 511.povray_r (and, slightly, 541.leela_r), enabling TCO actually increases the overhead, while for the other benchmarks enabling TCO effectively reduces it.

Q3: Is the performance overhead shown above reasonable? StickyTags highlights frequent memory tagging (via the STG instruction) as a significant performance bottleneck in current MTE-based solutions. However, our measurements show that merely enabling MTE tag memory for the heap (via mprotect(..., PROT_MTE...)), without executing a single STG, already incurs a noticeable performance overhead.
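To make the contrast concrete, the "frequent memory tagging" StickyTags measures is the per-allocation tag-store loop sketched below, which our benchmarks never execute (the asm only compiles when targeting arm64 with FEAT_MTE, and executing it requires PROT_MTE memory; everywhere else it is a no-op):

```c
#include <stddef.h>
#include <stdint.h>

/* Tag every 16-byte granule of [ptr, ptr+size) with the tag carried in
   the top bits of ptr (typically chosen first with IRG). This is the
   STG traffic that tag-memory enablement alone, as in our runs, avoids. */
static void tag_region(void *ptr, size_t size) {
#if defined(__aarch64__) && defined(__ARM_FEATURE_MEMORY_TAGGING)
    uint8_t *p = ptr;
    for (size_t off = 0; off < size; off += 16)   /* MTE granule = 16 bytes */
        __asm__ volatile("stg %0, [%0]" :: "r"(p + off) : "memory");
#else
    (void)ptr; (void)size;   /* compiled out off arm64+MTE; sketch only */
#endif
}
```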

Any help would be very much appreciated. Thank you in advance.