Question about the performance overhead introduced by the MTE mechanism

Hello Community,

We are evaluating the performance overhead introduced by MTE on the Pixel 8 Pro. We selected the SPEC2017 C++ rate benchmarks and pinned them to the Cortex-A715 cores. The StickyTags paper (ieeexplore.ieee.org/.../stamp.jsp) points out that frequent memory tagging (the STG instruction) is a major performance bottleneck of existing MTE-based solutions. However, our evaluation results are more nuanced and confusing. Specifically, we enable MTE tag memory for the stack and the heap separately (via mprotect(..., PROT_MTE...)) and evaluate the different MTE tag-check modes: ignore mode, async mode, sync mode, and tag checking disabled. Notably, the user-mode process does not execute a single STG instruction, i.e., there is no frequent memory tagging at all. We ran each benchmark at least 3 times and report the average as the final result.

The results for the stack are shown below (stack-mte-ignore-tcf means MTE tag memory is enabled for the stack and tag check faults are ignored; enable-tco means tag checking is additionally disabled via "msr tco, #1"):

Benchmark(SPEC2017 rate C++) stack-mte-ignore-tcf stack-mte-ignore-tcf-enable-tco stack-mte-async-tcf stack-mte-async-tcf-enable-tco stack-mte-sync-tcf stack-mte-sync-tcf-enable-tco
520.omnetpp_r -1.32% 0.13% -0.13% 0.00% 3.30% 3.04%
523.xalancbmk_r 2.06% 2.35% 5.59% 2.65% 3.24% 5.00%
531.deepsjeng_r -2.97% 12.10% -2.51% 11.42% -1.83% 10.73%
541.leela_r 1.46% 2.71% 1.46% 2.51% 1.88% 3.34%
508.namd_r -0.37% 0.00% -0.37% 0.00% -0.37% 0.00%
510.parest_r 6.82% 2.35% 6.61% 2.35% 6.61% 2.35%
511.povray_r 1.35% 3.37% 1.12% 3.82% 2.70% 3.60%
526.blender_r 0.00% 14.20% 0.63% 13.56% 0.63% 13.88%

The percentages are the performance overhead relative to a baseline run without MTE tag memory enabled (negative values mean the run was slightly faster than the baseline).

Q1: What causes the increased overhead when TCO is enabled on 531.deepsjeng_r and 526.blender_r? For instance, stack-mte-ignore-tcf on 526.blender_r shows 0% overhead, yet stack-mte-ignore-tcf-enable-tco shows 14.20%. According to the Arm® Architecture Reference Manual for A-profile architecture, setting TCO disables tag checks, so we did not expect it to make anything slower.

Q2: 510.parest_r shows the opposite behavior from 531.deepsjeng_r/526.blender_r: enabling TCO effectively reduces its overhead. For instance, stack-mte-ignore-tcf on 510.parest_r is 6.82%, while stack-mte-ignore-tcf-enable-tco is 2.35%. Why does enabling TCO have divergent effects on different benchmarks, slowing down 531.deepsjeng_r and 526.blender_r but speeding up 510.parest_r?


The results for the heap are shown below (heap-mte-ignore-tcf means MTE tag memory is enabled for the heap and tag check faults are ignored; enable-tco means tag checking is additionally disabled via "msr tco, #1"):

Benchmark(SPEC2017 rate C++) heap_mte_ignore_tcf heap_mte_ignore_tcf_enable_tco heap_mte_async_tcf heap_mte_async_tcf_enable_tco heap_mte_sync_tcf heap_mte_sync_tcf_enable_tco
520.omnetpp_r 20.26% -7.45% 21.42% 0.70% 26.19% -9.31%
523.xalancbmk_r 29.95% 5.49% 24.18% 5.22% 37.09% 0.82%
531.deepsjeng_r 9.13% 1.14% 9.13% 0.46% 8.45% 0.00%
541.leela_r 0.42% 1.04% 0.21% 0.84% 0.63% 0.84%
508.namd_r 33.21% 0.37% 35.45% 0.37% 38.81% 0.37%
510.parest_r 12.18% 1.07% 11.97% 1.07% 11.75% 1.07%
511.povray_r 0.68% 5.41% 0.45% 5.63% 3.38% 6.53%
526.blender_r 7.91% 0.32% 7.91% 0.95% 8.23% 0.32%

When MTE tag memory is enabled for the heap, most benchmarks slow down noticeably; the worst is 508.namd_r, with over 30% overhead. For 511.povray_r (and, slightly, 541.leela_r), enabling TCO actually increases the overhead, while for the other benchmarks enabling TCO effectively reduces it.

Q3: Is the performance overhead shown above reasonable? StickyTags highlights frequent memory tagging (via the STG instruction) as a significant performance bottleneck in current MTE-based solutions. However, our measurements show that merely enabling MTE tag memory for the heap (via mprotect(..., PROT_MTE...)), without executing a single STG, already incurs a noticeable performance overhead.
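To make the contrast concrete, the "frequent memory tagging" StickyTags measures is the per-allocation tag-store loop sketched below, which our benchmarks never execute (the asm only compiles when targeting arm64 with FEAT_MTE, and executing it requires PROT_MTE memory; everywhere else it is a no-op):

```c
#include <stddef.h>
#include <stdint.h>

/* Tag every 16-byte granule of [ptr, ptr+size) with the tag carried in
   the top bits of ptr (typically chosen first with IRG). This is the
   STG traffic that tag-memory enablement alone, as in our runs, avoids. */
static void tag_region(void *ptr, size_t size) {
#if defined(__aarch64__) && defined(__ARM_FEATURE_MEMORY_TAGGING)
    uint8_t *p = ptr;
    for (size_t off = 0; off < size; off += 16)   /* MTE granule = 16 bytes */
        __asm__ volatile("stg %0, [%0]" :: "r"(p + off) : "memory");
#else
    (void)ptr; (void)size;   /* compiled out off arm64+MTE; sketch only */
#endif
}
```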

Any help would be very much appreciated. Thank you in advance.