Hello Community,
We are evaluating the performance overhead introduced by MTE on the Pixel 8 Pro. We selected the SPEC2017 C++ rate suites as benchmarks and pinned them to the Cortex-A715 cores. The StickyTags paper (ieeexplore.ieee.org/.../stamp.jsp) points out that frequent memory tagging (the STG instruction) is a major performance bottleneck of existing MTE-based solutions. However, our results show a more detailed picture, and we find them confusing. Specifically, we enable MTE tag memory for the stack and for the heap separately (via mprotect(..., PROT_MTE...)), and evaluate each under different MTE check modes: ignore mode, async mode, sync mode, and tag checks disabled. In particular, the user-mode process does not execute any STG instruction, i.e., there is no frequent memory tagging. We ran each benchmark at least 3 times and report the average as the final result.
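A minimal sketch of this kind of setup in C, assuming the standard Linux arm64 interface (prctl(PR_SET_TAGGED_ADDR_CTRL, ...) and mprotect with PROT_MTE). The helper names (mte_ctrl_flags, set_mte_mode, enable_tag_memory, set_tco), the mapping of "ignore mode" to PR_MTE_TCF_NONE, and the PROT_READ | PROT_WRITE permissions are illustrative assumptions, not the exact harness used for the measurements:

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values follow the Linux arm64 UAPI. */
#ifndef PROT_MTE
#define PROT_MTE 0x20
#endif
#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#endif
#ifndef PR_TAGGED_ADDR_ENABLE
#define PR_TAGGED_ADDR_ENABLE (1UL << 0)
#endif
#ifndef PR_MTE_TCF_NONE
#define PR_MTE_TCF_NONE  (0UL << 1)
#define PR_MTE_TCF_SYNC  (1UL << 1)
#define PR_MTE_TCF_ASYNC (2UL << 1)
#endif
#ifndef PR_MTE_TAG_SHIFT
#define PR_MTE_TAG_SHIFT 3
#endif

/* Hypothetical names for the three tag check fault modes. */
enum mte_mode { MTE_IGNORE, MTE_ASYNC, MTE_SYNC };

/* Build the PR_SET_TAGGED_ADDR_CTRL argument for a given mode.
 * Assumption: "ignore mode" corresponds to PR_MTE_TCF_NONE. */
unsigned long mte_ctrl_flags(enum mte_mode mode)
{
    unsigned long tcf = PR_MTE_TCF_NONE;
    if (mode == MTE_ASYNC)
        tcf = PR_MTE_TCF_ASYNC;
    if (mode == MTE_SYNC)
        tcf = PR_MTE_TCF_SYNC;
    /* Include mask 0xfffe: allow every tag except 0 for random generation. */
    return PR_TAGGED_ADDR_ENABLE | tcf | (0xfffeUL << PR_MTE_TAG_SHIFT);
}

/* Select the tag check mode for the calling thread; returns 0 on success. */
int set_mte_mode(enum mte_mode mode)
{
    return prctl(PR_SET_TAGGED_ADDR_CTRL, mte_ctrl_flags(mode), 0, 0, 0);
}

/* Mark a page-aligned region (stack or heap) as MTE tag memory.
 * No STG is executed, so all allocation tags stay at their default. */
int enable_tag_memory(void *addr, size_t len)
{
    return mprotect(addr, len, PROT_READ | PROT_WRITE | PROT_MTE);
}

/* Set PSTATE.TCO so that tag checks are overridden for this thread.
 * Only emitted when building for AArch64 with +memtag enabled. */
void set_tco(void)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_MEMORY_TAGGING)
    __asm__ volatile("msr tco, #1");
#endif
}
```

Under these assumptions, a configuration such as stack-mte-ignore-tcf-enable-tco would call set_mte_mode(MTE_IGNORE), then enable_tag_memory() on the stack region, then set_tco() before the benchmark's main work starts.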
The results for the stack are shown below (stack-mte-ignore-tcf means MTE tag memory is enabled for the stack and tag check faults are ignored; enable-tco means MTE tag checks are disabled via "msr tco, #1"):
Each percentage is the performance overhead relative to the same benchmark run without MTE tag memory enabled.
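To make the metric concrete, here is a small sketch of how such a percentage can be computed, assuming the raw measurements are run times in seconds and that the runs are combined with a plain arithmetic mean. The function names mean_runtime and overhead_percent are hypothetical:

```c
#include <stddef.h>

/* Arithmetic mean of n measured run times (requires n >= 1). */
double mean_runtime(const double *runs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double)n;
}

/* Overhead in percent of a configuration relative to the baseline
 * (the same benchmark without MTE tag memory enabled). */
double overhead_percent(double baseline, double measured)
{
    return (measured - baseline) / baseline * 100.0;
}
```

With this definition, a positive value means the configuration is slower than the baseline and a negative value means it is faster.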
Q1: What causes the increased performance overhead when enabling TCO on 531.deepsjeng_r and 526.blender_r? For instance, the overhead of stack-mte-ignore-tcf on 526.blender_r is 0%, yet the overhead of stack-mte-ignore-tcf-enable-tco on 526.blender_r is 14.20%. According to the Arm® Architecture Reference Manual for A-profile architecture, enabling TCO disables tag checks, so we did not expect it to add overhead.
Q2: The result for 510.parest_r is the opposite of that for 531.deepsjeng_r and 526.blender_r: enabling TCO reduces the overhead on 510.parest_r. For instance, the overhead of stack-mte-ignore-tcf on 510.parest_r is 6.82%, while the overhead of stack-mte-ignore-tcf-enable-tco is 2.35%. Why does enabling TCO affect benchmarks so differently, slowing down 531.deepsjeng_r and 526.blender_r but speeding up 510.parest_r?
The results for the heap are shown below (heap-mte-ignore-tcf means MTE tag memory is enabled for the heap and tag check faults are ignored; enable-tco means MTE tag checks are disabled via "msr tco, #1"):
When MTE tag memory is enabled for the heap, most benchmarks slow down noticeably. The worst is 508.namd_r, with more than 30% performance overhead. For 511.povray_r, enabling TCO slows the benchmark down further. For the other benchmarks, except 541.leela_r, enabling TCO effectively reduces the overhead.
Q3: Is the performance overhead shown above reasonable? StickyTags highlights frequent memory tagging (via the STG instruction) as a significant performance bottleneck in current MTE-based solutions. However, our measurements show that merely enabling MTE tag memory for the heap (via mprotect(..., PROT_MTE...)), without executing any STG instruction, already leads to a noticeable performance overhead.
Any help would be much appreciated. Thank you very much in advance.