Arm Neoverse N-Series and V-Series processors do not implement SMT (Simultaneous Multithreading) technology. When running on an Arm Neoverse processor, every thread always has access to the full resources of the processor. This makes for more predictable execution in a cloud environment, ensures every thread gets full access to processor resources, and provides stronger protection against unintended data leaks between threads.
In a processor that implements SMT technology, each physical processor is divided into two or more logical cores. These logical cores share some resources with each other. For example, a common design shares execution units used for cryptography, video, or AI processing along with other processor structures such as branch predictors, prefetchers, and caches. In an SMT system each logical core has its own registers and a program counter, enabling each logical core to execute an independent execution thread. Typical SMT implementations in the market include Intel’s Hyper-Threading and AMD’s SMT.
When comparing utilization between Arm Neoverse processors and processor architectures that enable SMT, Arm can appear to show higher CPU usage at a similar light load level. This can give operators the impression that Arm platforms have less headroom available for expansion. In this blog post, we explore why the "CPU Utilization" metric can be misleading in light load scenarios when comparing systems with and without SMT.
Linux calculates CPU utilization based on whether a core is working on something or idle. Logical cores in SMT mode share the execution resources of a physical core. Under light load, a logical core can run at nearly the full speed of the physical core, so the OS may show low CPU usage. But this does not mean the physical core's load is that low, because the loads from both logical cores land on the same physical core.
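To illustrate where this number comes from, here is a minimal sketch (not how top is actually implemented, and assuming the standard /proc/stat format documented in proc(5)) that derives utilization from the busy and idle jiffies the kernel reports. Per-core lines (cpu0, cpu1, ...) follow the same format, and the kernel accounts them per logical core, which is exactly why the metric cannot see sharing of the underlying physical core:

/* Sketch: derive CPU utilization from /proc/stat, the same per-logical-core
 * accounting that tools like top build on. Fields per proc(5): user, nice,
 * system, idle, iowait, irq, softirq, steal. */
#include <stdio.h>
#include <unistd.h>

/* Read aggregate busy and total jiffies from the first "cpu" line. */
static int read_cpu_times(unsigned long long *busy, unsigned long long *total)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal;
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) != 8) {
        fclose(f);
        return -1;
    }
    fclose(f);
    *busy  = user + nice + sys + irq + softirq + steal;
    *total = *busy + idle + iowait;
    return 0;
}

int main(void)
{
    unsigned long long busy1, total1, busy2, total2;
    if (read_cpu_times(&busy1, &total1) != 0)
        return 1;
    sleep(1); /* sample interval */
    if (read_cpu_times(&busy2, &total2) != 0)
        return 1;
    /* Utilization = share of non-idle jiffies between the two samples. */
    printf("CPU utilization: %.1f%%\n",
           100.0 * (double)(busy2 - busy1) / (double)(total2 - total1));
    return 0;
}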
As a result, a lightly loaded SMT system can look like it has more spare capacity than a system without SMT. For example, consider the following significantly simplified scenario.
Figure 1: CPU Usage Explanation Under SMT
Looking at the utilization of the logical cores in this example, an operator may conclude there is capacity to handle 14 more web requests. In reality, there is only capacity to handle 4 more.
Note that this example is simplified from a realistic workload, which may see lower physical processor utilization depending on the dynamic conditions that enable sharing of physical processor resources between the logical cores. This unpredictability can be a challenge when trying to estimate how much headroom may be present on your system.
In the rest of this post, we use a micro benchmark as well as a real workload to demonstrate this behavior on an SMT system.
The micro benchmark performs many loads from a small array plus some simple math, which produces a high IPC that can fully utilize the available execution units of a physical processor. To simulate different load levels, we add different amounts of sleep between calculation loops; load level 1 is the lightest and 7 is the heaviest. On the SMT-enabled system, we run the program first on one logical core, and then on both logical cores of the same physical processor, for all load levels. For the non-SMT system we use Arm Neoverse N2, running on two cores attached to the same Arm CMN crosspoint. Note that we disabled vectorization when compiling, because most code in real workloads is not as vectorization-friendly as this micro benchmark.
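The source of our high_ipc tool is not included here, but the following minimal sketch captures the idea described above; the array size, arithmetic, and sleep scheme are illustrative assumptions rather than the exact implementation:

/* Sketch of the micro benchmark: streaming loads from a small array that
 * stays resident in L1 cache, plus simple integer math, sustains a high IPC.
 * Sleeps between compute batches simulate load levels (1 = lightest,
 * 7 = heaviest, which never sleeps).
 * Build without vectorization, e.g.: gcc -O2 -fno-tree-vectorize */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 1024  /* small enough to stay in L1d */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <load_level 1-7> <rounds>\n", argv[0]);
        return 1;
    }
    int level = atoi(argv[1]);
    long rounds = atol(argv[2]);

    unsigned int a[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
        a[i] = i;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    volatile unsigned int sink = 0;
    for (long r = 0; r < rounds; r++) {
        unsigned int acc = 0;
        for (int i = 0; i < ARRAY_SIZE; i++)
            acc += a[i] * 3 + (a[i] >> 2);   /* loads plus simple math */
        sink = acc;

        /* Lower load levels sleep more often between compute batches. */
        if (level < 7 && r % (1L << level) == 0) {
            struct timespec ts = { 0, 100000 }; /* 100 microseconds */
            nanosleep(&ts, NULL);
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("Time elapsed: %f seconds\n",
           (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9);
    return 0;
}

Running a program like this under perf stat with taskset, as shown below, yields the Total Run Time and IPC measurements we collect.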
We run the micro benchmark on a popular SMT system running at 3.5 GHz and on an Arm Neoverse N2 based platform (2.7 GHz), which does not have SMT.
On our system, logical core 0 and logical core 32 share the same physical core (on Linux, sibling pairs can be confirmed by reading /sys/devices/system/cpu/cpu0/topology/thread_siblings_list). So we will run on cores 0 and 32.
cr@ent-x86-15:/home/cr$ lstopo
Machine (125GB total)
  Package L#0
    NUMANode L#0 (P#0 62GB)
    L3 L#0 (24MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#34)
      L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#36)
......
Here is an example output of running the micro benchmark on two cores at the same time. 1000000000 is the number of rounds of compute work done by the micro benchmark tool, and 7 is the load level we want to achieve. We collect the elapsed time as Total Run Time and insn per cycle as IPC from the perf tool; CPU usage is taken from the top command.
# In one window
cr@ent-x86-15:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 0 ./high_ipc 7 1000000000
Time elapsed: 147.196152 seconds

 Performance counter stats for 'taskset -c 0 ./high_ipc 7 1000000000':

        147,202.33 msec task-clock                #    1.000 CPUs utilized
               910      context-switches          #    6.182 /sec
                 1      cpu-migrations            #    0.007 /sec
               146      page-faults               #    0.992 /sec
   499,711,000,190      cycles                    #    3.395 GHz
   702,978,039,409      instructions              #    1.41  insn per cycle
    50,551,657,554      branches                  #  343.416 M/sec
         1,601,409      branch-misses             #    0.00% of all branches
                        TopdownL1                 #     9.4 %  tma_backend_bound
                                                  #     2.0 %  tma_bad_speculation
                                                  #    19.0 %  tma_frontend_bound
                                                  #    69.6 %  tma_retiring

     147.235361847 seconds time elapsed

     147.198440000 seconds user
       0.000000000 seconds sys

# In another window
cr@ent-x86-15:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 32 ./high_ipc 7 1000000000
Time elapsed: 147.195379 seconds

 Performance counter stats for 'taskset -c 32 ./high_ipc 7 1000000000':

        147,196.87 msec task-clock                #    1.000 CPUs utilized
                44      context-switches          #    0.299 /sec
                 1      cpu-migrations            #    0.007 /sec
               144      page-faults               #    0.978 /sec
   499,711,402,451      cycles                    #    3.395 GHz
   702,986,727,897      instructions              #    1.41  insn per cycle
    50,552,160,460      branches                  #  343.432 M/sec
         1,400,427      branch-misses             #    0.00% of all branches
                        TopdownL1                 #     9.0 %  tma_backend_bound
                                                  #     2.0 %  tma_bad_speculation
                                                  #    23.2 %  tma_frontend_bound
                                                  #    65.8 %  tma_retiring

     147.201061034 seconds time elapsed

     147.193656000 seconds user
       0.003999000 seconds sys
Here are the results collected:
From the data, we can see that starting from load levels 3 and 4, the combined CPU usage of the two logical cores gradually exceeds 100%, and correspondingly the IPC of both cores drops substantially.
On the Neoverse N2 platform, each core is an independent physical core. We will run on cores 0 and 1.
# In one window
cr@wls-arm-n2:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 0 ./high_ipc 7 1000000000
Time elapsed: 140.039919 seconds

 Performance counter stats for 'taskset -c 0 ./high_ipc 7 1000000000':

        140,041.55 msec task-clock                #    1.000 CPUs utilized
               534      context-switches          #    3.813 /sec
                 1      cpu-migrations            #    0.007 /sec
               115      page-faults               #    0.821 /sec
   384,908,584,592      cycles                    #    2.749 GHz                  (42.85%)
   952,810,598,431      instructions              #    2.48  insn per cycle      (57.14%)
    50,555,565,095      branches                  #  361.004 M/sec               (71.43%)
           451,503      branch-misses             #    0.00% of all branches     (71.43%)
                        TopdownL1                 #     0.0 %  bad_speculation
                                                  #    46.9 %  retiring          (57.15%)
                                                  #    27.0 %  frontend_bound    (42.86%)
                                                  #    26.2 %  backend_bound     (28.57%)

     140.044867984 seconds time elapsed

     140.042186000 seconds user
       0.000000000 seconds sys

# In another window
cr@wls-arm-n2:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 1 ./high_ipc 7 1000000000
Time elapsed: 140.028971 seconds

 Performance counter stats for 'taskset -c 1 ./high_ipc 7 1000000000':

        140,030.50 msec task-clock                #    1.000 CPUs utilized
                38      context-switches          #    0.271 /sec
                 1      cpu-migrations            #    0.007 /sec
               114      page-faults               #    0.814 /sec
   384,880,893,997      cycles                    #    2.749 GHz                  (42.86%)
   952,883,898,513      instructions              #    2.48  insn per cycle      (57.14%)
    50,559,218,606      branches                  #  361.059 M/sec               (71.43%)
           555,101      branch-misses             #    0.00% of all branches     (71.43%)
                        TopdownL1                 #     0.0 %  bad_speculation
                                                  #    46.9 %  retiring          (57.14%)
                                                  #    27.1 %  frontend_bound    (42.86%)
                                                  #    26.0 %  backend_bound     (28.57%)

     140.026846716 seconds time elapsed

     140.031190000 seconds user
       0.000000000 seconds sys
The results show that on the Arm Neoverse system, the IPC with two cores is nearly the same as with one core, and it does not drop from light load levels to heavy ones. This is quite different from the SMT system: the capability of each CPU stays consistent across all CPU utilization levels.
It is not easy to find a reliable direct metric for how busy a physical core is, but we can infer it from other angles. Here we pick IPC to analyze: we divide the data from the 2-core run by the data from the 1-core run. This ratio reveals the character of SMT systems.
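To make this concrete, the metric plotted below is simply:

IPC Ratio = IPC (2-core run) / IPC (1-core run)

A ratio near 1 means the second core adds no interference, as with independent physical cores; a ratio approaching 0.5 means the two logical cores are splitting the throughput of a single physical core.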
In the following chart, we see that under light load, the IPC with two cores is almost the same as with one core on the SMT system. The physical core is not busy, so operations from both logical cores are served at nearly the full speed of the physical core, which keeps the IPC high while CPU usage looks low. But the physical core is running double the load, so the real usage of the physical CPU should be doubled. As the load increases, the two-core IPC decreases dramatically compared to one core, because the two threads must compete for and share the resources of the execution engine. The physical core becomes fully utilized and can no longer handle all the requests from both cores as promptly as under light load. The IPC eventually drops to almost half, which proves the two logical cores are sharing one physical core.
In contrast, the Arm Neoverse system stays at a ratio of 1 because the two cores are independent; they always run at the same speed as a single core no matter what the load level is.
Figure 2: Micro Benchmark IPC Comparison
Here we define performance as "1000000000 / Total Run Time", where 1000000000 is the number of rounds of the micro benchmark. Since the frequency of the N2 platform we use (2.7 GHz) is much lower than the SMT system's (3.5 GHz), we adjust the performance data of the N2 platform to the same 3.5 GHz. We do this for the 2-core case, which is the normal production use case where both logical cores of a physical core are used.
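Concretely, assuming performance scales linearly with frequency (a reasonable approximation for this compute-bound micro benchmark):

Adjusted N2 Performance = (1000000000 / Total Run Time) x (3.5 / 2.7)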
This gives us the following chart of Frequency Adjusted Performance vs. CPU Usage. As CPU usage increases, the SMT system's performance output flattens, while Arm's performance output keeps a near-perfect linear trend and finally surpasses the SMT system in the high CPU usage region. So, if you use performance output in the low CPU usage region to predict performance under high CPU usage on an SMT system, you may get the wrong result.
Figure 3: Micro Benchmark Adjusted Performance
Here we define a "Performance Achievement Rate" metric as "Performance / CPU Usage", using the performance data from the previous table. An ideal system would show a constant ratio for a workload, meaning a given CPU usage always yields a proportional performance output.
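For example, with hypothetical numbers: a system that delivers a performance score of 40 at 50% CPU usage has an achievement rate of 0.8, so an ideal system should deliver a score of about 80 at 100% usage. A falling rate means each additional percent of reported CPU usage buys less performance than the last.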
From the chart, we see that as CPU usage increases, the Performance Achievement Rate drops significantly on the SMT system, while it stays much more constant on Arm.
Figure 4: Micro Benchmark Adjusted Performance Achievement Rate
To illustrate the issue with a well-known workload, we used Flink together with Nexmark, a benchmark suite for Flink.
We created two clusters with similar hardware resources: one for the SMT system and one for an Arm Neoverse N2 based system, which does not have SMT. Software versions and configurations are also the same.
Nexmark has several test cases; here we pick the Q0 test for the comparison. Other tests may show similar results.
The results of the Nexmark Q0 test are shown below. The CPU usage reported by the top tool at the same load level (TPS) differs between the SMT system and Arm.
We can see that at lower workload levels, the observed Arm CPU usage is higher than the SMT system's. But after about the 50% CPU usage level, the SMT system's CPU utilization increases much faster, and under heavy load its CPU usage ends up much higher than Arm's. And at full CPU usage, Arm generates much higher TPS.
Figure 5: Flink CPU Usage Under Different TPS
We can also see it from another angle: how many TPS each system generates at the same CPU usage level. Arm performs a little worse at first, but after about 50% CPU usage, Arm becomes much better.
Figure 6: Flink TPS Under Different CPU Usage
Again, we apply the "Performance Achievement Rate" metric, here defined as TPS / CPU Usage. An ideal system should show a constant ratio for a workload. From the chart, we see that as CPU usage increases, the Performance Achievement Rate drops a lot on the SMT system, while it stays much more constant on Arm.
Figure 7: Flink Performance Achievement Rate
So, this Flink test case shows a similar result to the micro benchmark: Arm may show higher CPU usage under light loads, but beyond a certain point an SMT system's CPU usage can quickly surpass Arm's, resulting in lower performance output under high CPU usage conditions.
The CPU usage of SMT-enabled systems may be underestimated in light load situations. When estimating how much additional capacity is available for workloads we could deploy on a machine, the CPU utilization metric can be misleading. Instead, measure the performance output level at full CPU usage and leave a performance buffer based on that value.