Arm Neoverse N-Series and V-Series processors do not implement SMT (Simultaneous Multithreading) technology. When running on an Arm Neoverse processor, every thread always has access to the full resources of the processor. This makes for more predictable execution in a cloud environment, ensures every thread gets full access to processor resources, and provides stronger protection against unintended data leaks between threads.
In a processor that implements SMT technology, each physical processor is divided into two or more logical cores. These logical cores share some resources with each other. For example, a common design shares execution units used for cryptography, video, or AI processing along with other processor structures such as branch predictors, prefetchers, and caches. In an SMT system each logical core has its own registers and a program counter, enabling each logical core to execute an independent execution thread. Typical SMT implementations in the market include Intel’s Hyper-Threading and AMD’s SMT.
When comparing utilization between Arm Neoverse processors and processor architectures that enable SMT, Arm can appear to show higher CPU usage at a similar light load level. This can give operators the impression that Arm platforms have less headroom available for expansion. In this blog post, we explore why the "CPU Utilization" metric can be misleading in light load scenarios when comparing systems with and without SMT.
Linux calculates CPU utilization based on whether a core is working on something or idle. Logical cores in SMT mode share the execution resources of a physical core. Under light load, a logical core can run at nearly the full speed of the physical core, so the OS may show low CPU usage. But this does not mean the physical core's load is that low, because the loads from both logical cores land on the same physical core.
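To illustrate where this number comes from, here is a minimal sketch (not how top is actually implemented, and assuming the standard /proc/stat format documented in proc(5)) that derives utilization from the busy and idle jiffies the kernel reports. Per-core lines (cpu0, cpu1, ...) follow the same format, and the kernel accounts them per logical core, which is exactly why the metric cannot see sharing of the underlying physical core:

/* Sketch: derive CPU utilization from /proc/stat, the same per-logical-core
 * accounting that tools like top build on. Fields per proc(5): user, nice,
 * system, idle, iowait, irq, softirq, steal. */
#include <stdio.h>
#include <unistd.h>

/* Read aggregate busy and total jiffies from the first "cpu" line. */
static int read_cpu_times(unsigned long long *busy, unsigned long long *total)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal;
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) != 8) {
        fclose(f);
        return -1;
    }
    fclose(f);
    *busy  = user + nice + sys + irq + softirq + steal;
    *total = *busy + idle + iowait;
    return 0;
}

int main(void)
{
    unsigned long long busy1, total1, busy2, total2;
    if (read_cpu_times(&busy1, &total1) != 0)
        return 1;
    sleep(1); /* sample interval */
    if (read_cpu_times(&busy2, &total2) != 0)
        return 1;
    /* Utilization = share of non-idle jiffies between the two samples. */
    printf("CPU utilization: %.1f%%\n",
           100.0 * (double)(busy2 - busy1) / (double)(total2 - total1));
    return 0;
}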
As a result, a lightly loaded SMT system can look like it has more spare capacity than a system without SMT. For example, consider the following significantly simplified scenario.
Figure 1: CPU Usage Explanation Under SMT
Looking at the utilization of the logical cores in this example, an operator may conclude there is capacity to handle 14 more web requests. In reality, there is only capacity to handle 4 more.
Note that this example is simplified from a realistic workload, which may see lower physical processor utilization depending on the dynamic conditions that enable sharing of physical processor resources between the logical cores. This unpredictability can be a challenge when trying to estimate how much headroom may be present on your system.
In the rest of this post, we use a micro benchmark as well as a real workload to demonstrate this behavior on an SMT system.
The micro benchmark performs many loads from a small array plus some simple math, which produces a high IPC that can fully utilize the available execution units of a physical processor. To simulate different load levels, we add different amounts of sleep between calculation loops; load level 1 is the lightest and 7 is the heaviest. On the SMT-enabled system, we run the program first on one logical core, and then on both logical cores of the same physical processor, for all load levels. For the non-SMT system we use Arm Neoverse N2, running on two cores attached to the same Arm CMN crosspoint. Note that we disabled vectorization when compiling, because most code in real workloads is not as vectorization-friendly as this micro benchmark.
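The source of our high_ipc tool is not included here, but the following minimal sketch captures the idea described above; the array size, arithmetic, and sleep scheme are illustrative assumptions rather than the exact implementation:

/* Sketch of the micro benchmark: streaming loads from a small array that
 * stays resident in L1 cache, plus simple integer math, sustains a high IPC.
 * Sleeps between compute batches simulate load levels (1 = lightest,
 * 7 = heaviest, which never sleeps).
 * Build without vectorization, e.g.: gcc -O2 -fno-tree-vectorize */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 1024  /* small enough to stay in L1d */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <load_level 1-7> <rounds>\n", argv[0]);
        return 1;
    }
    int level = atoi(argv[1]);
    long rounds = atol(argv[2]);

    unsigned int a[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
        a[i] = i;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    volatile unsigned int sink = 0;
    for (long r = 0; r < rounds; r++) {
        unsigned int acc = 0;
        for (int i = 0; i < ARRAY_SIZE; i++)
            acc += a[i] * 3 + (a[i] >> 2);   /* loads plus simple math */
        sink = acc;

        /* Lower load levels sleep more often between compute batches. */
        if (level < 7 && r % (1L << level) == 0) {
            struct timespec ts = { 0, 100000 }; /* 100 microseconds */
            nanosleep(&ts, NULL);
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("Time elapsed: %f seconds\n",
           (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9);
    return 0;
}

Running a program like this under perf stat with taskset, as shown below, yields the Total Run Time and IPC measurements we collect.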
We run the micro benchmark on a popular SMT system running at 3.5 GHz and on an Arm Neoverse N2 based platform (2.7 GHz), which does not have SMT.
On our system, logical core 0 and logical core 32 share the same physical core (on Linux, sibling pairs can be confirmed by reading /sys/devices/system/cpu/cpu0/topology/thread_siblings_list). So we will run on cores 0 and 32.
cr@ent-x86-15:/home/cr$ lstopo
Machine (125GB total)
  Package L#0
    NUMANode L#0 (P#0 62GB)
    L3 L#0 (24MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#34)
      L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#36)
......
Here is an example output of running the micro benchmark on two cores at the same time. 1000000000 is the number of rounds of compute work done by the micro benchmark tool, and 7 is the load level we want to achieve. We collect the elapsed time as Total Run Time and insn per cycle as IPC from the perf tool; CPU usage is taken from the top command.
# In one window
cr@ent-x86-15:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 0 ./high_ipc 7 1000000000
Time elapsed: 147.196152 seconds

 Performance counter stats for 'taskset -c 0 ./high_ipc 7 1000000000':

        147,202.33 msec task-clock                #    1.000 CPUs utilized
               910      context-switches          #    6.182 /sec
                 1      cpu-migrations            #    0.007 /sec
               146      page-faults               #    0.992 /sec
   499,711,000,190      cycles                    #    3.395 GHz
   702,978,039,409      instructions              #    1.41  insn per cycle
    50,551,657,554      branches                  #  343.416 M/sec
         1,601,409      branch-misses             #    0.00% of all branches
                        TopdownL1                 #     9.4 %  tma_backend_bound
                                                  #     2.0 %  tma_bad_speculation
                                                  #    19.0 %  tma_frontend_bound
                                                  #    69.6 %  tma_retiring

     147.235361847 seconds time elapsed

     147.198440000 seconds user
       0.000000000 seconds sys

# In another window
cr@ent-x86-15:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 32 ./high_ipc 7 1000000000
Time elapsed: 147.195379 seconds

 Performance counter stats for 'taskset -c 32 ./high_ipc 7 1000000000':

        147,196.87 msec task-clock                #    1.000 CPUs utilized
                44      context-switches          #    0.299 /sec
                 1      cpu-migrations            #    0.007 /sec
               144      page-faults               #    0.978 /sec
   499,711,402,451      cycles                    #    3.395 GHz
   702,986,727,897      instructions              #    1.41  insn per cycle
    50,552,160,460      branches                  #  343.432 M/sec
         1,400,427      branch-misses             #    0.00% of all branches
                        TopdownL1                 #     9.0 %  tma_backend_bound
                                                  #     2.0 %  tma_bad_speculation
                                                  #    23.2 %  tma_frontend_bound
                                                  #    65.8 %  tma_retiring

     147.201061034 seconds time elapsed

     147.193656000 seconds user
       0.003999000 seconds sys
Here are the results collected:
From the data, we can see that starting from load levels 3 and 4, the combined CPU usage of the two logical cores gradually exceeds 100%, and correspondingly the IPC of both cores drops substantially.
On the Neoverse N2 platform, each core is an independent physical core. We will run on cores 0 and 1.
# In one window
cr@wls-arm-n2:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 0 ./high_ipc 7 1000000000
Time elapsed: 140.039919 seconds

 Performance counter stats for 'taskset -c 0 ./high_ipc 7 1000000000':

        140,041.55 msec task-clock                #    1.000 CPUs utilized
               534      context-switches          #    3.813 /sec
                 1      cpu-migrations            #    0.007 /sec
               115      page-faults               #    0.821 /sec
   384,908,584,592      cycles                    #    2.749 GHz                  (42.85%)
   952,810,598,431      instructions              #    2.48  insn per cycle      (57.14%)
    50,555,565,095      branches                  #  361.004 M/sec               (71.43%)
           451,503      branch-misses             #    0.00% of all branches     (71.43%)
                        TopdownL1                 #     0.0 %  bad_speculation
                                                  #    46.9 %  retiring          (57.15%)
                                                  #    27.0 %  frontend_bound    (42.86%)
                                                  #    26.2 %  backend_bound     (28.57%)

     140.044867984 seconds time elapsed

     140.042186000 seconds user
       0.000000000 seconds sys

# In another window
cr@wls-arm-n2:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 1 ./high_ipc 7 1000000000
Time elapsed: 140.028971 seconds

 Performance counter stats for 'taskset -c 1 ./high_ipc 7 1000000000':

        140,030.50 msec task-clock                #    1.000 CPUs utilized
                38      context-switches          #    0.271 /sec
                 1      cpu-migrations            #    0.007 /sec
               114      page-faults               #    0.814 /sec
   384,880,893,997      cycles                    #    2.749 GHz                  (42.86%)
   952,883,898,513      instructions              #    2.48  insn per cycle      (57.14%)
    50,559,218,606      branches                  #  361.059 M/sec               (71.43%)
           555,101      branch-misses             #    0.00% of all branches     (71.43%)
                        TopdownL1                 #     0.0 %  bad_speculation
                                                  #    46.9 %  retiring          (57.14%)
                                                  #    27.1 %  frontend_bound    (42.86%)
                                                  #    26.0 %  backend_bound     (28.57%)

     140.026846716 seconds time elapsed

     140.031190000 seconds user
       0.000000000 seconds sys
The results show that on the Arm Neoverse system, the IPC with two cores is nearly the same as with one core, and it does not drop from light load levels to heavy ones. This is quite different from the SMT system: the capability of each CPU stays consistent across all CPU utilization levels.
It is not easy to find a reliable direct metric for how busy a physical core is, but we can infer it from other angles. Here we pick IPC to analyze: we divide the data from the 2-core run by the data from the 1-core run. This ratio reveals the character of SMT systems.
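To make this concrete, the metric plotted below is simply:

IPC Ratio = IPC (2-core run) / IPC (1-core run)

A ratio near 1 means the second core adds no interference, as with independent physical cores; a ratio approaching 0.5 means the two logical cores are splitting the throughput of a single physical core.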
In the following chart, we see that under light load, the IPC with two cores is almost the same as with one core on the SMT system. The physical core is not busy, so operations from both logical cores are served at nearly the full speed of the physical core, which keeps the IPC high while CPU usage looks low. But the physical core is running double the load, so the real usage of the physical CPU should be doubled. As the load increases, the two-core IPC decreases dramatically compared to one core, because the two threads must compete for and share the resources of the execution engine. The physical core becomes fully utilized and can no longer handle all the requests from both cores as promptly as under light load. The IPC eventually drops to almost half, which proves the two logical cores are sharing one physical core.
In contrast, the Arm Neoverse system stays at a ratio of 1 because the two cores are independent; they always run at the same speed as a single core no matter what the load level is.
Figure 2: Micro Benchmark IPC Comparison
Here we define performance as "1000000000 / Total Run Time", where 1000000000 is the number of rounds of the micro benchmark. Since the frequency of the N2 platform we use (2.7 GHz) is much lower than the SMT system's (3.5 GHz), we adjust the performance data of the N2 platform to the same 3.5 GHz. We do this for the 2-core case, which is the normal production use case where both logical cores of a physical core are used.
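Concretely, assuming performance scales linearly with frequency (a reasonable approximation for this compute-bound micro benchmark):

Adjusted N2 Performance = (1000000000 / Total Run Time) x (3.5 / 2.7)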
This gives us the following chart of Frequency Adjusted Performance vs. CPU Usage. As CPU usage increases, the SMT system's performance output flattens, while Arm's performance output keeps a near-perfect linear trend and finally surpasses the SMT system in the high CPU usage region. So, if you use performance output in the low CPU usage region to predict performance under high CPU usage on an SMT system, you may get the wrong result.
Figure 3: Micro Benchmark Adjusted Performance
Here we define a "Performance Achievement Rate" metric as "Performance / CPU Usage", using the performance data from the previous table. An ideal system would show a constant ratio for a workload, meaning a given CPU usage always yields a proportional performance output.
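For example, with hypothetical numbers: a system that delivers a performance score of 40 at 50% CPU usage has an achievement rate of 0.8, so an ideal system should deliver a score of about 80 at 100% usage. A falling rate means each additional percent of reported CPU usage buys less performance than the last.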
From the chart, we see that as CPU usage increases, the Performance Achievement Rate drops significantly on the SMT system, while it stays much more constant on Arm.
Figure 4: Micro Benchmark Adjusted Performance Achievement Rate
To illustrate the issue with a well-known workload, we used Flink together with Nexmark, a benchmark suite for Flink.
We created two clusters with similar hardware resources: one for the SMT system and one for an Arm Neoverse N2 based system, which does not have SMT. Software versions and configurations are also the same.
Nexmark has several test cases; here we pick the Q0 test for the comparison. Other tests may show similar results.
The results of the Nexmark Q0 test are shown below. The CPU usage reported by the top tool at the same load level (TPS) differs between the SMT system and Arm.
We can see that at lower workload levels, the observed Arm CPU usage is higher than the SMT system's. But after about the 50% CPU usage level, the SMT system's CPU utilization increases much faster, and under heavy load its CPU usage ends up much higher than Arm's. And at full CPU usage, Arm generates much higher TPS.
Figure 5: Flink CPU Usage Under Different TPS
We can also see it from another angle: how many TPS each system generates at the same CPU usage level. Arm performs a little worse at first, but after about 50% CPU usage, Arm becomes much better.
Figure 6: Flink TPS Under Different CPU Usage
Again, we apply the "Performance Achievement Rate" metric, here defined as TPS / CPU Usage. An ideal system should show a constant ratio for a workload. From the chart, we see that as CPU usage increases, the Performance Achievement Rate drops a lot on the SMT system, while it stays much more constant on Arm.
Figure 7: Flink Performance Achievement Rate
So, this Flink test case shows a similar result to the micro benchmark: Arm may show higher CPU usage under light loads, but beyond a certain point an SMT system's CPU usage can quickly surpass Arm's, resulting in lower performance output under high CPU usage conditions.
The CPU usage of SMT-enabled systems may be underestimated in light load situations. When estimating how much additional capacity is available for workloads we could deploy on a machine, the CPU utilization metric can be misleading. Instead, measure the performance output level at full CPU usage and leave a performance buffer based on that value.