Reassess CPU utilization on Simultaneous Multithreading Enabled Systems

Rui Chang
September 19, 2024
12 minute read time.

Arm Neoverse N-Series and V-Series processors do not implement SMT (Simultaneous Multithreading) technology. When running on an Arm Neoverse processor, every thread always has access to the full resources of the processor. This makes for more predictable execution in a cloud environment and provides stronger protection against unintended data leaks between threads.

In a processor that implements SMT technology, each physical processor is divided into two or more logical cores. These logical cores share some resources with each other. For example, a common design shares execution units used for cryptography, video, or AI processing along with other processor structures such as branch predictors, prefetchers, and caches. In an SMT system each logical core has its own registers and a program counter, enabling each logical core to execute an independent execution thread. Typical SMT implementations in the market include Intel’s Hyper-Threading and AMD’s SMT.

When comparing utilization between Arm Neoverse processors and other processor architectures that enable SMT, Arm can appear to show higher CPU usage at a similar light load level. This can give operators the impression that Arm platforms have less headroom available for expansion. In this blog post, we will explore why the "CPU Utilization" metric may be misleading in light load scenarios when comparing systems with and without SMT.

Measuring CPU utilization

Linux calculates CPU utilization based on whether a core is working on something or idle. Logical cores in SMT mode share the execution resources of their physical core. In light load situations, a logical core can run at nearly the full speed of the physical core, so the OS may show low CPU usage for each logical core. But this does not mean the physical core's load is that low, because the loads from both logical cores are added onto the same physical core.
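
As an illustration, the sketch below shows how tools such as top typically derive this number on Linux: sample the per-mode tick counters in /proc/stat twice and compute busy ticks over total ticks. This is a minimal sketch with error handling trimmed, not the implementation of any particular tool.

/* cpu_usage.c: minimal sketch of deriving CPU usage from /proc/stat.
 * Samples the aggregate "cpu" line twice, one second apart, and
 * reports busy ticks / total ticks over the interval.
 */
#include <stdio.h>
#include <unistd.h>

static void sample(long long *busy, long long *total)
{
    long long v[8] = {0};
    FILE *f = fopen("/proc/stat", "r");
    /* fields: user nice system idle iowait irq softirq steal */
    fscanf(f, "cpu %lld %lld %lld %lld %lld %lld %lld %lld",
           &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    fclose(f);
    *total = 0;
    for (int i = 0; i < 8; i++)
        *total += v[i];
    *busy = *total - (v[3] + v[4]);   /* subtract idle + iowait ticks */
}

int main(void)
{
    long long b0, t0, b1, t1;
    sample(&b0, &t0);
    sleep(1);
    sample(&b1, &t1);
    printf("CPU usage: %.1f%%\n",
           100.0 * (double)(b1 - b0) / (double)(t1 - t0));
    return 0;
}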

This sharing can make a lightly loaded SMT system look like it has more spare capacity than a system without SMT. For example, consider the following significantly simplified scenario.

  • Two logical cores, LCore0 and LCore1, run on one physical processor.
  • Each logical core runs a workload that processes incoming web requests. Each request fully occupies the CPU for 0.1 seconds, after which the core returns to waiting.
  • In a 1-second time frame, LCore0 and LCore1 each service three requests.
  • LCore0 executes for 0.3 seconds and is idle for 0.7 seconds, so it reports 30% CPU usage.
  • LCore1 does the same, so it also reports 30% CPU usage.
  • Adding the two together, the physical processor was fully occupied for 0.6 seconds, so it was 60% utilized.

Figure 1: CPU Usage Explanation Under SMT

When looking at the utilization of the logical cores in this example, an operator may conclude they have capacity to handle 14 more web requests (0.7 idle seconds on each of the two logical cores, at 0.1 seconds per request). However, the physical processor only has 0.4 idle seconds left, so there is capacity for just 4 more.

Note that this example is simplified from a realistic workload, which may see lower physical processor utilization depending on the dynamic conditions that enable sharing of physical processor resources between the logical cores. This unpredictability can be a challenge when trying to estimate how much headroom may be present on your system.

In this blog post, we will use a micro benchmark as well as a real workload to demonstrate this behavior on an SMT system.

Micro Benchmark

The micro benchmark performs many loads from a small array along with some simple math, producing a high IPC that can fully utilize the available execution units of a physical processor. To simulate different load levels, we insert sleeps of varying length into the calculation loops. Load level 1 is the lightest and 7 is the heaviest. On the SMT-enabled system, we run the program first on one logical core, and then on both logical cores of the same physical processor, for all load levels. For the non-SMT system, we use Arm Neoverse N2 and run on two cores attached to the same Arm CMN cross point. Note that we disabled vectorization when compiling, because most code in real workloads is not as vectorization-friendly as this micro benchmark. A hypothetical sketch of such a benchmark is shown below.
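
The source of the high_ipc tool is not included in this post; the following sketch only illustrates the approach, with the array size, the math, and the sleep-based duty-cycle scheme all being assumptions.

/* high_ipc.c: hypothetical sketch of the micro benchmark described above.
 * Usage: ./high_ipc <load_level 1-7> <rounds>
 * Compile with vectorization disabled, for example:
 *   gcc -O2 -fno-tree-vectorize high_ipc.c -o high_ipc
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ARRAY_SIZE 1024              /* small array: stays resident in L1 */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <load_level 1-7> <rounds>\n", argv[0]);
        return 1;
    }
    int level = atoi(argv[1]);
    long rounds = atol(argv[2]);

    static int a[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
        a[i] = i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    volatile long sum = 0;           /* volatile keeps the loop from being optimized away */
    long batch = rounds / 1000;      /* sleep between batches to set the duty cycle */
    for (long r = 0; r < rounds; r++) {
        sum += a[r % ARRAY_SIZE] * 3 + 7;    /* loads plus simple integer math */
        if (level < 7 && batch && r % batch == 0)
            usleep((7 - level) * 100);       /* lighter load level, longer sleeps */
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("Time elapsed: %f seconds\n", dt);
    return 0;
}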

Run the test

We run the micro benchmark on a popular SMT system running at 3.5 GHz and on an Arm Neoverse N2-based platform (2.7 GHz), which does not have SMT.

SMT

On our system, logical core 0 and logical core 32 share the same physical core. So we will run on core 0 and core 32. 

cr@ent-x86-15:/home/cr$ lstopo
Machine (125GB total)
  Package L#0
    NUMANode L#0 (P#0 62GB)
    L3 L#0 (24MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#34)
      L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#36)
......
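
The same pairing can also be read directly from sysfs. Here is a minimal sketch; the topology path below is standard on Linux:

/* siblings.c: print which logical CPUs share cpu0's physical core.
 * On the SMT machine above this prints "0,32"; on a non-SMT
 * Neoverse system it prints just "0".
 */
#include <stdio.h>

int main(void)
{
    char buf[64];
    FILE *f = fopen(
        "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("cpu0 siblings: %s", buf);
    if (f)
        fclose(f);
    return 0;
}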

Here is an example of running the micro benchmark on the two logical cores at the same time. The argument 1000000000 is the number of rounds of compute work, and 7 is the load level we want to achieve. We collect the elapsed time as Total Run Time and insn per cycle as IPC from the perf tool; CPU usage is taken from the top command.

# In one window

cr@ent-x86-15:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 0 ./high_ipc 7 1000000000
Time elapsed: 147.196152 seconds

 Performance counter stats for 'taskset -c 0 ./high_ipc 7 1000000000':

        147,202.33 msec task-clock                       #    1.000 CPUs utilized
               910      context-switches                 #    6.182 /sec
                 1      cpu-migrations                   #    0.007 /sec
               146      page-faults                      #    0.992 /sec
   499,711,000,190      cycles                           #    3.395 GHz
   702,978,039,409      instructions                     #    1.41  insn per cycle
    50,551,657,554      branches                         #  343.416 M/sec
         1,601,409      branch-misses                    #    0.00% of all branches
                        TopdownL1                 #      9.4 %  tma_backend_bound
                                                  #      2.0 %  tma_bad_speculation
                                                  #     19.0 %  tma_frontend_bound
                                                  #     69.6 %  tma_retiring

     147.235361847 seconds time elapsed

     147.198440000 seconds user
       0.000000000 seconds sys


# In another window

cr@ent-x86-15:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 32 ./high_ipc 7 1000000000
Time elapsed: 147.195379 seconds

 Performance counter stats for 'taskset -c 32 ./high_ipc 7 1000000000':

        147,196.87 msec task-clock                       #    1.000 CPUs utilized
                44      context-switches                 #    0.299 /sec
                 1      cpu-migrations                   #    0.007 /sec
               144      page-faults                      #    0.978 /sec
   499,711,402,451      cycles                           #    3.395 GHz
   702,986,727,897      instructions                     #    1.41  insn per cycle
    50,552,160,460      branches                         #  343.432 M/sec
         1,400,427      branch-misses                    #    0.00% of all branches
                        TopdownL1                 #      9.0 %  tma_backend_bound
                                                  #      2.0 %  tma_bad_speculation
                                                  #     23.2 %  tma_frontend_bound
                                                  #     65.8 %  tma_retiring

     147.201061034 seconds time elapsed

     147.193656000 seconds user
       0.003999000 seconds sys

Here are the results collected:

| Load Level | Total Run Time (s), LCore 0 | Total Run Time (s), LCore 32 | IPC, LCore 0 | IPC, LCore 32 | CPU Usage, LCore 0 | CPU Usage, LCore 32 |
|---|---|---|---|---|---|---|
| 1 - one core | 513 | | 2.69 | | 21% | |
| 1 - two cores | 520 | 520 | 2.52 | 2.50 | 22% | 22% |
| 2 - one core | 293 | | 2.75 | | 31% | |
| 2 - two cores | 299 | 299 | 2.56 | 2.56 | 34% | 34% |
| 3 - one core | 205 | | 2.77 | | 41% | |
| 3 - two cores | 213 | 213 | 2.53 | 2.53 | 43% | 43% |
| 4 - one core | 139 | | 2.78 | | 48% | |
| 4 - two cores | 167 | 167 | 2.08 | 2.08 | 67% | 67% |
| 5 - one core | 87 | | 2.80 | | 86% | |
| 5 - two cores | 151 | 151 | 1.50 | 1.50 | 93% | 93% |
| 6 - one core | 83 | | 2.79 | | 93% | |
| 6 - two cores | 151 | 151 | 1.46 | 1.46 | 95% | 95% |
| 7 - one core | 74 | | 2.81 | | 100% | |
| 7 - two cores | 147 | 147 | 1.41 | 1.41 | 100% | 100% |

From the data, we can see that starting around load levels 3 and 4, the combined CPU usage of LCore 0 and LCore 32 grows beyond 100%, and correspondingly the IPC for the two-core runs drops substantially.

Non-SMT

On the Neoverse N2 platform, each core is an independent physical core. We will run on cores 0 and 1.

# In one window
cr@ wls-arm-n2:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 0 ./high_ipc 7 1000000000
Time elapsed: 140.039919 seconds

 Performance counter stats for 'taskset -c 0 ./high_ipc 7 1000000000':

        140,041.55 msec task-clock                       #    1.000 CPUs utilized
               534      context-switches                 #    3.813 /sec
                 1      cpu-migrations                   #    0.007 /sec
               115      page-faults                      #    0.821 /sec
   384,908,584,592      cycles                           #    2.749 GHz                         (42.85%)
   952,810,598,431      instructions                     #    2.48  insn per cycle              (57.14%)
    50,555,565,095      branches                         #  361.004 M/sec                       (71.43%)
           451,503      branch-misses                    #    0.00% of all branches             (71.43%)
                        TopdownL1                 #      0.0 %  bad_speculation
                                                  #     46.9 %  retiring                 (57.15%)
                                                  #     27.0 %  frontend_bound           (42.86%)
                                                  #     26.2 %  backend_bound            (28.57%)

     140.044867984 seconds time elapsed

     140.042186000 seconds user
       0.000000000 seconds sys


# In another window

cr@wls-arm-n2:/home/cr/my_tools/test/high_ipc$ perf stat taskset -c 1 ./high_ipc 7 1000000000
Time elapsed: 140.028971 seconds

 Performance counter stats for 'taskset -c 1 ./high_ipc 7 1000000000':

        140,030.50 msec task-clock                       #    1.000 CPUs utilized
                38      context-switches                 #    0.271 /sec
                 1      cpu-migrations                   #    0.007 /sec
               114      page-faults                      #    0.814 /sec
   384,880,893,997      cycles                           #    2.749 GHz                         (42.86%)
   952,883,898,513      instructions                     #    2.48  insn per cycle              (57.14%)
    50,559,218,606      branches                         #  361.059 M/sec                       (71.43%)
           555,101      branch-misses                    #    0.00% of all branches             (71.43%)
                        TopdownL1                 #      0.0 %  bad_speculation
                                                  #     46.9 %  retiring                 (57.14%)
                                                  #     27.1 %  frontend_bound           (42.86%)
                                                  #     26.0 %  backend_bound            (28.57%)

     140.026846716 seconds time elapsed

     140.031190000 seconds user
       0.000000000 seconds sys

The results show that on the Arm Neoverse system, the IPC for two cores is nearly the same as for one core, and it does not drop between light and heavy load levels. This is quite different from the SMT system: the capability of each CPU stays consistent across all CPU utilization levels.

| Load Level | Total Run Time (s), Core 0 | Total Run Time (s), Core 1 | IPC, Core 0 | IPC, Core 1 | CPU Usage, Core 0 | CPU Usage, Core 1 |
|---|---|---|---|---|---|---|
| 1 - one core | 563 | | 2.47 | | 29% | |
| 1 - two cores | 563 | 563 | 2.48 | 2.47 | 29% | 29% |
| 2 - one core | 351 | | 2.48 | | 44% | |
| 2 - two cores | 351 | 351 | 2.47 | 2.48 | 43% | 43% |
| 3 - one core | 267 | | 2.47 | | 56% | |
| 3 - two cores | 267 | 267 | 2.48 | 2.48 | 55% | 55% |
| 4 - one core | 204 | | 2.47 | | 70% | |
| 4 - two cores | 203 | 203 | 2.47 | 2.47 | 71% | 71% |
| 5 - one core | 153 | | 2.47 | | 92% | |
| 5 - two cores | 152 | 152 | 2.47 | 2.47 | 92% | 92% |
| 6 - one core | 149 | | 2.47 | | 94% | |
| 6 - two cores | 149 | 149 | 2.48 | 2.48 | 94% | 94% |
| 7 - one core | 140 | | 2.47 | | 100% | |
| 7 - two cores | 140 | 140 | 2.48 | 2.48 | 100% | 100% |

Further analysis of results

Impact of two hardware threads

It is not easy to find a reliable, direct metric for how busy a physical core is, but we can infer it from other angles. Here we pick IPC to analyze, dividing each two-core result by the corresponding one-core result. For example, at load level 7 the SMT ratio is 1.41 / 2.81 ≈ 0.50. This ratio reveals the character of SMT systems.

IPC ratio, 2 cores / 1 core:

| Load Level | SMT System | Non-SMT System |
|---|---|---|
| 1 | 0.93 | 1.00 |
| 2 | 0.93 | 1.00 |
| 3 | 0.91 | 1.00 |
| 4 | 0.75 | 1.00 |
| 5 | 0.54 | 1.00 |
| 6 | 0.52 | 1.00 |
| 7 | 0.50 | 1.00 |

In the following chart, we see that under light load the IPC is almost the same for two cores as for one core on the SMT system. The physical core is not busy, so operations from both logical cores are served at nearly the full speed of the physical core, which keeps the IPC high while the reported CPU usage looks low. But the physical core is running double the load, so the real usage of the physical CPU should be roughly doubled. As the load increases, the two-core IPC drops dramatically compared to one core, because the logical cores must compete for and share the execution engine's resources. The physical core becomes fully utilized and can no longer handle all the requests from the two logical cores as promptly as under light load; the IPC eventually drops to almost half. This confirms that the two logical cores are sharing one physical core.

In contrast, the ratio for the Arm Neoverse system stays at 1.00 because the two cores are independent: they always run at the same speed as a single core, no matter what the load level is.

Figure 2: Micro Benchmark IPC Comparison

Performance vs. CPU usage

Here we define performance as 1000000000 / Total Run Time, where 1000000000 is the number of rounds we ran with the micro benchmark tool. Since the frequency of the N2 platform we use (2.7 GHz) is much lower than that of the SMT system (3.5 GHz), we scale the N2 performance data up to the same 3.5 GHz; for example, the N2 result at 100% usage, 1000000000 / 140 ≈ 7142857, adjusts to 7142857 × 3.5 / 2.7 ≈ 9259259. We do this for the two-core case, which matches the normal production use of both logical cores of a physical core.

| CPU Usage | Performance, SMT | Performance, Non-SMT | Adjusted Performance, SMT | Adjusted Performance, Non-SMT |
|---|---|---|---|---|
| 22% | 1923077 | | 1923077 | |
| 29% | | 1776199 | | 2302480 |
| 34% | 3344482 | | 3344482 | |
| 43% | 4694836 | 2849003 | 4694836 | 3693152 |
| 55% | | 3745318 | | 4855042 |
| 67% | 5988024 | | 5988024 | |
| 71% | | 4926108 | | 6385696 |
| 92% | | 6578947 | | 8528265 |
| 93% | 6622517 | | 6622517 | |
| 94% | | 6711409 | | 8699975 |
| 95% | 6622517 | | 6622517 | |
| 100% | 6802721 | 7142857 | 6802721 | 9259259 |

This gives the following chart of frequency-adjusted performance versus CPU usage. As CPU usage increases, the SMT system's performance output flattens, while Arm's performance keeps an almost perfectly linear increasing trend and finally surpasses the SMT system in the high CPU usage region. So, if you use performance measured at low CPU usage to predict performance at high CPU usage on an SMT system, you may get the wrong result.

Figure 3: Micro Benchmark Adjusted Performance

Here we define a "Performance Achievement Rate" metric as Performance / CPU Usage, using the performance data from the previous table; for example, the SMT system at 22% usage achieves 1923077 / 0.22 ≈ 8741259. A perfect system would show a constant ratio for a given workload, meaning a certain CPU usage should always yield a proportional performance output.

| CPU Usage | Achievement Rate, SMT | Achievement Rate, Non-SMT | Adjusted Rate, SMT | Adjusted Rate, Non-SMT |
|---|---|---|---|---|
| 22% | 8741259 | | 8741259 | |
| 29% | | 6124824 | | 7939587 |
| 34% | 9836711 | | 9836711 | |
| 43% | 10918223 | 6625588 | 10918223 | 8588725 |
| 55% | | 6809670 | | 8827350 |
| 67% | 8937349 | | 8937349 | |
| 71% | | 6938181 | | 8993938 |
| 92% | | 7151030 | | 9269853 |
| 93% | 7120986 | | 7120986 | |
| 94% | | 7139797 | | 9255293 |
| 95% | 7007954 | | 7007954 | |
| 100% | 6802721 | 7142857 | 6802721 | 9259259 |

From the chart, we see that as CPU usage increases, the Performance Achievement Rate drops considerably on the SMT system, while it stays much more constant on Arm.

Figure 4: Micro Benchmark Adjusted Performance Achievement Rate

Real workload

To demonstrate the issue with a well-known workload, we used Apache Flink together with Nexmark, a benchmark suite for Flink.

Test description

We created two clusters with similar hardware resources: an SMT-enabled system and an Arm Neoverse N2-based system, which does not have SMT. Software versions and configurations are also the same on both.

Hardware config:

| | Arm Neoverse N2 System | SMT System |
|---|---|---|
| CPU | Arm Neoverse N2 @ 3.0 GHz | SMT-enabled CPU @ 3.5 GHz |
| cluster | 1 master + 3 worker nodes | 1 master + 3 worker nodes |
| cores | 32 per node | 32 per node |
| memory | 128 GB | 128 GB |
| Flink version | 1.17.1 | 1.17.1 |
| Nexmark version | 0.2 release | 0.2 release |

Flink taskmanager config:

| | Arm Neoverse N2 System | SMT System |
|---|---|---|
| jobmanager.memory.process.size | 32G | 32G |
| taskmanager.memory.process.size | 13G | 13G |
| taskmanager.numberOfTaskSlots | 8 | 8 |
| parallelism.default | 192 | 192 |
| worker (taskmanager) number | 24 workers total (8 per node) | 24 workers total (8 per node) |
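
For reference, the settings above would appear in flink-conf.yaml roughly as follows. This is a sketch: the keys are standard Flink configuration options, and the lowercase memory suffixes follow Flink's size notation.

# flink-conf.yaml (excerpt, matching the table above)
jobmanager.memory.process.size: 32g
taskmanager.memory.process.size: 13g
taskmanager.numberOfTaskSlots: 8
parallelism.default: 192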

Nexmark includes several test cases; here we pick the Q0 test for the comparison. Other tests may show similar results.

Test result

The results of the Nexmark Q0 test are shown below. For the same load level (TPS), the CPU usage reported by the top tool differs between the SMT system and Arm.

| TPS (M) | CPU Usage, Arm | CPU Usage, SMT |
|---|---|---|
| 1.0 | 5.9% | 4.0% |
| 2.0 | 9.5% | 7.2% |
| 4.0 | 16.5% | 13.9% |
| 6.0 | 23.6% | 20.8% |
| 10.0 | 37.3% | 36.6% |
| 14.0 | 52.7% | 56.8% |
| 18.0 | 70.7% | 85.0% |
| 20.0 | 81.9% | 95.3% |
| 21.7 | 90.4% | |
| 23.8 | 94.2% | |

We can see that at lower load levels, the observed Arm CPU usage is higher than the SMT system's, but beyond roughly the 50% CPU usage level, the SMT system's utilization climbs much faster and eventually ends up much higher than Arm's under heavy load. And at full CPU usage, Arm generates much higher TPS.

Figure 5: Flink CPU Usage Under Different TPS

We can also look at this from another angle: how much TPS each system generates at the same CPU usage level. Arm performs slightly worse at first, but beyond about 50% CPU usage, Arm becomes much better.

Figure 6: Flink TPS Under Different CPU Usage

Again, we apply the "Performance Achievement Rate" metric, here computed as TPS / CPU Usage. A perfect system would hold this ratio constant for a given workload. From the chart, we see that as CPU usage increases, the rate drops considerably on the SMT system, while it holds up much better on Arm.

Figure 7: Flink Performance Achievement Rate

This Flink test case therefore shows a similar result to the micro benchmark. Arm may show higher CPU usage under light loads. However, beyond a certain point, an SMT system's CPU usage can quickly surpass Arm's, resulting in lower performance output under high CPU usage conditions.

Conclusion 

The CPU usage of SMT-enabled systems may be underestimated in light load situations. When estimating how much additional capacity is available for workloads we could deploy on a machine, the CPU utilization metric can be misleading. Instead, we should measure the performance output at full CPU usage and then plan a performance buffer based on that value.
