When you deploy a Machine Learning model, you may want to know how well your neural network uses the capabilities of the hardware during inference. If you target an Arm Ethos-U55 or Ethos-U65 Machine Learning accelerator, you have to optimize your model with the Vela compiler. This blog explains how to analyze a neural network on a layer-by-layer basis, and builds on top of the blog post explaining how to use Vela. Let us define two example deep learning models and analyze both of them layer by layer.
Consider a Machine Learning model that takes as input a 284x284 image with 3 channels. The model consists of four convolutions followed by a fully connected layer at the end.
import tensorflow as tf
import numpy as np

def define_model(num_chan):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                                     input_shape=(284, 284, 3)))
    model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu', strides=2))
    model.add(tf.keras.layers.Conv2D(num_chan, (3, 3), activation='relu', strides=2))
    model.add(tf.keras.layers.Conv2D(32, (3, 3), activation='relu', strides=2))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.summary()
    return model
This is a simplified version of a model that could perform image classification; the reasoning about per-layer profiling applies to any other model as well.
We use two variants of the same model: in one variant the third convolution has 90 output channels, and in the other it has 96. The Ethos-U works on quantized input, so we also use the TFLite converter to quantize both networks to int8.
model1 = define_model(90)
model2 = define_model(96)

def rep_dataset():
    for i in range(50):
        img = np.random.rand(1, 284, 284, 3)
        yield [img.astype(np.float32)]

# Model 1
converter_quant = tf.lite.TFLiteConverter.from_keras_model(model1)
converter_quant.optimizations = [tf.lite.Optimize.DEFAULT]
converter_quant.representative_dataset = rep_dataset
converter_quant.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter_quant.inference_input_type = tf.int8
converter_quant.inference_output_type = tf.int8
tflite_model = converter_quant.convert()
open('Model1.tflite', 'wb').write(tflite_model)

# Model 2
converter_quant2 = tf.lite.TFLiteConverter.from_keras_model(model2)
converter_quant2.optimizations = [tf.lite.Optimize.DEFAULT]
converter_quant2.representative_dataset = rep_dataset
converter_quant2.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter_quant2.inference_input_type = tf.int8
converter_quant2.inference_output_type = tf.int8
tflite_model = converter_quant2.convert()
open('Model2.tflite', 'wb').write(tflite_model)
Below you can see a visualization of Model1 after it has been quantized.
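If you do not have a model viewer such as Netron at hand, you can also inspect the quantized file programmatically. The following minimal sketch uses the standard tf.lite.Interpreter API to list every tensor together with its shape, data type, and quantization parameters, which lets you confirm that the network was indeed quantized to int8:

import numpy as np
import tensorflow as tf

# Load the quantized model and list its tensors: index, name, shape,
# dtype, and the (scale, zero_point) quantization parameters.
interpreter = tf.lite.Interpreter(model_path='Model1.tflite')
interpreter.allocate_tensors()
for t in interpreter.get_tensor_details():
    scale, zero_point = t['quantization']
    print(f"{t['index']:3d}  {t['name'][:50]:<50}  {t['shape']}  "
          f"{np.dtype(t['dtype']).name}  scale={scale}  zp={zero_point}")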
Remember that the biggest optimization you can make to your model is to design the network with the operators supported by the Vela compiler. In this way, the microNPU accelerates the entirety of your workload. When your model maps fully to the Ethos-U, you can use Vela's per-layer estimates to find additional bottlenecks in the workload. Both Model1 and Model2 map fully to the Ethos-U (in other words, there are no fallbacks to the CPU), so let us compile them with per-layer estimation.
To analyze a network on a per-layer basis, you need to add the --verbose-performance CLI option to your Vela command during compilation. Let us compile Model1 from the previous paragraph:
vela Model1.tflite --accelerator-config=ethos-u65-256 --config vela.ini --memory-mode=Dedicated_Sram --system-config=Ethos_U65_High_End --verbose-performance
With the previous command, Vela maps the operators of Model1.tflite to the block-based MAC engines of an Ethos-U65 configured with 256 Multiply-Accumulate units per clock cycle, and you obtain a per-layer estimate report. Note that Vela does not have a cycle-accurate model of the microNPU under the hood, so the numbers reported by the compiler are estimates and will not exactly match the performance of your workload on silicon. Having said that, you can still use the report to identify the most compute-intensive layers and understand whether your model makes good use of the capabilities of the hardware.
Now, let us examine the per-layer report you obtain:
Performance for NPU Subgraph main_split_1
TFLite Operator   NNG Operator    SRAM Usage ( Peak%)   Op Cycles (Netwrk%) [     NPU  SRAM AC  DRAM AC  OnFlash AC  OffFlashAC ]   MAC Count (Netwrk% / Util%)  Name
CONV_2D           Conv2DBias       326336 (100.00%)      2687386 ( 43.91%) [ 1387150   159048  2687386           0           0 ]    68708736 (  9.72% /  9.99%)  sequential/conv2d/Relu;sequential/conv2d/BiasAdd;sequential/conv2d_3/BiasAdd/ReadVariableOp;sequential/conv2d/Conv2D
CONV_2D           Conv2DBias       326336 (100.00%)      1489121 ( 24.33%) [ 1489121   419904   334507           0           0 ]   361267200 ( 51.09% / 94.77%)  sequential/conv2d_1/Relu;sequential/conv2d_1/BiasAdd;sequential/conv2d_1/BiasAdd/ReadVariableOp;sequential/conv2d_1/Conv2D
CONV_2D           Conv2DBias       155280 ( 47.58%)      1566720 ( 25.60%) [ 1230143    28566  1566720           0           0 ]   246810240 ( 34.90% / 61.54%)  sequential/conv2d_2/Relu;sequential/conv2d_2/BiasAdd;sequential/conv2d_2/BiasAdd/ReadVariableOp;sequential/conv2d_2/Conv2D
CONV_2D           Conv2DBias       155280 ( 47.58%)       215668 (  3.52%) [  215668    75888    78916           0           0 ]    29963520 (  4.24% / 54.27%)  sequential/conv2d_3/Relu;sequential/conv2d_3/BiasAdd;sequential/conv2d_3/BiasAdd/ReadVariableOp;sequential/conv2d_3/Conv2D
FULLY_CONNECTED   FullyConnected        0 (  0.00%)       157491 (  2.57%) [   92730        5   157491           0           0 ]      369920 (  0.05% /  0.92%)  sequential/dense/MatMul;sequential/dense/BiasAdd
What is the meaning of each column in the report? TFLite Operator and NNG Operator are the operator names in the TFLite file and in Vela's internal graph, respectively. SRAM Usage is the SRAM consumed by the operation, with its share of the peak usage in brackets. Op Cycles is the estimated total cycle count for the operation (and its share of the whole network), broken down in brackets into NPU compute cycles and access cycles (AC) for each memory: SRAM, DRAM, on-chip flash, and off-chip flash. MAC Count is the number of multiply-accumulate operations, with its share of the network's MACs and the estimated utilization of the MAC engines. Name is the fused layer name from the TFLite file.
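The Util% figure can be reproduced from the other columns: on an Ethos-U65-256 the peak throughput is 256 MACs per clock cycle, so utilization is MAC Count divided by Op Cycles times 256. Below is a minimal sketch checking this against the Model1 report; the numbers come from the table above, while the short layer names are our own shorthand, not taken from the report.

# Reproduce the Util% column: utilization = MACs performed divided by the
# theoretical maximum of 256 MACs per cycle over the operation's cycles.
MACS_PER_CYCLE = 256  # Ethos-U65-256

layers = [  # (layer, op_cycles, mac_count) from the Model1 report above
    ('conv2d',          2687386,  68708736),
    ('conv2d_1',        1489121, 361267200),
    ('conv2d_2',        1566720, 246810240),
    ('conv2d_3',         215668,  29963520),
    ('fully_connected',  157491,    369920),
]

for layer, op_cycles, macs in layers:
    util = macs / (op_cycles * MACS_PER_CYCLE)
    print(f'{layer:<16} Util = {util:.2%}')  # 9.99%, 94.77%, 61.54%, 54.27%, 0.92%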
First, let us focus on the per-layer report for Model1.
The first convolution is estimated to consume 43% of all cycles and has a MAC utilization of nearly 10%. This is the layer expected to consume the most cycles in the whole network. The Ethos-U65 hardware with 256 MAC units works efficiently when the number of channels is a multiple of 8, as illustrated in the OFM depth column here. Note that other variants of the microNPU work efficiently with different channel counts. For example, an Ethos-U65 with 512 MAC units configured with PARALLEL_MODE = 1 works efficiently when the number of output channels is a multiple of 16. If we could increase the number of input channels of the first convolution, the utilization of the hardware would improve. However, the input is an image with 3 channels, and we cannot change the input data to something else.
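As a side note, you can check where the MAC Count figures come from using nothing but the layer shapes. A minimal sketch, assuming 'valid' padding as in our Keras model (MACs = output height x output width x output channels x kernel area x input channels):

# Sanity-check the MAC Count column from the layer shapes alone.
def conv_macs(in_hw, in_c, out_c, k=3, stride=1):
    out_hw = (in_hw - k) // stride + 1   # 'valid' padding, square inputs
    return out_hw, out_hw * out_hw * out_c * k * k * in_c

hw, macs = conv_macs(284, 3, 32)             # first convolution
print(hw, macs)                              # 282 68708736 -- matches the report
hw, macs = conv_macs(hw, 32, 64, stride=2)   # second convolution
print(hw, macs)                              # 140 361267200
hw, macs = conv_macs(hw, 64, 90, stride=2)   # third convolution of Model1
print(hw, macs)                              # 69 246810240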
The third convolution is estimated to consume 25% of all cycles and has a utilization of 61%. As per the model definition, the third convolution has 90 output channels. The Ethos-U65-256 hardware works efficiently when the number of channels in a layer is a multiple of 8. In the next paragraph we change the number of channels on that convolution to 96 and analyze the impact.
Furthermore, note that as per the Util% column of MAC Count, the Fully Connected layer has a low efficiency of less than 1%. That is expected, because Fully Connected is a highly memory-bound operation: as the table shows, the expected number of DRAM access cycles far exceeds the number of NPU compute cycles. Each weight is read from DRAM only once and used for a single multiply-accumulate; this poor data reuse is one reason modern neural network architectures rarely rely on Fully Connected layers. The Netwrk% column of Op Cycles shows that the Fully Connected layer is estimated to account for only 2% of all cycles of the inference. So, even if we somehow optimized the Fully Connected operation to make better use of the hardware, the net improvement would be minimal.
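You can make the memory-bound argument concrete with some back-of-the-envelope arithmetic. A rough sketch, derived from the layer shapes of Model1 (the last convolution produces a 34x34x32 feature map, which the Flatten turns into 36,992 inputs):

# Arithmetic intensity of the FULLY_CONNECTED layer in Model1.
flat_inputs = 34 * 34 * 32        # flattened output of the last convolution
outputs = 10
weights = flat_inputs * outputs   # 369,920 int8 weights -> 369,920 bytes from DRAM
macs = weights                    # each weight is read once and used for one MAC
print(macs, macs / weights)       # 369920 1.0 -> one MAC per byte of weights read

Note that macs matches the 369,920 in the MAC Count column of the report.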
Note also that Model1 contains a RESHAPE layer (introduced by the Flatten) before compilation, but the per-layer table does not list it. This behavior is expected: Vela removes memory-only operations such as RESHAPE by adjusting the shapes of the IFM/OFM tensors of the preceding or following operations.
Now, let us move on to the second model. We compile it with the following command.
vela Model2.tflite --accelerator-config=ethos-u65-256 --config vela.ini --memory-mode=Dedicated_Sram --system-config=Ethos_U65_High_End --verbose-performance
And obtain the per-layer report:
Performance for NPU Subgraph main_split_1
TFLite Operator   NNG Operator    SRAM Usage ( Peak%)   Op Cycles (Netwrk%) [     NPU  SRAM AC  DRAM AC  OnFlash AC  OffFlashAC ]   MAC Count (Netwrk% / Util%)  Name
CONV_2D           Conv2DBias       326352 (100.00%)      2687386 ( 43.91%) [ 1387150   159048  2687386           0           0 ]    68708736 (  9.47% /  9.99%)  sequential_1/conv2d_4/Relu;sequential_1/conv2d_4/BiasAdd;sequential_1/conv2d_4/BiasAdd/ReadVariableOp;sequential_1/conv2d_4/Conv2D
CONV_2D           Conv2DBias       326352 (100.00%)      1489121 ( 24.33%) [ 1489121   419904   334507           0           0 ]   361267200 ( 49.79% / 94.77%)  sequential_1/conv2d_5/Relu;sequential_1/conv2d_5/BiasAdd;sequential_1/conv2d_5/BiasAdd/ReadVariableOp;sequential_1/conv2d_5/Conv2D
CONV_2D           Conv2DBias       150496 ( 46.11%)      1566720 ( 25.60%) [ 1230984    28566  1566720           0           0 ]   263264256 ( 36.28% / 65.64%)  sequential_1/conv2d_6/Relu;sequential_1/conv2d_6/BiasAdd;sequential_1/conv2d_6/BiasAdd/ReadVariableOp;sequential_1/conv2d_6/Conv2D
CONV_2D           Conv2DBias       150496 ( 46.11%)       216103 (  3.53%) [  216103    75888    78916           0           0 ]    31961088 (  4.40% / 57.77%)  sequential_1/conv2d_7/Relu;sequential_1/conv2d_7/BiasAdd;sequential_1/conv2d_4/BiasAdd/ReadVariableOp;sequential_1/conv2d_7/Conv2D
FULLY_CONNECTED   FullyConnected        0 (  0.00%)       157478 (  2.57%) [   92730        5   157478           0           0 ]      369920 (  0.05% /  0.92%)  sequential_1/dense_1/MatMul;sequential_1/dense_1/BiasAdd
When using 96 channels, we improve the efficiency of the third convolution to 65%. What is more, the estimated cycle count for the third convolution remains unchanged despite the increase in the number of channels. Why is that? The convolution dimensions get rounded up to fit the 256 MACs of the microNPU, and with both 90 and 96 channels the depth is rounded up to the same multiple of 8, namely 96. In other words, we are able to process more data on the Ethos-U65-256 hardware with minimal penalty on the total cycle count. Likewise, as the output of the third convolution feeds into the last convolution, we also improve the efficiency of the fourth convolution. More generally, you can use the convolution performance table to check the optimal shape of the Feature Maps of your model.
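The rounding argument is easy to verify. A minimal sketch, assuming the channel dimension is padded up to the next multiple of 8 as described above:

import math

def padded_depth(channels, granule=8):
    # Round the channel count up to the hardware's channel granule.
    return math.ceil(channels / granule) * granule

print(padded_depth(90))   # 96 -> Model1 pads the 90 channels up to 96
print(padded_depth(96))   # 96 -> Model2 fills the padded depth completely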
In this blog, we analyzed two deep learning models on a layer-by-layer basis. We identified the layer that consumes the most cycles, and we made a change to the model that improves the overall use of the hardware. Vela's per-layer report is a powerful tool for analyzing the MAC engine utilization of every layer of a neural network. Once your model maps fully to the Ethos-U, you can gain a fine-grained understanding of the bottlenecks of your workload and make changes according to the specificities of your model.