Improving Inference Performance of a Quantized CNN on Cortex-A78

Hello

I’m deploying a convolutional neural network on an ARM Cortex-A78-based mobile platform and using 8-bit post-training quantization to reduce model size. Inference works, but performance is lower than expected — latency per image is around 120 ms, and CPU utilization is high. I’ve tried using ARM Compute Library and Neon intrinsics, but I’m unsure if I’m fully leveraging the CPU’s vectorization capabilities.

Has anyone successfully optimized quantized CNNs on Cortex-A78? Are there recommended compiler flags, threading strategies, or memory layout adjustments that significantly reduce latency? Any practical examples or benchmarks would be extremely helpful.

Thank you

Med venlig hilsen,
Mikkel Jensen
Denmark

0 Ben Clark 3 months ago

Hi Mikkel,

We have optimized/quantized for cortex-A including A78. We probably need a bit more info to know what the issue is / what can be done. Is it Dynamic or static quantization? Do you know layer timings at all - is it the convolution layers or another layer that isn't getting sped up? What was speed before quantization? Any pruning? What size images? etc etc. But I'll point a couple of ACL people at this thread as well, they'll have better specifics than me if you have more detail. If there's parts you cannot share one a public forum we can provide an email address etc. Or you can ask more on Arm Developer discord if you join the Arm Developer program too.
Cancel
Vote up 0 Vote down

Reply

Accept answer

Cancel
0 Pablo Márquez 3 months ago

Hi Mikkel,

Could you please share a few more details about the model you're running?

The Arm Compute Library (ACL) includes a number of highly optimized assembly kernels. For an Arm Cortex-A78, I’d recommend compiling with the following options:

arch=armv8.2-a openmp=1 cppthreads=0

Building for arch=armv8.2-a enables the dot-product kernels, which can significantly accelerate quantized convolutions. The OpenMP scheduler generally delivers better performance than cppthreads for most workloads.

ACL also provides several graph-based benchmark examples that can be built using:
benchmark_examples=1

You can profile kernel execution times using the scheduler timer:
LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--target=NEON,--fast-math'

If you run into any issues or want to discuss further, feel free to open a thread on GitHub:
https://github.com/ARM-software/ComputeLibrary/issues

Hope this helps.

Best,
Pablo
Cancel
Vote up 0 Vote down

Reply

Accept answer

Cancel