Hello
I'm deploying a convolutional neural network on an Arm Cortex-A78-based mobile platform and using 8-bit post-training quantization to reduce model size. Inference works, but performance is lower than expected: latency per image is around 120 ms, and CPU utilization is high. I've tried using the Arm Compute Library and Neon intrinsics, but I'm unsure whether I'm fully leveraging the CPU's vectorization capabilities.
Has anyone successfully optimized quantized CNNs on Cortex-A78? Are there recommended compiler flags, threading strategies, or memory layout adjustments that significantly reduce latency? Any practical examples or benchmarks would be extremely helpful.

Thank you.

Best regards,
Mikkel Jensen
Denmark
Hi Mikkel,
Could you please share a few more details about the model you're running?
The Arm Compute Library (ACL) includes a number of highly optimized assembly kernels. For an Arm Cortex-A78, I’d recommend compiling with the following options:
arch=armv8.2-a openmp=1 cppthreads=0
Building for arch=armv8.2-a enables the dot-product kernels, which can significantly accelerate quantized convolutions. The OpenMP scheduler generally delivers better performance than cppthreads for most workloads.
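For context, the dot-product kernels rely on the SDOT/UDOT instructions introduced in Armv8.2-A, which perform a 4-way int8 multiply-accumulate into each 32-bit lane in a single instruction. ACL uses them from hand-tuned assembly, but since you mentioned Neon intrinsics, here is a minimal illustrative sketch of the same idea (my own example, not ACL code; it assumes a toolchain targeting armv8.2-a with the dotprod feature enabled):

#include <arm_neon.h>
#include <stdint.h>

// Illustrative only: dot product of two int8 arrays (n a multiple of 16)
// using vdotq_s32, which maps to one SDOT instruction per 16 input pairs.
int32_t dot_s8(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        // Multiplies 16 int8 pairs and accumulates into four int32 lanes.
        acc = vdotq_s32(acc, va, vb);
    }
    return vaddvq_s32(acc);  // horizontal sum of the four lanes
}

Without the dot-product extension, the same int8 multiply-accumulate takes several widening multiply and pairwise-add instructions, which is a large part of why the arch setting matters for quantized models.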
ACL also provides several graph-based benchmark examples, which can be built by adding benchmark_examples=1 to the build command.
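Putting the options together, a complete build command might look something like this (a sketch assuming a native Linux build directly on the device; adjust os=, build=, and the toolchain settings if you cross-compile or target Android):

scons -j8 neon=1 opencl=0 os=linux build=native arch=armv8.2-a openmp=1 cppthreads=0 benchmark_examples=1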
You can profile kernel execution times using the scheduler timer: LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark_graph_mobilenet_v2 --instruments=SCHEDULER_TIMER_MS --example_args='--target=NEON,--fast-math'
If you run into any issues or want to discuss further, feel free to open a thread on GitHub: https://github.com/ARM-software/ComputeLibrary/issues
Hope this helps.
Best, Pablo