Improving Inference Performance of a Quantized CNN on Cortex-A78

Hello

I’m deploying a convolutional neural network on an ARM Cortex-A78-based mobile platform and using 8-bit post-training quantization to reduce model size. Inference works, but performance is lower than expected: latency is around 120 ms per image and CPU utilization stays high. I’ve tried the Arm Compute Library and hand-written Neon intrinsics, but I’m unsure whether I’m fully leveraging the CPU’s vectorization capabilities.
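
For context, here’s a stripped-down sketch of the kind of int8 inner loop I’m writing with Neon intrinsics (the function is a simplified placeholder, not my actual kernel). It relies on the ARMv8.2 dot-product extension, which Cortex-A78 implements, so it needs something like `-march=armv8.2-a+dotprod` to compile:

```cpp
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Simplified int8 dot-product kernel, roughly the shape of my
// convolution inner loop. Requires -march=armv8.2-a+dotprod.
int32_t dot_s8(const int8_t* a, const int8_t* b, size_t n) {
    int32x4_t acc = vdupq_n_s32(0);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        // SDOT: four 4-element int8 dot products, accumulated
        // into the four int32 lanes of acc.
        acc = vdotq_s32(acc, va, vb);
    }
    int32_t sum = vaddvq_s32(acc);  // horizontal add across lanes
    for (; i < n; ++i)              // scalar tail
        sum += int32_t(a[i]) * int32_t(b[i]);
    return sum;
}
```

Is this the right pattern for this core, or should I be relying on the library’s own int8 GEMM kernels rather than rolling my own?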

Has anyone successfully optimized quantized CNNs on Cortex-A78? Are there recommended compiler flags, threading strategies, or memory layout adjustments that significantly reduce latency? Any practical examples or benchmarks would be extremely helpful.
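
To make the threading side concrete, here’s roughly what I’m doing now: capping the Arm Compute Library scheduler’s thread count and pinning threads to the big cores. The core IDs (4–7) are a guess for my SoC’s topology and the helper is just illustrative; if I’m misusing the scheduler API, corrections are welcome:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for CPU_SET / pthread_setaffinity_np on glibc
#endif
#include <pthread.h>
#include <sched.h>
#include "arm_compute/runtime/Scheduler.h"

// Pin the calling thread to one core. I'm assuming the Cortex-A78
// (big) cores are IDs 4-7 on my SoC; that is platform-dependent.
static bool pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    // Keep the scheduler on as many threads as there are big cores,
    // so work doesn't spill onto the little cores.
    arm_compute::Scheduler::get().set_num_threads(4);
    pin_to_core(4);  // main thread on the first big core
    // ... build and run the quantized network as usual ...
    return 0;
}
```

Does explicit pinning like this actually help on big.LITTLE parts, or is it better left to the scheduler?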

Thank you

Kind regards,
Mikkel Jensen
Denmark