Hello Ben Clark and the ARM Community:
I've run into an interesting performance issue with Arm NN (running on a Raspberry Pi 4). I am using a mobilenet_v1 image classification model (depth multiplier 0.75, 128 x 128 input). I am using the latest Arm NN library, 21.05, cross-compiled for Raspberry Pi, and the backends { "CpuAcc", "CpuRef" }.
I am running repeated inference cycles on a canned image file. In my application's initialization routine I initialize the Arm NN framework, allocate the output tensors, and save a pointer to the Arm NN runtime (armnn::IRuntime* runtime). Then, in the inference function called in a loop, I retrieve the saved runtime pointer and run the inference.
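Roughly, the structure is like this (a much-simplified sketch using the Arm NN 21.05 C++ API, not my actual code; the model path and tensor names are placeholders):

    #include <armnn/ArmNN.hpp>
    #include <armnnTfLiteParser/ITfLiteParser.hpp>
    #include <chrono>
    #include <iostream>
    #include <vector>

    int main()
    {
        // One-off initialization: create the runtime and load the optimized network.
        armnn::IRuntime::CreationOptions options;
        armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);

        auto parser = armnnTfLiteParser::ITfLiteParser::Create();
        armnn::INetworkPtr network =
            parser->CreateNetworkFromBinaryFile("mobilenet_v1_0.75_128.tflite");   // placeholder path

        std::vector<armnn::BackendId> backends = { "CpuAcc", "CpuRef" };
        armnn::IOptimizedNetworkPtr optNet =
            armnn::Optimize(*network, backends, runtime->GetDeviceSpec());

        armnn::NetworkId netId;
        runtime->LoadNetwork(netId, std::move(optNet));

        // Bind input/output; the tensor names depend on the model (placeholders here).
        auto inputBinding  = parser->GetNetworkInputBindingInfo(0, "input");
        auto outputBinding = parser->GetNetworkOutputBindingInfo(0, "MobilenetV1/Predictions/Reshape_1");

        std::vector<float> inputData(inputBinding.second.GetNumElements());    // filled from the canned image
        std::vector<float> outputData(outputBinding.second.GetNumElements());

        armnn::InputTensors inputTensors
            { { inputBinding.first, armnn::ConstTensor(inputBinding.second, inputData.data()) } };
        armnn::OutputTensors outputTensors
            { { outputBinding.first, armnn::Tensor(outputBinding.second, outputData.data()) } };

        // Inference loop: only EnqueueWorkload is timed.
        for (int i = 0; i < 10; ++i)
        {
            auto t0 = std::chrono::steady_clock::now();
            runtime->EnqueueWorkload(netId, inputTensors, outputTensors);
            auto t1 = std::chrono::steady_clock::now();
            std::cout << "iteration " << i << ": "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                      << " ms\n";
        }
        return 0;
    }

The timings below are taken around EnqueueWorkload only; the parsing and Optimize() work done in the initialization routine is not included.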
It works fine, but... to my surprise, the very first inference cycle runs significantly faster than the subsequent ones! With my model, the first inference takes ~77 ms, while subsequent inferences take ~125 ms, almost twice as long!
Any idea why?
And a second interesting issue with Arm NN performance: it is significantly slower than inference with the TensorFlow Lite library. Even the very first inference cycle with Arm NN is almost twice as slow as with TensorFlow Lite. I didn't expect that...
Hi Ben,
Thanks for your response. You are certainly right about the upsides and downsides of multi-threading, generally speaking. But the Arm NN inference engine could set the highest priority for its compute threads -- at least, that's what I would expect. If Arm NN is to be used in [near-] real-time or otherwise time-critical applications, would that not make sense?
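Just to illustrate what I mean -- this is plain Linux/pthread scheduling, nothing Arm NN-specific, and it needs root or CAP_SYS_NICE -- an application could promote a thread roughly like this:

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Ask the scheduler for SCHED_FIFO at maximum priority for the calling thread.
    // Requires CAP_SYS_NICE (or root); prints a warning if the request is refused.
    void RaiseCurrentThreadPriority()
    {
        sched_param param{};
        param.sched_priority = sched_get_priority_max(SCHED_FIFO);
        if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) != 0)
        {
            std::fprintf(stderr, "could not switch to SCHED_FIFO\n");
        }
    }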
But also... I believe the TensorFlow Lite inference engine runs single-threaded, and yet it performs much faster, at least on the mobilenet-based models I've tried so far.
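For reference, a minimal TF Lite setup for the same model looks roughly like this (a sketch; the model path is a placeholder) -- SetNumThreads(1) pins it to one thread so the comparison is like-for-like:

    #include <memory>
    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    // Run one inference with TF Lite pinned to a single thread.
    void RunTfLiteOnce()
    {
        auto model = tflite::FlatBufferModel::BuildFromFile("mobilenet_v1_0.75_128.tflite");  // placeholder path
        tflite::ops::builtin::BuiltinOpResolver resolver;
        std::unique_ptr<tflite::Interpreter> interpreter;
        tflite::InterpreterBuilder(*model, resolver)(&interpreter);
        interpreter->SetNumThreads(1);    // force single-threaded execution
        interpreter->AllocateTensors();
        // ... fill interpreter->typed_input_tensor<float>(0) with the preprocessed image ...
        interpreter->Invoke();
    }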
I think I have a lead!
The Raspberry Pi 4 has a 64-bit Cortex-A72 CPU, but from a quick search, the default OSes seem to be 32-bit and only report the Armv7 architecture, not Armv8.
The Arm NN team tell me the Arm Compute Library is highly optimised for aarch64, but not so much for aarch32, and definitely not for Armv7.
For Armv7, it is not a surprise that TF Lite will be faster on some workloads. They are still surprised that the second inference is slower than the first, though. It should be the other way around - much more so on GPU, but the first inference should be a little slower on CPU as well.
So, first recommendation: check your OS (uname -m will report armv7l on a 32-bit OS and aarch64 on a 64-bit one), and probably upgrade it. Ubuntu (19.04+) and Debian have 64-bit Raspberry Pi 4-compatible versions. Then target aarch64; that should give a significant improvement. Is that possible?
Thanks once again for your quick response.
I am very well aware of the OS that runs on my Raspberry Pi -- yes, it is Armv7, as I mentioned in my previous messages. And no, at this point I am not interested in experimenting with an aarch64 OS on the Raspberry Pi.
My current goal is to evaluate the performance of the Arm Compute Library, and by extension the Arm NN inference engine, on 32-bit hardware platforms, both Linux-based and, even more importantly for me, bare-metal. I've been quite disappointed so far, as you can tell. I am surprised that the Arm NN team seems to have overlooked the importance of performance optimization for 32-bit architectures.
Hi,
Armv7 is over 10 years old now (Armv9 is out!), so the Arm NN team need to focus their resources on more recent technologies. Yes, it's a shame that they're not getting to it, but they can't do everything.
Cheers,
Ben