Subsequent iterated inference cycles work slower than the very first one

Hello Ben Clark and the ARM Community:

I've run into an interesting performance issue with ARMNN running on a Raspberry Pi 4.  The model is mobilenet_v1 (alpha depth 0.75), a 128 x 128 image classification model.  I am using the latest ARMNN library, 21.05, cross-compiled for Raspberry Pi, with backends { "CpuAcc", "CpuRef" }.

I am running a loop of inferences on a canned image file.  In the initialization routine of my application, I initialize the ARMNN framework, allocate the output tensors, and save a pointer to the ARMNN runtime (armnn::IRuntime* runtime).  Then, in the inference function, which is called in a loop, I retrieve the saved runtime pointer and run the inference.

It works fine, but...  To my surprise, the very first cycle of inference works significantly faster than the subsequent iterated cycles!  With my model, the very first inference takes ~ 77 ms while the subsequent inferences take ~ 125 ms, almost twice as long!

Any idea why?
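
For anyone who wants to reproduce the numbers, here is a minimal sketch of how each cycle can be timed separately. It is not my exact application code: `time_cycles` and the `infer` callable are hypothetical names, and `infer` stands in for whatever runs one inference (in my case, a call to runtime->EnqueueWorkload(...) on the saved runtime pointer).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Times each call of `infer` on its own and returns the per-cycle
// latencies in milliseconds, so cycle 0 can be compared with the rest.
// `infer` is a hypothetical stand-in for one inference, for example a
// lambda that wraps runtime->EnqueueWorkload(...).
std::vector<double> time_cycles(const std::function<void()>& infer, int cycles) {
    assert(cycles >= 0);
    using clock = std::chrono::steady_clock;  // monotonic, safe for intervals

    std::vector<double> latencies_ms;
    latencies_ms.reserve(static_cast<std::size_t>(cycles));

    for (int i = 0; i < cycles; ++i) {
        const auto start = clock::now();
        infer();  // one inference cycle
        const auto stop = clock::now();
        latencies_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return latencies_ms;
}
```

Element 0 of the returned vector is the first-cycle latency; the remaining elements are the subsequent cycles. Timing only the inference call keeps initialization and tensor allocation out of the measurement.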

And the second interesting issue with ARMNN performance:  It runs significantly slower than the same inference with the TensorFlow Lite inference library.  Even the very first ARMNN inference cycle is almost 2 times slower than with TensorFlow Lite.  I didn't expect that...

Parents Reply
  • Hi Ben,

    Here is a link to my presentation at the Embedded Vision Summit 2020:  https://www.edge-ai-vision.com/2021/02/practical-guide-to-implementing-deep-neural-network-inferencing-at-the-edge-a-presentation-from-zebra-technologies/

    Here is another interesting observation about Arm NN:  Disabling cppthreads when building Arm Compute Library makes Arm NN inference performance much more stable from cycle to cycle, but… yet another very important but… it makes Arm NN inference even slower!  Roughly 70% slower than when cppthreads are enabled.

    Note that all these observations were made on a Raspberry Pi 4.  Arm NN and its dependencies were cross-compiled with the latest Linaro g++ compiler for armv7 Linux platforms.

    The experiments I have described here should be very easy to replicate for anyone who is interested in (and/or concerned about) Arm NN inference performance on a variety of ARM hardware platforms.  If you have any specific questions about our benchmarking methodology, I will be more than happy to share, but I can assure you that there is really nothing special there.

    I am just looking for a logical explanation for what I am observing:  (1) why is Arm NN inference performance so much worse in the subsequent cycles of inference than in the first cycle, and why does it vary so much when multithreading is enabled; (2) why does disabling multithreading make Arm NN inference performance much more stable but significantly slower; and finally, and probably most importantly, (3) why is Arm NN with the CpuAcc backend so much slower than TensorFlow Lite?
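
To put numbers on questions (1) and (2), the per-cycle latencies can be reduced to a first-cycle value plus the mean and spread of the remaining cycles. The sketch below is only an illustration of that reduction, not part of Arm NN: `CycleStats` and `summarize` are hypothetical names, and it assumes the latencies are already collected in milliseconds, one entry per cycle.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of one benchmark run (all names here are hypothetical).
struct CycleStats {
    double first_ms;        // latency of cycle 0
    double rest_mean_ms;    // mean latency of cycles 1..n-1
    double rest_stddev_ms;  // population standard deviation of cycles 1..n-1
};

// Splits the first cycle from the rest and measures how stable the
// subsequent cycles are. Requires at least two measured cycles.
CycleStats summarize(const std::vector<double>& latencies_ms) {
    assert(latencies_ms.size() >= 2);
    const std::size_t rest = latencies_ms.size() - 1;

    double sum = 0.0;
    for (std::size_t i = 1; i < latencies_ms.size(); ++i) {
        sum += latencies_ms[i];
    }
    const double mean = sum / static_cast<double>(rest);

    double squared_dev = 0.0;
    for (std::size_t i = 1; i < latencies_ms.size(); ++i) {
        const double d = latencies_ms[i] - mean;
        squared_dev += d * d;
    }
    const double stddev = std::sqrt(squared_dev / static_cast<double>(rest));

    return CycleStats{latencies_ms[0], mean, stddev};
}
```

Running this once for a build with cppthreads enabled and once for a build with cppthreads disabled gives a like-for-like comparison: the standard deviation captures the cycle-to-cycle stability, and the mean captures the overall slowdown.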
