Subsequent iterated inference cycles work slower than the very first one

Hello Ben Clark and the ARM Community:

I've run into an interesting performance issue with ARMNN (running it on Raspberry Pi 4).  I am using a mobilenet_v1 (alpha depth 0.75) 128 x 128 image classification model.  I am using the latest ARMNN library, 21.05, cross-compiled for Raspberry Pi, with backends{ "CpuAcc", "CpuRef" }.

I am running a cycle of inferences on a canned image file.  In the initialization routine of my application, I initialize the ARMNN framework, allocate output tensors and save a pointer to the ARMNN runtime (armnn::IRuntime* runtime).  Then in the inference function, which is called in a loop, I retrieve the saved pointer to the runtime and run the inference.

It works fine, but...  To my surprise, the very first cycle of inference works significantly faster than the subsequent iterated cycles!  With my model, the very first inference takes ~ 77 ms while the subsequent inferences take ~ 125 ms, almost twice as long!

Any idea why?
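
For anyone who wants to reproduce the measurement, a minimal sketch of such a timing loop might look like the following. It is not the benchmarking application described in this thread; the names `LatencyReport` and `MeasureLatencies` are hypothetical, and `run_once` is a placeholder for whatever call runs one inference on the saved runtime pointer. It times each iteration separately so the first call can be compared against the later ones.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Per-iteration wall-clock latencies, in milliseconds.
// (Hypothetical helper type, not part of Arm NN.)
struct LatencyReport {
    double first_ms;             // latency of the very first call
    double steady_mean_ms;       // mean latency of all later calls
    std::vector<double> all_ms;  // every individual measurement
};

// Calls `run_once` `iterations` times and times each call on its own.
// `run_once` stands in for one inference; it is an assumption of this sketch.
inline LatencyReport MeasureLatencies(const std::function<void()>& run_once,
                                      std::size_t iterations) {
    assert(iterations >= 2 && "need a first call plus at least one later call");
    using Clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes

    LatencyReport report{};
    report.all_ms.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = Clock::now();
        run_once();
        const auto stop = Clock::now();
        report.all_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }

    report.first_ms = report.all_ms.front();
    // Average everything except the first measurement.
    report.steady_mean_ms =
        std::accumulate(report.all_ms.begin() + 1, report.all_ms.end(), 0.0) /
        static_cast<double>(iterations - 1);
    return report;
}
```

Passing a lambda that wraps the inference call, with a few dozen iterations, gives `first_ms` and `steady_mean_ms` values that can be compared directly with the ~ 77 ms and ~ 125 ms figures quoted above. Keeping `all_ms` also makes it possible to see whether the slowdown appears at the second call or only builds up gradually.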

And the second interesting issue with ARMNN performance:  It works significantly slower than inference using the TensorFlow Lite inference library.  Even the very first cycle of inference with ARMNN is almost 2 times slower than with TensorFlow Lite.  I didn't expect that...

  • I can't publicly share the source code of our application, sorry.  This application is a benchmarking application that we use to evaluate performance of various inference engines at the edge, such as aforementioned TensorFlow Lite, as well as TensorRT, EdgeTPU, etc.  This application was a part of my presentation at the Embedded Vision Summit last year.  I just integrated the ARMNN inference engine in this application.  But this loop should be really easy to replicate.

    I am quite surprised that this benchmarking doesn't seem to have been done before.  I am particularly surprised because ARM promotes its ARMNN delegate for TensorFlow Lite.  Does that make sense if ARMNN is actually slower than TensorFlow Lite itself?  I don't get it, do you?

  • Is there anything you could share to developer at arm dot com? 

    Out of interest which was your presentation at Embedded Vision? & what's the benchmarking application? This will all be very interesting to us! Pavel has done some work looking at ArmNN and TFLite performance, so might have some tips for you if we can work out how things are being called etc.  Subsequent inferences shouldn't take longer than first inferences, so sounds like something's amiss at any rate.

  • Hi Ben,

    Here is a link to my presentation at the Embedded Vision Summit 2020:  https://www.edge-ai-vision.com/2021/02/practical-guide-to-implementing-deep-neural-network-inferencing-at-the-edge-a-presentation-from-zebra-technologies/

    Here is another interesting observation about Arm NN:  Disabling cppthreads when building Arm Compute Library makes Arm NN inference performance much more stable from cycle to cycle, but… yet another very important but… it makes Arm NN inference even slower!  About ~ 70% slower compared to when cppthreads are enabled.

    Note that all these observations are made on Raspberry Pi 4.  Arm NN and its dependencies were cross-compiled with the latest Linaro g++ compiler for armv7 Linux platforms.

    The experiments I have described here should be very easy to replicate by anyone who is interested in (and/or concerned about) Arm NN inference performance on a variety of ARM hardware platforms.  If you have any specific questions about our benchmarking methodology, I will be more than happy to share, but I can assure you that there is really nothing special there.

    I am just looking for some logical explanation for what I am observing:  (1) why Arm NN inference performance is so much worse in the subsequent cycles of inference compared to the first cycle, and why it varies so much when multithreading is enabled; (2) why disabling multithreading makes Arm NN inference performance much more stable but significantly slower; and finally, and probably most importantly, (3) why Arm NN with the CpuAcc backend is so much slower than TensorFlow Lite.

  • Hi,

    I can answer the middle of the 3 questions, but I'm going to have to get the ArmNN team to weigh in on the others.

    If a process is single-threaded it can generally deliver more consistent but slower performance: it doesn't have to contend with other threads for control in the same way, but it also only utilises one core.  Multi-threading enables better usage of the processor as a whole, but switching & contention between threads can mean performance isn't as consistent, as it's a matter of which thread gets which core, and when.

    NN inference can slow down over time due to thermal throttling, but that wouldn't be by the second cycle, which confuses me.  So I'm asking our ArmNN team for some more ideas about what could be going wrong.

  • Hi Ben,

    Thanks for your response.  You are certainly right about the upsides and downsides of multi-threading, generally speaking.  But the Arm NN inference engine could set the highest priority for its computing threads -- at least, that's what I would expect.  If Arm NN were to be used in [near-] real-time or otherwise time-critical applications, would that not make sense?

    But also...  I believe the TensorFlow Lite inference engine is single-threaded, and yet it performs so much faster, at least on the mobilenet-based models I've tried so far.

  • I think I have a lead!

    Raspberry Pi 4 has a 64-bit A72 CPU, but from a quick search, default OSes seem to be 32-bit and only detect the Armv7 architecture, not Armv8.

    The ArmNN team tell me the Arm Compute Library is highly optimised for aarch64, but not so much for aarch32 and definitely not for Armv7.

    For Armv7, it is not a surprise that TF Lite will be faster for some workloads.  They are still surprised that the second inference is slower than the first.  It should be the other way around: the effect is much more pronounced on GPU, but the first inference should be a little slower on CPU as well.

    So first recommendation: check your OS, and probably upgrade it.  Ubuntu (19.04+) and Debian have 64-bit Raspberry Pi 4 compatible versions.  Then target aarch64; that should give a significant improvement.  Is that possible?
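
One way to carry out the "check your OS" step from inside the application itself is sketched below. It assumes a GCC or Clang toolchain (for the `__aarch64__` and `__arm__` predefined macros) and a POSIX system (for `uname()`); the function names `CompiledTarget` and `KernelMachine` are hypothetical. The first function reports what the binary was built for, the second what the running kernel reports, since a 32-bit binary can also run on a 64-bit kernel.

```cpp
#include <cassert>
#include <string>
#include <sys/utsname.h>  // POSIX uname()

// Architecture this binary was compiled for, based on the predefined
// macros of GCC and Clang. (Hypothetical helper name.)
inline std::string CompiledTarget() {
#if defined(__aarch64__)
    return "aarch64 (64-bit)";
#elif defined(__arm__)
    return "arm (32-bit, e.g. Armv7)";
#else
    return "not an Arm target";
#endif
}

// Machine name reported by the running kernel. On a Raspberry Pi this is
// typically a string such as "armv7l" or "aarch64". (Hypothetical helper name.)
inline std::string KernelMachine() {
    utsname info{};
    if (uname(&info) < 0) {
        return "unknown";  // uname() failed; nothing reliable to report
    }
    assert(info.machine[0] != '\0');  // the field is filled in on success
    return info.machine;
}
```

Printing both values at start-up makes it clear whether a benchmark result comes from an Armv7 build on a 32-bit OS, a 32-bit build on a 64-bit kernel, or a native aarch64 build.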

  • Hi Ben,

    Thanks once again for your quick response.

    I am very well aware of the OS that runs on the Raspberry Pi -- yes, it is armv7, as I myself had mentioned to you in my previous messages.  And no, at this point I am not interested in experimenting with an aarch64 OS on Raspberry Pi.

    My current goal is to evaluate performance of the ARM Compute Library, and by extension the ARM NN inference engine, on 32-bit hardware platforms, both Linux-based and, even more importantly for me, bare-metal.  I've been quite disappointed so far, as you can tell.  I am surprised that the ARM NN team seems to have overlooked the importance of performance optimization for the 32-bit architectures.

  • Hi,

    Arm v7 is >10 years old now (Arm v9 is out!), so the Arm NN team need to focus their resources on more recent technologies. Yes, it's a shame that they're not getting to it, but they can't do everything.

    Cheers,

    Ben