Hello to all:
I'm using an i.MX6 with a Cortex-A9. When I run neural networks (MLPs) written by me in plain C++, I see a slowdown of about 10x compared to my i7, which seems reasonable. The problem appears when I run a CNN (LFFD in my case) with TVM: then the difference between my computer and the i.MX6 is about 40x, i.e. 4 times slower than my own handwritten code. This is what I can't understand: why do I get this extra 4x when using TVM? (I have also tried TFLite with a similar outcome.)
The original model is in .onnx format, and the conversion to TVM is done with optimizations that do not cause any accuracy loss.
Does anyone know how to overcome this issue?
Thanks in advance.
It is interesting that you are seeing a difference of 4x - perhaps the inefficient implementation is not making use of the NEON instructions?
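One quick thing worth checking along these lines is the target string passed to TVM: NEON is not implied by the ARMv7 triple alone, it has to be requested explicitly via -mattr. A sketch of what a NEON-enabled Cortex-A9 target string typically looks like (this is an illustrative example, not necessarily what was used here):

```python
# Illustrative TVM target string for a Cortex-A9 class CPU.
# NEON must be requested explicitly with -mattr=+neon; the triple
# armv7a-linux-gnueabihf by itself does not guarantee vectorized kernels.
triple = "armv7a-linux-gnueabihf"
attrs = "+neon"
target = f"llvm -device=arm_cpu -mtriple={triple} -mattr={attrs}"
print(target)
```

If -mattr=+neon (or an equivalent -mcpu setting) is missing, the generated kernels fall back to scalar code, which could easily account for a several-fold slowdown.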
First of all, thank you for your answer.
In relation to the issue, I also thought about that, so I analyzed the generated libraries with the readelf command and got the following:
This is the executable, which calls the libraries whose data I will show below:
This is the library with the C++ implementation of the computer vision project:
This is the neural network converted with TVM into a binary library; it is called by the previous one:
As you can see, the application and the libraries are all built to use NEONv1, so I assume this is not the problem (unless I'm wrong - I'm far from being an expert on ARM platforms).
In fact, the neural networks are compiled targeting the Cortex-A9 with the following parameters:
Maybe you have another clue...
I would really appreciate any feedback...
I didn't mention it before, but the slow network implementation is exported from MXNet to ONNX (it works fine on Windows under OpenCV) and then optimized with TVM to run on the Cortex-A9.
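For comparison, this is roughly what the ONNX-to-TVM path looks like when cross-compiling for a Cortex-A9. The file names, input name, input shape, and target string below are placeholders and assumptions, not the actual parameters from this project:

```python
# Hypothetical sketch: compiling an ONNX model with TVM for a Cortex-A9.
# All names, shapes, and paths here are placeholders.
import onnx
import tvm
from tvm import relay

model = onnx.load("lffd.onnx")                 # placeholder model path
shape_dict = {"data": (1, 3, 480, 640)}        # placeholder input name/shape
mod, params = relay.frontend.from_onnx(model, shape_dict)

# ARMv7-A hard-float target with NEON enabled; without +neon,
# TVM generates scalar code for this CPU.
target = tvm.target.Target(
    "llvm -device=arm_cpu -mtriple=armv7a-linux-gnueabihf -mattr=+neon"
)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Cross-compiler path is an assumption; adjust to your toolchain.
lib.export_library("lffd_arm.so", cc="arm-linux-gnueabihf-g++")
```

Beyond the target string, untuned operator schedules are another common source of a several-fold gap on ARM CPUs; auto-tuning (AutoTVM/auto-scheduler) for the specific board often closes much of it.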
I have exactly the same problem. Does anyone have a suggestion?