Hi Experts,
We are trying to port a people-counting application to the ZCU104 platform. We want to offload the ML part to the FPGA and run the pre- and post-processing modules on the Arm CPU cores. When we run the application, we see that the pre- and post-processing modules take a lot of time, so we want to implement them using NEON intrinsics.
Here is the issue: when we compile the float code and the NEON code with the -O3 flag, we see the same latency numbers.
Could you please suggest any tips, or how to analyse this further?
Thanks and Regards,
Raju
Hi Ben,
Thanks a lot for your time and your reply.
I am looking into the optimisation link you shared and will update you further on this.
How are you compiling?
g++ -O3 decode.cpp -o decode
Decode is a C++ floating-point implementation in which I decode the bounding box.
But what are the results that are causing you concern? And what CPU are you targeting/testing on?
Have you got any code snippets to give us more of an idea about the example?
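For illustration only, here is a minimal sketch of the kind of scalar versus NEON comparison being discussed. It is not the actual decode.cpp: the function names `decode_scalar` and `decode_neon` and the `raw * scale + offset` formula are hypothetical stand-ins for one bounding-box decode step. One possible reason for identical latency is that GCC enables auto-vectorization at -O3, so the plain float loop may already be compiled to NEON instructions. Rebuilding the scalar version with -fno-tree-vectorize might help confirm or rule that out.

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical stand-in for one decode step: out[i] = raw[i] * scale + offset.
// At -O3, GCC may auto-vectorize this loop on its own.
void decode_scalar(const float* raw, float* out, std::size_t n,
                   float scale, float offset) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = raw[i] * scale + offset;
}

// Hand-written NEON version of the same step, four floats per iteration.
// Falls back to the scalar loop when NEON is not available.
void decode_neon(const float* raw, float* out, std::size_t n,
                 float scale, float offset) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t vscale = vdupq_n_f32(scale);
    const float32x4_t voffset = vdupq_n_f32(offset);
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(raw + i);   // load 4 floats
        v = vmlaq_f32(voffset, v, vscale);    // voffset + v * vscale
        vst1q_f32(out + i, v);                // store 4 floats
    }
#endif
    // Scalar tail (or the whole array when NEON is unavailable).
    for (; i < n; ++i)
        out[i] = raw[i] * scale + offset;
}
```

If both functions show the same timing at -O3 but the scalar one slows down with -fno-tree-vectorize, the compiler was most likely already vectorizing the float code.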