Float vs Neon (with O3)

RajuK over 4 years ago

Hi Experts,

We are porting a people-counting application to the ZCU104 platform. We want to offload the ML part to the FPGA and run the pre/post-processing modules on the Arm CPU cores. When we run the application, the pre/post-processing modules take a lot of time, so we want to implement them with Neon intrinsics.

This is where we see an issue: when we compile the float code and the Neon code with the -O3 flag, we get the same latency numbers for both.
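To make the comparison concrete, here is a minimal sketch of what "float code" and "Neon code" for the same operation might look like. The operation (scale and offset) and both function names are hypothetical, since the thread does not show the real pre/post-processing code. At -O3, GCC's auto-vectoriser is enabled, so on a target where Neon is available the plain loop may already be compiled to SIMD instructions, which is one possible reason for identical timings.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Plain float version: out[i] = in[i] * scale + offset.
// (Hypothetical stand-in for a pre/post-processing step.)
inline void scale_offset_scalar(const float* in, float* out, std::size_t n,
                                float scale, float offset) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * scale + offset;
    }
}

// Same operation written with Neon intrinsics, four floats per iteration.
// When not building for Arm with Neon, only the scalar tail loop runs.
inline void scale_offset_neon(const float* in, float* out, std::size_t n,
                              float scale, float offset) {
    assert(in != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t voffset = vdupq_n_f32(offset);       // broadcast offset
    for (; i + 4 <= n; i += 4) {
        const float32x4_t v = vld1q_f32(in + i);            // load 4 floats
        const float32x4_t scaled = vmulq_n_f32(v, scale);   // v * scale
        vst1q_f32(out + i, vaddq_f32(scaled, voffset));     // + offset, store
    }
#endif
    // Remaining elements (or everything, on a non-Neon build).
    for (; i < n; ++i) {
        out[i] = in[i] * scale + offset;
    }
}
```

Both functions should produce the same output, so any difference worth measuring is in the generated instructions, not in the results.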

Could you please suggest any tips, or how to analyse this further?

Thanks and Regards,

Raju

Ben Clark over 4 years ago

Hi Raju,

You probably need to share a bit more on what results you're getting, how you're compiling, etc. Also, which CPU cores are in the platform?

As a starting point try: https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/arm-neon-optimization which mentions Neon latency issues.

The clang compiler is very mature now for Neon, so a naïve intrinsics implementation is often worse than what the compiler manages. It is important to lay out the data and loops in a way that parallelises easily to maximise SIMD utilization - this can improve the non-intrinsics performance, as well as giving you a better starting point for intrinsics optimization.
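As an illustration of a layout that parallelises easily, here is a minimal sketch using a structure-of-arrays layout for bounding boxes. The struct names, the function name, and the centre/size to corner arithmetic are all assumptions made for the example; the actual decode step is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays layout: one contiguous array per field,
// so element i of every array belongs to box i.
struct BoxesSoA {
    std::vector<float> cx, cy, w, h;            // centre and size
};

struct CornersSoA {
    std::vector<float> xmin, ymin, xmax, ymax;  // corner coordinates
};

// Converts centre/size boxes to corner form. The arithmetic is an
// illustrative assumption, not the decode step from the application.
inline CornersSoA decode_boxes(const BoxesSoA& in) {
    const std::size_t n = in.cx.size();
    assert(in.cy.size() == n && in.w.size() == n && in.h.size() == n);

    CornersSoA out;
    out.xmin.resize(n);
    out.ymin.resize(n);
    out.xmax.resize(n);
    out.ymax.resize(n);

    // Each iteration is independent and reads/writes consecutive floats,
    // which is the access pattern compilers tend to vectorise well.
    for (std::size_t i = 0; i < n; ++i) {
        const float half_w = 0.5f * in.w[i];
        const float half_h = 0.5f * in.h[i];
        out.xmin[i] = in.cx[i] - half_w;
        out.ymin[i] = in.cy[i] - half_h;
        out.xmax[i] = in.cx[i] + half_w;
        out.ymax[i] = in.cy[i] + half_h;
    }
    return out;
}
```

With an array-of-structs layout (one struct per box), the same loop would touch interleaved fields, which usually needs extra shuffles before SIMD lanes can be filled; that is the main reason for preferring separate arrays in this sketch.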

Cheers,

Ben

RajuK over 4 years ago in reply to Ben Clark

Hi Ben,

Thanks a lot for your time and your reply.

I am looking into the optimisation link you shared and will update you further on this.

Regarding "how you're compiling":

g++ -O3 decode.cpp -o decode

decode.cpp is a C++ floating-point implementation that decodes the bounding boxes.
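One way to analyse this further is to time each variant in isolation over many runs. The following is a minimal sketch of such a harness; `median_runtime_us` and the `scale_and_offset` workload are hypothetical names introduced for the example, not part of decode.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs fn() `repeats` times and returns the median wall-clock time in
// microseconds (the upper median when `repeats` is even). The median is
// less sensitive to occasional scheduler noise than the mean.
template <typename Fn>
double median_runtime_us(Fn&& fn, int repeats) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Hypothetical stand-in workload; replace with the real decode call.
inline void scale_and_offset(const std::vector<float>& in,
                             std::vector<float>& out,
                             float scale, float offset) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * scale + offset;
    }
}
```

A call such as `median_runtime_us([&] { scale_and_offset(in, out, 3.0f, 1.0f); }, 100)` gives one number per variant. It may also help to build the float version twice with GCC, once with `-O3` and once with `-O3 -fno-tree-vectorize`, and to add `-fopt-info-vec` so the compiler reports which loops it vectorised. If the two builds differ in speed, the "float" code was most likely already being vectorised at `-O3`.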

Thanks and Regards,

Raju

Ben Clark over 4 years ago in reply to RajuK

But what are the results that are causing you concern? And what CPU are you targeting/testing on?

Have you got any code snippets to give us more of an idea about the example?
