Hi Experts,
We are trying to port a people-counting application to the ZCU104 platform, where we want to offload the ML part to the FPGA and run the pre/post-processing modules on the ARM CPU cores. When we run the application, we see that the pre/post-processing modules take a lot of time, so we want to implement them using Neon intrinsics.
Here we see an issue: when we compile the float code and the Neon code with the -O3 flag, we see the same latency numbers.
Can you please suggest any tips, or how to analyse this further?
Thanks and Regards,
Raju
Hi Raju,
You probably need to share a bit more about what you're getting, how you're compiling, etc. Also, what CPU cores are in the platform?
As a starting point try: https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/arm-neon-optimization which mentions Neon latency issues.
The clang compiler is very mature now for Neon, so a naïve intrinsics implementation is often worse than what the compiler manages on its own. It is important to lay out the data and loops in a way that parallelises easily, to maximise SIMD utilisation: this can improve the non-intrinsics performance as well as enabling better intrinsics optimisation.
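For example, just as a sketch (not from your code; I'm assuming a simple scale-and-offset decode over contiguous float arrays), a loop shaped like this auto-vectorises well under -O3, and a hand-written intrinsics equivalent often ends up with much the same latency:

#include <arm_neon.h>
#include <cstddef>

// Plain C++: contiguous float arrays (structure-of-arrays), no branches in the
// loop body - with -O3 the compiler can auto-vectorise this on its own.
void decode_scalar(const float* in, const float* scale, const float* offset,
                   float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale[i] + offset[i];
}

// Hand-written Neon equivalent: 4 floats per iteration with a fused
// multiply-add. On a loop this simple, expect roughly the same numbers as the
// auto-vectorised build above (tail handling for n not a multiple of 4 is
// omitted for brevity).
void decode_neon(const float* in, const float* scale, const float* offset,
                 float* out, std::size_t n) {
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(in + i);
        float32x4_t s = vld1q_f32(scale + i);
        float32x4_t o = vld1q_f32(offset + i);
        vst1q_f32(out + i, vfmaq_f32(o, v, s));  // o + v * s
    }
}

The real win usually comes from restructuring the data (e.g. separate arrays per box coordinate rather than an array of box structs) so that both the compiler and your intrinsics can work on full vectors, rather than from the intrinsics themselves.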
Cheers,
Ben
Hi Ben,
Thanks a lot for your time and your reply.
I am looking into the optimisation link you shared and will update you further.
how you're compiling?
g++ -O3 decode.cpp -o decode
decode.cpp is a C++ floating-point implementation where I decode the bounding boxes.
But what are the results that are causing you concern? And what CPU are you targeting/testing on?
Have you got any code snippets to give us more of an idea about the example?
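In the meantime, one quick check (assuming GCC here, since that's what you're compiling with) is to ask the compiler for its vectorisation report, e.g.:

g++ -O3 -fopt-info-vec-optimized decode.cpp -o decode

If that report (or the assembly from g++ -O3 -S decode.cpp) shows the float loop already using the SIMD registers, e.g. fmla on the v registers, then -O3 has auto-vectorised the scalar code, which would explain why the intrinsics version shows the same latency.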