Float vs Neon (with -O3)

We are trying to port a people-counting application to the ZCU104 platform. We want to offload the ML part to the FPGA and run the pre/post-processing modules on the ARM CPU cores. When we ran the application, we saw that the pre/post-processing modules were taking a lot of time, so we decided to reimplement them using Neon intrinsics.

Here is the issue: when we compile the plain float code and the Neon code with the -O3 flag, we see the same latency numbers for both.
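To make the comparison concrete, here is a minimal sketch of the kind of float vs Neon pair being described. It is not our actual code: the function names and the scale-and-offset operation are hypothetical stand-ins for a pre-processing step. One thing that may be worth checking, assuming a GCC-style toolchain, is whether the compiler has already auto-vectorized the plain float loop at -O3, in which case both versions could end up as very similar machine code.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical pre-processing step: out[i] = in[i] * scale + offset. */
static void scale_offset_scalar(const float *in, float *out, size_t n,
                                float scale, float offset)
{
    /* Plain float loop; at -O3 the compiler may vectorize this itself. */
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale + offset;
}

/* Same operation written with Neon intrinsics, 4 floats per iteration. */
static void scale_offset_neon(const float *in, float *out, size_t n,
                              float scale, float offset)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t vscale = vdupq_n_f32(scale);
    const float32x4_t voffset = vdupq_n_f32(offset);
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(in + i);
        v = vmlaq_f32(voffset, v, vscale); /* voffset + v * vscale */
        vst1q_f32(out + i, v);
    }
#endif
    /* Scalar tail; also the whole loop when Neon is not available. */
    for (; i < n; ++i)
        out[i] = in[i] * scale + offset;
}

/* Sanity check before timing: both versions must produce the same output.
 * Returns 1 on success; aborts via assert on the first mismatch. */
static int check_same_result(const float *in, float *a, float *b, size_t n,
                             float scale, float offset)
{
    scale_offset_scalar(in, a, n, scale, offset);
    scale_offset_neon(in, b, n, scale, offset);
    for (size_t i = 0; i < n; ++i)
        assert(a[i] == b[i]);
    return 1;
}
```

With GCC, compiling the float version with -O3 -fopt-info-vec should report which loops were vectorized, and rebuilding it with -fno-tree-vectorize gives a non-vectorized baseline to time against. Disassembling both functions (for example with objdump -d) would show whether they contain the same Neon instructions.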

Can you please suggest any tips, or explain how we can analyse this further?

