
FPU vs CPU load/store performance

Hi,

I have compiled my bare-metal FSBL software with VFP enabled, and the linker pulls in a memcpy from the libc library that uses FPU instructions.

In particular, for copy lengths larger than 64 bytes, the FPU-enabled memcpy relies on vldr and vstr operations, whereas the memcpy without FPU support uses ldrd and strd instead.
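
For reference, here is a minimal sketch of what the two inner loops boil down to. This is not the actual libc code, just an illustration; it assumes 8-byte-aligned buffers, a length that is a multiple of 16, and the VFP variant needs -mfpu=vfpv3 (or neon) to assemble:

#include <stddef.h>

/* FPU variant: 64-bit transfers through VFP d-registers (vldr/vstr). */
static void copy_vfp(void *dst, const void *src, size_t len)
{
    const char *s = src;
    char *d = dst;
    for (; len >= 16; len -= 16, s += 16, d += 16) {
        __asm__ volatile (
            "vldr d0, [%0]     \n\t"
            "vldr d1, [%0, #8] \n\t"
            "vstr d0, [%1]     \n\t"
            "vstr d1, [%1, #8] \n\t"
            : : "r"(s), "r"(d) : "d0", "d1", "memory");
    }
}

/* Integer variant: the same 64-bit transfers with ldrd/strd on even/odd
 * core register pairs, as in the memcpy built without FPU support. */
static void copy_ldrd(void *dst, const void *src, size_t len)
{
    const char *s = src;
    char *d = dst;
    for (; len >= 16; len -= 16, s += 16, d += 16) {
        __asm__ volatile (
            "ldrd r2, r3, [%0]     \n\t"
            "ldrd r4, r5, [%0, #8] \n\t"
            "strd r2, r3, [%1]     \n\t"
            "strd r4, r5, [%1, #8] \n\t"
            : : "r"(s), "r"(d) : "r2", "r3", "r4", "r5", "memory");
    }
}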

All four instructions have an access granularity of 64 bits.

So when I compare the execution time of my FSBL with and without FPU support, I logically end up with the same result.
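
For completeness, this is roughly how the two builds can be timed on the Cortex-A9, using the ARMv7 PMU cycle counter (PMCCNTR); the helper names are mine, and it assumes the code runs at a privilege level that can access the PMU (which is the case in a bare-metal FSBL):

#include <stdint.h>
#include <string.h>

/* Enable the PMU cycle counter: set PMCR.E and reset the counter,
 * then enable the cycle counter via bit 31 of PMCNTENSET. */
static void enable_ccnt(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" : : "r"(0x5u));     /* PMCR */
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31)); /* PMCNTENSET */
}

/* Read the current cycle count (PMCCNTR). */
static inline uint32_t read_ccnt(void)
{
    uint32_t cycles;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}

/* Cycles spent in one memcpy call of 'len' bytes. */
static uint32_t time_memcpy(void *dst, const void *src, size_t len)
{
    uint32_t start = read_ccnt();
    memcpy(dst, src, len);
    return read_ccnt() - start;
}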

Of course the Cortex-A9 is best at parallelized comparisons, MAC operations, and arithmetic on large datasets, thanks to its 64-bit extension data registers.
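
As a trivial example of the kind of workload I mean, a multiply-accumulate over float arrays, four lanes per instruction, written with standard NEON intrinsics (n assumed to be a multiple of 4, needs -mfpu=neon):

#include <arm_neon.h>

/* Sum of a[i]*b[i] using vmla on the 128-bit q-registers. */
static float dot4(const float *a, const float *b, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    /* Horizontal add of the four accumulator lanes. */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    return vget_lane_f32(vpadd_f32(s, s), 0);
}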

However, for memory copies the FPU does not seem to bring any advantage, especially when the caches are enabled (FPU/NEON uses the L2 cache but not the L1 cache).

So why is there a different version of memcpy when the FPU is enabled? For which purpose is the FPU load/store capability useful?

Thank you for your hints.

Florian
