FPU vs CPU load/store performance

Hi,

I have compiled my bare-metal FSBL software with VFP enabled, and the linker now pulls in the memcpy variant from libc that uses FPU instructions.

In particular, for copy lengths larger than 64 bytes, the FPU-enabled memcpy relies on vldr and vstr operations. The memcpy without FPU support uses ldrd and strd instead.

All four operations have an access granularity of 64 bits.
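To make the 64-bit granularity concrete, here is a minimal portable C sketch of a copy loop that moves one 64-bit chunk per step, which is roughly what a single ldrd/strd or vldr/vstr pair does. The name `copy64` is hypothetical and this is not the libc implementation; the real routines are hand-written assembly with unrolling and alignment handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, NOT the libc memcpy: copies n bytes in 64-bit
 * chunks, then finishes the remaining tail byte by byte. */
static void *copy64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 8) {
        uint64_t chunk;
        memcpy(&chunk, s, sizeof chunk); /* alignment-safe 64-bit load  */
        memcpy(d, &chunk, sizeof chunk); /* alignment-safe 64-bit store */
        d += 8;
        s += 8;
        n -= 8;
    }
    while (n--) {
        *d++ = *s++;                     /* byte tail (fewer than 8 bytes) */
    }
    return dst;
}
```

Whether the compiler turns each chunk into ldrd/strd or vldr/vstr depends on the target flags, but in both cases each access moves the same 8 bytes.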

So when I compare the execution time of my FSBL with and without FPU support, I logically get the same result.

Of course, the Cortex-A9 with FPU/NEON is well suited to parallelized comparisons, MAC operations, and arithmetic on large datasets, thanks to its 64-bit extension data registers.

However, for memory copies, the FPU does not seem to bring any advantage, especially when caches are enabled (FPU/NEON uses the L2 cache but not the L1 cache).

So why is there a different version of memcpy when the FPU is enabled? What is the FPU load/store capability useful for?

Thank you for your hints.

Florian

42Bastian Schick, over 5 years ago, in reply to flongnos

But with the FPU version, you clobber 4 registers; without it, you clobber 8. Since r4-r8 are callee-saved, the non-FPU version has to push and restore them.

So you need to check the overall "costs" of the FPU vs. non-FPU version.
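One way to check that overall cost is to time whole calls, so that any push/pop in the prologue and epilogue is included in the measurement. The following is a minimal sketch, not a definitive benchmark: `time_copy`, `copy_fn`, `tick_fn` and `host_ticks` are hypothetical names, and `host_ticks` is a host-side stand-in based on `clock()`. On the Cortex-A9 target you would presumably replace it with a read of a hardware cycle counter.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical types: any memcpy-like routine and any tick source. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);
typedef uint32_t (*tick_fn)(void);

/* Host-side stand-in tick source (assumption: replace on the target). */
static uint32_t host_ticks(void)
{
    return (uint32_t)clock();
}

/* Returns the ticks spent in `reps` complete calls of `copy`, so register
 * save/restore overhead is counted along with the copy loop itself.
 * Unsigned subtraction keeps the result valid across one counter wrap. */
static uint32_t time_copy(copy_fn copy, tick_fn now,
                          void *dst, const void *src,
                          size_t n, unsigned reps)
{
    uint32_t start = now();
    for (unsigned i = 0; i < reps; i++) {
        copy(dst, src, n);
    }
    return now() - start;
}
```

Calling `time_copy(memcpy, host_ticks, dst, src, len, reps)` once in each build (with and without VFP) and for several values of `len` gives comparable numbers; short copies are where a fixed save/restore overhead would show up most.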