Hi,
I have compiled my bare-metal FSBL software with VFP enabled, and the linker picks the memcpy variant from libc that uses FPU instructions.
In particular, for copy lengths larger than 64 bytes, the FPU-enabled memcpy relies on vstr and vldr operations. The memcpy without FPU support uses ldrd and strd instead.
All four operations have an access granularity of 64 bits.
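To make the 64-bit granularity concrete, here is a minimal C sketch of a copy loop that moves one 64-bit element per iteration. The name copy_dwords is hypothetical and this is not the libc implementation; which load/store instructions the compiler actually emits for it depends on the toolchain and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only, not the libc memcpy.
 * Copies n_dwords elements of 64 bits each. Every iteration moves
 * exactly 8 bytes, which is the same amount of data a single
 * vldr/vstr or ldrd/strd pair transfers. */
void copy_dwords(uint64_t *dst, const uint64_t *src, size_t n_dwords)
{
    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < n_dwords; i++) {
        dst[i] = src[i];   /* one 64-bit load followed by one 64-bit store */
    }
}
```

Whichever register file holds the data in between, the number of bytes moved per load/store pair is the same, which is consistent with the identical timings described here.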
So when I compare the execution time of my FSBL with and without FPU support, it logically ends up with the same result.
Of course, the FPU/NEON unit of the Cortex-A9 is well suited to parallelized comparisons, MAC operations, and arithmetic on large datasets, thanks to its 64-bit extension data registers.
However, for memory copies the FPU does not seem to bring any advantage, especially when caches are enabled (FPU/NEON uses L2 cache but not L1 cache).
So why is there a different version of memcpy when the FPU is enabled? For what purpose is the FPU load/store capability useful?
Thank you for your hints.
Florian
Thank you Bastien.
So it seems the memcpy from gcc-arm-9.2-2019.12-x86_64-arm-none-eabi does not exploit the FPU properly.
Here is an extract of the disassembly I get for memcpy with FPU:
1e2108a4: ed8c0b0a vstr d0, [ip, #40] ; 0x28
1e2108a8: ed910b0a vldr d0, [r1, #40] ; 0x28
1e2108ac: ed8c1b0c vstr d1, [ip, #48] ; 0x30
1e2108b0: ed911b0c vldr d1, [r1, #48] ; 0x30
1e2108b4: ed8c2b0e vstr d2, [ip, #56] ; 0x38
1e2108b8: ed912b0e vldr d2, [r1, #56] ; 0x38
1e2108bc: ed8c4b10 vstr d4, [ip, #64] ; 0x40
1e2108c0: ed914b10 vldr d4, [r1, #64] ; 0x40
And without FPU:
1e210c84: e1cc22f8 strd r2, [ip, #40] ; 0x28
1e210c88: e1c122d8 ldrd r2, [r1, #40] ; 0x28
1e210c8c: e1cc43f0 strd r4, [ip, #48] ; 0x30
1e210c90: e1c143d0 ldrd r4, [r1, #48] ; 0x30
1e210c94: e1cc63f8 strd r6, [ip, #56] ; 0x38
1e210c98: e1c163d8 ldrd r6, [r1, #56] ; 0x38
1e210c9c: e1ec84f0 strd r8, [ip, #64]! ; 0x40
1e210ca0: e1e184d0 ldrd r8, [r1, #64]! ; 0x40
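Every copy is 64 bits wide.
But with the FPU version you clobber 4 registers (d0, d1, d2 and d4 in this extract). Without it, it is 8 registers (r2-r9, since each ldrd/strd uses a register pair). Since r4-r9 are callee-saved, you need to push and restore them.
So you need to check the overall "cost" of the FPU vs. the non-FPU version, including that save/restore overhead.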
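One way to check that overall cost is to time whole calls rather than the inner loop, so that any push/pop of callee-saved registers is included. The sketch below is only an assumption about how such a measurement could be set up: time_copy and copy_fn are hypothetical names, and clock() is used for portability. On a bare-metal target you would likely replace it with a cycle counter or hardware timer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and any alternative copy routine under test. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical harness: returns the clock() ticks spent on `reps` complete
 * calls of `fn`, each copying `n` bytes. Timing whole calls means the
 * function prologue/epilogue (register save/restore) is part of the result. */
clock_t time_copy(copy_fn fn, void *dst, const void *src, size_t n, unsigned reps)
{
    assert(fn != NULL);

    clock_t start = clock();
    for (unsigned i = 0; i < reps; i++) {
        fn(dst, src, n);
    }
    return clock() - start;
}
```

Running this with the same buffers against both builds, once with short lengths (where call overhead dominates) and once with long lengths (where the load/store loop dominates), should show whether the register save/restore makes a measurable difference.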