
FPU vs CPU load/store performance

Hi,

I have compiled my bare-metal FSBL software with VFP enabled, and the linker pulls in a memcpy from the libc library that uses FPU instructions.

In particular, for copy lengths larger than 64 bytes, the FPU-enabled memcpy relies on vldr and vstr operations, whereas the memcpy without FPU support uses ldrd and strd instead.
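
For reference, here is a minimal sketch of what the two inner loops boil down to. This is not the actual libc code, just an illustration; it assumes 8-byte-aligned buffers, a length that is a multiple of 16, and the VFP variant needs -mfpu=vfpv3 (or neon) to assemble:

#include <stddef.h>

/* FPU variant: 64-bit transfers through VFP d-registers (vldr/vstr). */
static void copy_vfp(void *dst, const void *src, size_t len)
{
    const char *s = src;
    char *d = dst;
    for (; len >= 16; len -= 16, s += 16, d += 16) {
        __asm__ volatile (
            "vldr d0, [%0]     \n\t"
            "vldr d1, [%0, #8] \n\t"
            "vstr d0, [%1]     \n\t"
            "vstr d1, [%1, #8] \n\t"
            : : "r"(s), "r"(d) : "d0", "d1", "memory");
    }
}

/* Integer variant: the same 64-bit transfers with ldrd/strd on even/odd
 * core register pairs, as in the memcpy built without FPU support. */
static void copy_ldrd(void *dst, const void *src, size_t len)
{
    const char *s = src;
    char *d = dst;
    for (; len >= 16; len -= 16, s += 16, d += 16) {
        __asm__ volatile (
            "ldrd r2, r3, [%0]     \n\t"
            "ldrd r4, r5, [%0, #8] \n\t"
            "strd r2, r3, [%1]     \n\t"
            "strd r4, r5, [%1, #8] \n\t"
            : : "r"(s), "r"(d) : "r2", "r3", "r4", "r5", "memory");
    }
}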

All four instructions have an access granularity of 64 bits.

So when I compare the execution time of my FSBL with and without FPU support, I logically end up with the same result.
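
For completeness, this is roughly how the two builds can be timed on the Cortex-A9, using the ARMv7 PMU cycle counter (PMCCNTR); the helper names are mine, and it assumes the code runs at a privilege level that can access the PMU (which is the case in a bare-metal FSBL):

#include <stdint.h>
#include <string.h>

/* Enable the PMU cycle counter: set PMCR.E and reset the counter,
 * then enable the cycle counter via bit 31 of PMCNTENSET. */
static void enable_ccnt(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" : : "r"(0x5u));     /* PMCR */
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31)); /* PMCNTENSET */
}

/* Read the current cycle count (PMCCNTR). */
static inline uint32_t read_ccnt(void)
{
    uint32_t cycles;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}

/* Cycles spent in one memcpy call of 'len' bytes. */
static uint32_t time_memcpy(void *dst, const void *src, size_t len)
{
    uint32_t start = read_ccnt();
    memcpy(dst, src, len);
    return read_ccnt() - start;
}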

Of course the Cortex-A9 is best at parallelized comparisons, MAC operations, and arithmetic on large datasets, thanks to its 64-bit extension data registers.
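
As a trivial example of the kind of workload I mean, a multiply-accumulate over float arrays, four lanes per instruction, written with standard NEON intrinsics (n assumed to be a multiple of 4, needs -mfpu=neon):

#include <arm_neon.h>

/* Sum of a[i]*b[i] using vmla on the 128-bit q-registers. */
static float dot4(const float *a, const float *b, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    /* Horizontal add of the four accumulator lanes. */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    return vget_lane_f32(vpadd_f32(s, s), 0);
}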

However, for memory copies the FPU does not seem to bring any advantage, especially when the caches are enabled (FPU/NEON uses the L2 cache but not the L1 cache).

So why is there a different version of memcpy when the FPU is enabled? For which purpose is the FPU load/store capability useful?

Thank you for your hints.

Florian
