FPU vs CPU load/store performance

Hi,

I have compiled my bare-metal FSBL software with VFP enabled, and the linker now pulls in the libc memcpy variant that uses FPU instructions.

In particular, for copy lengths larger than 64 bytes, the FPU-enabled memcpy relies on vldr and vstr instructions. The memcpy without FPU support uses ldrd and strd instead.

All four instructions have an access granularity of 64 bits.
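
To make the structure described above concrete, here is a minimal portable sketch of a copy routine that moves 64-byte blocks as eight 64-bit transfers and finishes the tail byte by byte. It is not the newlib implementation; the name copy_blocks64 and the exact block layout are assumptions made for illustration. On the target, each 64-bit transfer would map to an ldrd/strd pair or a vldr/vstr pair depending on the build options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not newlib's memcpy): copy n bytes from src to dst.
 * Whole 64-byte blocks are moved as eight 64-bit transfers; the remaining
 * tail is copied one byte at a time. Buffers must not overlap. */
void *copy_blocks64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= 64) {
        for (size_t i = 0; i < 8; i++) {
            uint64_t w;
            /* Fixed-size memcpy is the portable way to express one
             * 64-bit load and one 64-bit store without alignment traps. */
            memcpy(&w, s + 8 * i, sizeof w);
            memcpy(d + 8 * i, &w, sizeof w);
        }
        s += 64;
        d += 64;
        n -= 64;
    }

    /* Tail: fewer than 64 bytes left. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Whichever register file holds the 64-bit value in transit, the number of bytes moved per instruction is the same, which is consistent with the identical timings reported here.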

So when I compare the execution time of my FSBL with and without FPU support, I logically get the same result.

Of course, the Cortex-A9 FPU/NEON unit is well suited to parallelized comparisons, MAC operations, and arithmetic on large datasets, thanks to its bank of 64-bit extension data registers.

However, for memory copies, the FPU does not seem to bring any advantage, especially when caches are enabled (FPU/NEON uses L2 cache but not L1 cache).

So why is there a different version of memcpy when the FPU is enabled? What is the FPU load/store capability actually useful for?

Thank you for your hints.

Florian

  • Load or store 128 bits with a single instruction.
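
A 128-bit transfer of the kind mentioned in this reply could be sketched as follows. This assumes the reply refers to NEON quadword loads and stores and that the build enables NEON (for example with -mfpu=neon); the helper name copy_128bit_blocks is hypothetical. When NEON is not available, the sketch falls back to a plain 16-byte copy so that it still compiles on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy `blocks` chunks of 16 bytes from src to dst.
 * Buffers must not overlap. */
void copy_128bit_blocks(uint8_t *dst, const uint8_t *src, size_t blocks)
{
    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < blocks; i++) {
#if defined(__ARM_NEON)
        /* One 128-bit load into a Q register, then one 128-bit store. */
        uint8x16_t q = vld1q_u8(src + 16 * i);
        vst1q_u8(dst + 16 * i, q);
#else
        /* Portable fallback when NEON intrinsics are unavailable. */
        memcpy(dst + 16 * i, src + 16 * i, 16);
#endif
    }
}
```

Whether this is faster than the 64-bit versions on a given board depends on the memory system, so it is worth checking the generated disassembly and measuring.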

  • Thank you Bastien.

    So it seems the memcpy from gcc-arm-9.2-2019.12-x86_64-arm-none-eabi does not exploit the FPU properly.

    Here is an extract of the disassembly I get for memcpy with FPU:

    1e2108a4:    ed8c0b0a     vstr    d0, [ip, #40]    ; 0x28
    1e2108a8:    ed910b0a     vldr    d0, [r1, #40]    ; 0x28
    1e2108ac:    ed8c1b0c     vstr    d1, [ip, #48]    ; 0x30
    1e2108b0:    ed911b0c     vldr    d1, [r1, #48]    ; 0x30
    1e2108b4:    ed8c2b0e     vstr    d2, [ip, #56]    ; 0x38
    1e2108b8:    ed912b0e     vldr    d2, [r1, #56]    ; 0x38
    1e2108bc:    ed8c4b10     vstr    d4, [ip, #64]    ; 0x40
    1e2108c0:    ed914b10     vldr    d4, [r1, #64]    ; 0x40

    And without FPU:

    1e210c84:    e1cc22f8     strd    r2, [ip, #40]    ; 0x28
    1e210c88:    e1c122d8     ldrd    r2, [r1, #40]    ; 0x28
    1e210c8c:    e1cc43f0     strd    r4, [ip, #48]    ; 0x30
    1e210c90:    e1c143d0     ldrd    r4, [r1, #48]    ; 0x30
    1e210c94:    e1cc63f8     strd    r6, [ip, #56]    ; 0x38
    1e210c98:    e1c163d8     ldrd    r6, [r1, #56]    ; 0x38
    1e210c9c:    e1ec84f0     strd    r8, [ip, #64]!    ; 0x40
    1e210ca0:    e1e184d0     ldrd    r8, [r1, #64]!    ; 0x40

    Every access is 64 bits wide.

  • But in the FPU case, you only clobber 4 registers (d0-d2 and d4). Without it, you clobber 8 core registers (r2-r9). Since r4-r9 are callee-saved, you need to push and restore them.

    So you need to check the overall "costs" of the FPU vs. non-FPU version.
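
One way to check the overall cost is to time both builds over a range of copy lengths, since the push/pop overhead matters most for short copies. The harness below is a minimal sketch: time_copy and copy_fn are hypothetical names, clock() is assumed to be available (on a bare-metal Cortex-A9 a cycle counter or hardware timer would likely replace it), and the empty asm statement is a GCC/Clang-specific compiler barrier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and any alternative copy routine. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical harness: CPU time in seconds spent on `iters` calls of
 * `fn`, each copying `len` bytes from src to dst. */
double time_copy(copy_fn fn, void *dst, const void *src,
                 size_t len, unsigned iters)
{
    assert(fn != NULL);

    clock_t start = clock();
    for (unsigned i = 0; i < iters; i++) {
        fn(dst, src, len);
        /* Compiler barrier so repeated copies are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(memcpy, dst, src, len, iters) once in the FPU build and once in the non-FPU build, for both short and long values of len, gives a like-for-like comparison that includes the register save and restore cost.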