Why is memset() not optimized?

I'm using ARM DS 2021 with compiler 6.16 and a Cortex-A53 CPU in AArch64 mode.

I wrote a small program that calls memset() and set a high optimization level (O2; I also tried O3 and Omax). The disassembly shows this:

  _memset
        0x00008c14:    b4000261    a...    CBZ      x1,0x8c60 ; _memset + 76
        0x00008c18:    36000060    `..6    TBZ      w0,#0,0x8c24 ; _memset + 16
        0x00008c1c:    38001402    ...8    STRB     w2,[x0],#1
        0x00008c20:    d1000421    !...    SUB      x1,x1,#1
        0x00008c24:    f1000828    (...    SUBS     x8,x1,#2
        0x00008c28:    54000143    C..T    B.CC     0x8c50 ; _memset + 60
        0x00008c2c:    36080060    `..6    TBZ      w0,#1,0x8c38 ; _memset + 36
        0x00008c30:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008c34:    aa0803e1    ....    MOV      x1,x8
        0x00008c38:    f100103f    ?...    CMP      x1,#4
        0x00008c3c:    540000a3    ...T    B.CC     0x8c50 ; _memset + 60
        0x00008c40:    d1001021    !...    SUB      x1,x1,#4
        0x00008c44:    f1000c3f    ?...    CMP      x1,#3
        0x00008c48:    b8004402    .D..    STR      w2,[x0],#4
        0x00008c4c:    54ffffa8    ...T    B.HI     0x8c40 ; _memset + 44
        0x00008c50:    36080041    A..6    TBZ      w1,#1,0x8c58 ; _memset + 68
        0x00008c54:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008c58:    36000041    A..6    TBZ      w1,#0,0x8c60 ; _memset + 76
        0x00008c5c:    39000002    ...9    STRB     w2,[x0,#0]
        0x00008c60:    d65f03c0    .._.    RET      
    __aeabi_memclr4
    __aeabi_memclr8
    __rt_memclr_w
        0x00008c64:    f100103f    ?...    CMP      x1,#4
        0x00008c68:    540000a3    ...T    B.CC     0x8c7c ; __aeabi_memclr4 + 24
        0x00008c6c:    d1001021    !...    SUB      x1,x1,#4
        0x00008c70:    f1000c3f    ?...    CMP      x1,#3
        0x00008c74:    b800441f    .D..    STR      wzr,[x0],#4
        0x00008c78:    54ffffa8    ...T    B.HI     0x8c6c ; __aeabi_memclr4 + 8
        0x00008c7c:    37080061    a..7    TBNZ     w1,#1,0x8c88 ; __aeabi_memclr4 + 36
        0x00008c80:    37000081    ...7    TBNZ     w1,#0,0x8c90 ; __aeabi_memclr4 + 44
        0x00008c84:    d65f03c0    .._.    RET      
        0x00008c88:    7800241f    .$.x    STRH     wzr,[x0],#2
        0x00008c8c:    3607ffc1    ...6    TBZ      w1,#0,0x8c84 ; __aeabi_memclr4 + 32
        0x00008c90:    3900001f    ...9    STRB     wzr,[x0,#0]
        0x00008c94:    d65f03c0    .._.    RET      
    _memset_w
        0x00008c98:    f100103f    ?...    CMP      x1,#4
        0x00008c9c:    540000a3    ...T    B.CC     0x8cb0 ; _memset_w + 24
        0x00008ca0:    d1001021    !...    SUB      x1,x1,#4
        0x00008ca4:    f1000c3f    ?...    CMP      x1,#3
        0x00008ca8:    b8004402    .D..    STR      w2,[x0],#4
        0x00008cac:    54ffffa8    ...T    B.HI     0x8ca0 ; _memset_w + 8
        0x00008cb0:    37080061    a..7    TBNZ     w1,#1,0x8cbc ; _memset_w + 36
        0x00008cb4:    37000081    ...7    TBNZ     w1,#0,0x8cc4 ; _memset_w + 44
        0x00008cb8:    d65f03c0    .._.    RET      
        0x00008cbc:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008cc0:    3607ffc1    ...6    TBZ      w1,#0,0x8cb8 ; _memset_w + 32
        0x00008cc4:    39000002    ...9    STRB     w2,[x0,#0]
        0x00008cc8:    d65f03c0    .._.    RET      

Linker chooses library c_ou.l

As we can see, the function is barely optimized: it uses at most 32-bit accesses on a 64-bit CPU.

No NEON registers are used.

Why is this function so bad?

I thought "highly optimized libraries" would look a lot better :(

  • Hi

    My name is Stephen and I work at Arm.

    When implementing a library, there is a trade-off to be made between code-size, performance and architectural features.  For example do you favor code-size over performance?  Can unaligned accesses be used?  What if the SIMD unit isn't enabled?  What if the memset is to device memory?

    In an ideal world we would provide multiple implementations for different contexts, as we did with some of the 32-bit Cortex processors.  Unfortunately, we have not yet been able to put as much effort into performance tuning of the AArch64 libraries provided with Arm Compiler 6.

    Arm's general effort in this area has gone into the Arm optimized routines open source project. These tend to favor high performance over all other concerns. Take a look at github.com/.../memset.S which uses unaligned accesses and Neon.

    If you build your own memset() then the linker won't select the built-in library version at link time.

    Hope this helps,

    Stephen
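
As an illustration of the "build your own memset()" suggestion above, here is a minimal sketch of a C replacement that uses 64-bit stores for the bulk of the buffer. It is not Arm's implementation and not the optimized-routines code. The name `my_memset` is hypothetical: the function would have to be named `memset` for the linker to pick it over the library version, and in that case it may be necessary to build the file with a flag such as `-fno-builtin` so the compiler does not turn the loops back into a call to memset() (an assumption to check against your toolchain). It also assumes Normal memory; it says nothing about Device memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical replacement routine; rename to memset to override the library copy. */
void *my_memset(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    const unsigned char byte = (unsigned char)value;

    /* Head: byte stores until the pointer is 8-byte aligned (or n runs out). */
    while (n > 0 && ((uintptr_t)p & 7u) != 0) {
        *p++ = byte;
        n--;
    }

    /* Replicate the fill byte into all eight bytes of a 64-bit word. */
    const uint64_t pattern = 0x0101010101010101ULL * byte;

    /* Body: one 8-byte store per iteration. A fixed-size memcpy avoids
       aliasing problems and is typically lowered to a single 64-bit store. */
    while (n >= 8) {
        memcpy(p, &pattern, sizeof pattern);
        p += 8;
        n -= 8;
    }

    /* Tail: remaining 0..7 bytes. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }

    return dst;
}
```

A call such as `my_memset(buf, 0xAB, len)` behaves like the standard function and returns `buf`. The hand-written assembly in the optimized-routines project goes further (unaligned accesses, Neon), so treat this only as a starting point.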

  • Hello, Stephen.


    From my experience with ARM Compiler v5, I considered it best in class and expected version 6 to be no worse in code quality.

    I'm sorry to say my expectations were a little too high.

    Thanks a lot for the link to the routines; they seem great and will be very helpful to me!

    Regards,

    Vlad

  • Note that the compiler and optimization settings of your project are essentially irrelevant when it comes to library functions like memset(); that's all library code that is pre-built and provided as .a files.

  • When calling asm functions from C code, does the compiler automatically save and restore registers according to the calling convention?

    For example, the memcpy routine from the link above uses registers x0-x17, which is a lot, and none of them are saved on the stack or restored...

  • Hi again

    The compiler will implement calling conventions described in the "Procedure Call Standard for the Arm 64-bit Architecture (AArch64)".  See github.com/.../aapcs64.rst

    In particular, see section 6.1.1 General-purpose Registers, which shows the roles of the registers.

    Stephen
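
To make the register roles from that section concrete, here is a small C sketch that encodes them as a lookup. The function names are hypothetical and the role strings are paraphrased, so the AAPCS64 document remains the authority. The point it illustrates is that x0-x17 are not preserved across a call (the caller saves any it still needs), which is why a routine using only those registers has nothing to save or restore; a subroutine is required to preserve x19-x29 and SP.

```c
#include <stdbool.h>

/* Hypothetical helper: does a callee have to preserve general-purpose
   register xN across the call? Per AAPCS64 this covers x19..x29. */
bool aapcs64_callee_must_preserve(unsigned n)
{
    return n >= 19 && n <= 29;
}

/* Hypothetical helper: short description of the role of register xN. */
const char *aapcs64_gpr_role(unsigned n)
{
    if (n <= 7)  return "argument/result";
    if (n == 8)  return "indirect result location";
    if (n <= 15) return "temporary (caller-saved)";
    if (n <= 17) return "intra-procedure-call scratch (IP0/IP1)";
    if (n == 18) return "platform register";
    if (n <= 28) return "callee-saved";
    if (n == 29) return "frame pointer";
    if (n == 30) return "link register";
    return "not a general-purpose register";
}
```

For every register in the x0-x17 range, `aapcs64_callee_must_preserve` returns false, so the compiler treats those registers as clobbered by any call, whether the callee is written in C or in assembly.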