Why is memset not optimized?

I'm using ARM DS 2021 with compiler 6.16 and a Cortex-A53 CPU in AArch64 mode.

I wrote a small program with a call to the memset() function and set a high optimization level, O2 (I also tried O3 and Omax). The disassembly shows this:

  _memset
        0x00008c14:    b4000261    a...    CBZ      x1,0x8c60 ; _memset + 76
        0x00008c18:    36000060    `..6    TBZ      w0,#0,0x8c24 ; _memset + 16
        0x00008c1c:    38001402    ...8    STRB     w2,[x0],#1
        0x00008c20:    d1000421    !...    SUB      x1,x1,#1
        0x00008c24:    f1000828    (...    SUBS     x8,x1,#2
        0x00008c28:    54000143    C..T    B.CC     0x8c50 ; _memset + 60
        0x00008c2c:    36080060    `..6    TBZ      w0,#1,0x8c38 ; _memset + 36
        0x00008c30:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008c34:    aa0803e1    ....    MOV      x1,x8
        0x00008c38:    f100103f    ?...    CMP      x1,#4
        0x00008c3c:    540000a3    ...T    B.CC     0x8c50 ; _memset + 60
        0x00008c40:    d1001021    !...    SUB      x1,x1,#4
        0x00008c44:    f1000c3f    ?...    CMP      x1,#3
        0x00008c48:    b8004402    .D..    STR      w2,[x0],#4
        0x00008c4c:    54ffffa8    ...T    B.HI     0x8c40 ; _memset + 44
        0x00008c50:    36080041    A..6    TBZ      w1,#1,0x8c58 ; _memset + 68
        0x00008c54:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008c58:    36000041    A..6    TBZ      w1,#0,0x8c60 ; _memset + 76
        0x00008c5c:    39000002    ...9    STRB     w2,[x0,#0]
        0x00008c60:    d65f03c0    .._.    RET      
    __aeabi_memclr4
    __aeabi_memclr8
    __rt_memclr_w
        0x00008c64:    f100103f    ?...    CMP      x1,#4
        0x00008c68:    540000a3    ...T    B.CC     0x8c7c ; __aeabi_memclr4 + 24
        0x00008c6c:    d1001021    !...    SUB      x1,x1,#4
        0x00008c70:    f1000c3f    ?...    CMP      x1,#3
        0x00008c74:    b800441f    .D..    STR      wzr,[x0],#4
        0x00008c78:    54ffffa8    ...T    B.HI     0x8c6c ; __aeabi_memclr4 + 8
        0x00008c7c:    37080061    a..7    TBNZ     w1,#1,0x8c88 ; __aeabi_memclr4 + 36
        0x00008c80:    37000081    ...7    TBNZ     w1,#0,0x8c90 ; __aeabi_memclr4 + 44
        0x00008c84:    d65f03c0    .._.    RET      
        0x00008c88:    7800241f    .$.x    STRH     wzr,[x0],#2
        0x00008c8c:    3607ffc1    ...6    TBZ      w1,#0,0x8c84 ; __aeabi_memclr4 + 32
        0x00008c90:    3900001f    ...9    STRB     wzr,[x0,#0]
        0x00008c94:    d65f03c0    .._.    RET      
    _memset_w
        0x00008c98:    f100103f    ?...    CMP      x1,#4
        0x00008c9c:    540000a3    ...T    B.CC     0x8cb0 ; _memset_w + 24
        0x00008ca0:    d1001021    !...    SUB      x1,x1,#4
        0x00008ca4:    f1000c3f    ?...    CMP      x1,#3
        0x00008ca8:    b8004402    .D..    STR      w2,[x0],#4
        0x00008cac:    54ffffa8    ...T    B.HI     0x8ca0 ; _memset_w + 8
        0x00008cb0:    37080061    a..7    TBNZ     w1,#1,0x8cbc ; _memset_w + 36
        0x00008cb4:    37000081    ...7    TBNZ     w1,#0,0x8cc4 ; _memset_w + 44
        0x00008cb8:    d65f03c0    .._.    RET      
        0x00008cbc:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008cc0:    3607ffc1    ...6    TBZ      w1,#0,0x8cb8 ; _memset_w + 32
        0x00008cc4:    39000002    ...9    STRB     w2,[x0,#0]
        0x00008cc8:    d65f03c0    .._.    RET      

The linker chooses library c_ou.l

As we can see, the function is barely optimized: it uses at most 32-bit accesses on a 64-bit CPU.

No NEON registers are used.

Why is this function so bad?

I thought "highly optimized libraries" would look way better :(

  • Hi

    My name is Stephen and I work at Arm.

    When implementing a library, there is a trade-off to be made between code size, performance and architectural features.  For example, do you favor code size over performance?  Can unaligned accesses be used?  What if the SIMD unit isn't enabled?  What if the memset is to device memory?

    In an ideal world we would provide multiple implementations for different contexts, as we did with some of the 32-bit Cortex processors.  Unfortunately, we've not yet been able to put as much effort into performance tuning of the AArch64 libraries provided with Arm Compiler 6.

    Arm's general effort in this area has gone into the Arm Optimized Routines open source project. These routines tend to favor high performance over all other concerns. Take a look at github.com/.../memset.S, which uses unaligned accesses and Neon.

    If you build your own memset() into your image, the linker will use it and won't select the built-in library version at link time.

    Hope this helps,

    Stephen
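
To illustrate the last point of the reply, here is a minimal sketch of a user-supplied fill routine in portable C. It is not Arm's library code and not the Arm Optimized Routines implementation. The name `my_memset` is hypothetical, chosen so the sketch can be tested next to the library function. It assumes normal (non-device) memory and uses only aligned 64-bit stores, with no Neon and no unaligned accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name. Giving it the name and signature of memset() would,
   per the reply above, make the linker use it instead of the library one. */
void *my_memset(void *dest, int value, size_t count)
{
    unsigned char *p = dest;
    const unsigned char byte = (unsigned char)value;

    /* Head: byte stores until p is 8-byte aligned or count runs out. */
    while (count > 0 && ((uintptr_t)p & 7u) != 0) {
        *p++ = byte;
        count--;
    }

    /* Replicate the fill byte into all eight bytes of a 64-bit word. */
    const uint64_t pattern = (uint64_t)byte * UINT64_C(0x0101010101010101);

    /* Body: one aligned 8-byte store per iteration. A fixed-size memcpy
       avoids strict-aliasing problems; optimizing compilers typically
       lower it to a single 64-bit store. */
    while (count >= 8) {
        memcpy(p, &pattern, sizeof pattern);
        p += sizeof pattern;
        count -= sizeof pattern;
    }

    /* Tail: the remaining 0..7 bytes. */
    while (count > 0) {
        *p++ = byte;
        count--;
    }

    return dest;
}
```

A call such as `my_memset(buf, 0, sizeof buf)` behaves like the standard function. If you rename it to `memset`, check the disassembly afterwards: an optimizing compiler may recognize the fill loops and turn them back into a call to memset, which would then call itself.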
