Why is memset() not optimized?

I'm using ARM DS 2021 with compiler 6.16 and a Cortex-A53 CPU in AArch64 mode.

I wrote a small program that calls memset() and built it with high optimization, -O2 (I also tried -O3 and -Omax). The disassembly shows this:

  _memset
        0x00008c14:    b4000261    a...    CBZ      x1,0x8c60 ; _memset + 76
        0x00008c18:    36000060    `..6    TBZ      w0,#0,0x8c24 ; _memset + 16
        0x00008c1c:    38001402    ...8    STRB     w2,[x0],#1
        0x00008c20:    d1000421    !...    SUB      x1,x1,#1
        0x00008c24:    f1000828    (...    SUBS     x8,x1,#2
        0x00008c28:    54000143    C..T    B.CC     0x8c50 ; _memset + 60
        0x00008c2c:    36080060    `..6    TBZ      w0,#1,0x8c38 ; _memset + 36
        0x00008c30:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008c34:    aa0803e1    ....    MOV      x1,x8
        0x00008c38:    f100103f    ?...    CMP      x1,#4
        0x00008c3c:    540000a3    ...T    B.CC     0x8c50 ; _memset + 60
        0x00008c40:    d1001021    !...    SUB      x1,x1,#4
        0x00008c44:    f1000c3f    ?...    CMP      x1,#3
        0x00008c48:    b8004402    .D..    STR      w2,[x0],#4
        0x00008c4c:    54ffffa8    ...T    B.HI     0x8c40 ; _memset + 44
        0x00008c50:    36080041    A..6    TBZ      w1,#1,0x8c58 ; _memset + 68
        0x00008c54:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008c58:    36000041    A..6    TBZ      w1,#0,0x8c60 ; _memset + 76
        0x00008c5c:    39000002    ...9    STRB     w2,[x0,#0]
        0x00008c60:    d65f03c0    .._.    RET      
    __aeabi_memclr4
    __aeabi_memclr8
    __rt_memclr_w
        0x00008c64:    f100103f    ?...    CMP      x1,#4
        0x00008c68:    540000a3    ...T    B.CC     0x8c7c ; __aeabi_memclr4 + 24
        0x00008c6c:    d1001021    !...    SUB      x1,x1,#4
        0x00008c70:    f1000c3f    ?...    CMP      x1,#3
        0x00008c74:    b800441f    .D..    STR      wzr,[x0],#4
        0x00008c78:    54ffffa8    ...T    B.HI     0x8c6c ; __aeabi_memclr4 + 8
        0x00008c7c:    37080061    a..7    TBNZ     w1,#1,0x8c88 ; __aeabi_memclr4 + 36
        0x00008c80:    37000081    ...7    TBNZ     w1,#0,0x8c90 ; __aeabi_memclr4 + 44
        0x00008c84:    d65f03c0    .._.    RET      
        0x00008c88:    7800241f    .$.x    STRH     wzr,[x0],#2
        0x00008c8c:    3607ffc1    ...6    TBZ      w1,#0,0x8c84 ; __aeabi_memclr4 + 32
        0x00008c90:    3900001f    ...9    STRB     wzr,[x0,#0]
        0x00008c94:    d65f03c0    .._.    RET      
    _memset_w
        0x00008c98:    f100103f    ?...    CMP      x1,#4
        0x00008c9c:    540000a3    ...T    B.CC     0x8cb0 ; _memset_w + 24
        0x00008ca0:    d1001021    !...    SUB      x1,x1,#4
        0x00008ca4:    f1000c3f    ?...    CMP      x1,#3
        0x00008ca8:    b8004402    .D..    STR      w2,[x0],#4
        0x00008cac:    54ffffa8    ...T    B.HI     0x8ca0 ; _memset_w + 8
        0x00008cb0:    37080061    a..7    TBNZ     w1,#1,0x8cbc ; _memset_w + 36
        0x00008cb4:    37000081    ...7    TBNZ     w1,#0,0x8cc4 ; _memset_w + 44
        0x00008cb8:    d65f03c0    .._.    RET      
        0x00008cbc:    78002402    .$.x    STRH     w2,[x0],#2
        0x00008cc0:    3607ffc1    ...6    TBZ      w1,#0,0x8cb8 ; _memset_w + 32
        0x00008cc4:    39000002    ...9    STRB     w2,[x0,#0]
        0x00008cc8:    d65f03c0    .._.    RET      

The linker chooses library c_ou.l

As we can see, the function is barely optimized: it uses at most 32-bit accesses (STR of a W register) on a 64-bit CPU.

No NEON registers are used.
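
For comparison, here is a minimal C sketch of the kind of wider fill one might expect: byte stores up to an 8-byte boundary, 64-bit stores for the body, then byte stores for the tail. The name fill_wide is hypothetical and not part of the Arm C library. This only illustrates the access width in question and makes no claim about how a tuned library routine is actually written.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a library function): fills n bytes at dst with
   the byte value c, using 64-bit stores for the aligned middle part. */
void *fill_wide(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    /* Replicate the byte into all eight byte lanes of a 64-bit word. */
    uint64_t pattern = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;

    /* Head: byte stores until p is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 7u) != 0) {
        *p++ = (unsigned char)c;
        n--;
    }

    /* Body: 8 bytes per iteration. A fixed-size memcpy avoids aliasing
       problems; compilers typically lower it to one 64-bit store. */
    while (n >= 8) {
        memcpy(p, &pattern, sizeof pattern);
        p += 8;
        n -= 8;
    }

    /* Tail: the remaining 0..7 bytes. */
    while (n > 0) {
        *p++ = (unsigned char)c;
        n--;
    }
    return dst;
}
```

Calling fill_wide(buf, 0xAB, len) should leave the same bytes in memory as memset(buf, 0xAB, len). Whether it is actually faster on a Cortex-A53 would have to be measured.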

Why is this function so bad?

I thought "highly optimized libraries" should look way better :(