I'm using ARM DS 2021 with compiler 6.16 and a Cortex-A53 CPU in AArch64 mode.
I wrote a small program that calls memset() and set a high optimization level, -O2 (I also tried -O3 and -Omax). The disassembly shows this:
    _memset
        0x00008c14: b4000261 a... CBZ x1,0x8c60 ; _memset + 76
        0x00008c18: 36000060 `..6 TBZ w0,#0,0x8c24 ; _memset + 16
        0x00008c1c: 38001402 ...8 STRB w2,[x0],#1
        0x00008c20: d1000421 !... SUB x1,x1,#1
        0x00008c24: f1000828 (... SUBS x8,x1,#2
        0x00008c28: 54000143 C..T B.CC 0x8c50 ; _memset + 60
        0x00008c2c: 36080060 `..6 TBZ w0,#1,0x8c38 ; _memset + 36
        0x00008c30: 78002402 .$.x STRH w2,[x0],#2
        0x00008c34: aa0803e1 .... MOV x1,x8
        0x00008c38: f100103f ?... CMP x1,#4
        0x00008c3c: 540000a3 ...T B.CC 0x8c50 ; _memset + 60
        0x00008c40: d1001021 !... SUB x1,x1,#4
        0x00008c44: f1000c3f ?... CMP x1,#3
        0x00008c48: b8004402 .D.. STR w2,[x0],#4
        0x00008c4c: 54ffffa8 ...T B.HI 0x8c40 ; _memset + 44
        0x00008c50: 36080041 A..6 TBZ w1,#1,0x8c58 ; _memset + 68
        0x00008c54: 78002402 .$.x STRH w2,[x0],#2
        0x00008c58: 36000041 A..6 TBZ w1,#0,0x8c60 ; _memset + 76
        0x00008c5c: 39000002 ...9 STRB w2,[x0,#0]
        0x00008c60: d65f03c0 .._. RET
    __aeabi_memclr4
    __aeabi_memclr8
    __rt_memclr_w
        0x00008c64: f100103f ?... CMP x1,#4
        0x00008c68: 540000a3 ...T B.CC 0x8c7c ; __aeabi_memclr4 + 24
        0x00008c6c: d1001021 !... SUB x1,x1,#4
        0x00008c70: f1000c3f ?... CMP x1,#3
        0x00008c74: b800441f .D.. STR wzr,[x0],#4
        0x00008c78: 54ffffa8 ...T B.HI 0x8c6c ; __aeabi_memclr4 + 8
        0x00008c7c: 37080061 a..7 TBNZ w1,#1,0x8c88 ; __aeabi_memclr4 + 36
        0x00008c80: 37000081 ...7 TBNZ w1,#0,0x8c90 ; __aeabi_memclr4 + 44
        0x00008c84: d65f03c0 .._. RET
        0x00008c88: 7800241f .$.x STRH wzr,[x0],#2
        0x00008c8c: 3607ffc1 ...6 TBZ w1,#0,0x8c84 ; __aeabi_memclr4 + 32
        0x00008c90: 3900001f ...9 STRB wzr,[x0,#0]
        0x00008c94: d65f03c0 .._. RET
    _memset_w
        0x00008c98: f100103f ?... CMP x1,#4
        0x00008c9c: 540000a3 ...T B.CC 0x8cb0 ; _memset_w + 24
        0x00008ca0: d1001021 !... SUB x1,x1,#4
        0x00008ca4: f1000c3f ?... CMP x1,#3
        0x00008ca8: b8004402 .D.. STR w2,[x0],#4
        0x00008cac: 54ffffa8 ...T B.HI 0x8ca0 ; _memset_w + 8
        0x00008cb0: 37080061 a..7 TBNZ w1,#1,0x8cbc ; _memset_w + 36
        0x00008cb4: 37000081 ...7 TBNZ w1,#0,0x8cc4 ; _memset_w + 44
        0x00008cb8: d65f03c0 .._. RET
        0x00008cbc: 78002402 .$.x STRH w2,[x0],#2
        0x00008cc0: 3607ffc1 ...6 TBZ w1,#0,0x8cb8 ; _memset_w + 32
        0x00008cc4: 39000002 ...9 STRB w2,[x0,#0]
        0x00008cc8: d65f03c0 .._. RET
The linker chooses the library c_ou.l.
As we can see, the function is barely optimized: it uses at most 32-bit accesses on a 64-bit CPU.
No NEON registers are used.
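For reference, here is roughly how I read the _memset listing above, written out as C. This is only my interpretation of the disassembly, not Arm's source. The name memset_sketch is hypothetical, and I am assuming that w2 already holds the fill byte replicated across 32 bits on entry, since the listing stores w2 with STRH and STR without replicating it first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C rendering of the _memset listing; not Arm's source.
 * Assumption: the real routine receives the fill byte already replicated
 * into a 32-bit pattern, so the replication is done here instead. */
static void *memset_sketch(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    uint8_t  b = (uint8_t)value;
    uint16_t h = (uint16_t)(b * 0x0101u);   /* byte repeated twice      */
    uint32_t w = b * 0x01010101u;           /* byte repeated four times */

    if (n == 0)                             /* CBZ x1 */
        return dst;

    if ((uintptr_t)p & 1u) {                /* TBZ w0,#0 / STRB */
        *p++ = b;
        n--;
    }
    if (n >= 2 && ((uintptr_t)p & 2u)) {    /* TBZ w0,#1 / STRH */
        memcpy(p, &h, 2);
        p += 2;
        n -= 2;
    }
    while (n >= 4) {                        /* main loop: one 32-bit STR per pass */
        memcpy(p, &w, 4);
        p += 4;
        n -= 4;
    }
    if (n & 2u) {                           /* tail halfword */
        memcpy(p, &h, 2);
        p += 2;
    }
    if (n & 1u)                             /* tail byte */
        *p = b;

    return dst;
}
```

The fixed-size memcpy calls stand in for the single STRH/STR stores. The widest store in the main loop is 4 bytes, which is the point of my complaint: there is no 8-byte STR, STP, or NEON store anywhere in the routine.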
Why is this function so bad?
I thought "highly optimized libraries" would look much better :(
Note that the compiler and optimization settings of your project are essentially irrelevant when it comes to library functions like memset(): that is all pre-built library code, shipped as static library archives (c_ou.l in your case).