I'm using Arm DS 2021 with Arm Compiler 6.16 and a Cortex-A53 CPU in AArch64 mode.
I wrote a small program that calls memset(), set a high optimization level (-O2; I also tried -O3 and -Omax), and the disassembly shows this:
_memset
    0x00008c14: b4000261 a... CBZ x1,0x8c60 ; _memset + 76
    0x00008c18: 36000060 `..6 TBZ w0,#0,0x8c24 ; _memset + 16
    0x00008c1c: 38001402 ...8 STRB w2,[x0],#1
    0x00008c20: d1000421 !... SUB x1,x1,#1
    0x00008c24: f1000828 (... SUBS x8,x1,#2
    0x00008c28: 54000143 C..T B.CC 0x8c50 ; _memset + 60
    0x00008c2c: 36080060 `..6 TBZ w0,#1,0x8c38 ; _memset + 36
    0x00008c30: 78002402 .$.x STRH w2,[x0],#2
    0x00008c34: aa0803e1 .... MOV x1,x8
    0x00008c38: f100103f ?... CMP x1,#4
    0x00008c3c: 540000a3 ...T B.CC 0x8c50 ; _memset + 60
    0x00008c40: d1001021 !... SUB x1,x1,#4
    0x00008c44: f1000c3f ?... CMP x1,#3
    0x00008c48: b8004402 .D.. STR w2,[x0],#4
    0x00008c4c: 54ffffa8 ...T B.HI 0x8c40 ; _memset + 44
    0x00008c50: 36080041 A..6 TBZ w1,#1,0x8c58 ; _memset + 68
    0x00008c54: 78002402 .$.x STRH w2,[x0],#2
    0x00008c58: 36000041 A..6 TBZ w1,#0,0x8c60 ; _memset + 76
    0x00008c5c: 39000002 ...9 STRB w2,[x0,#0]
    0x00008c60: d65f03c0 .._. RET
__aeabi_memclr4
__aeabi_memclr8
__rt_memclr_w
    0x00008c64: f100103f ?... CMP x1,#4
    0x00008c68: 540000a3 ...T B.CC 0x8c7c ; __aeabi_memclr4 + 24
    0x00008c6c: d1001021 !... SUB x1,x1,#4
    0x00008c70: f1000c3f ?... CMP x1,#3
    0x00008c74: b800441f .D.. STR wzr,[x0],#4
    0x00008c78: 54ffffa8 ...T B.HI 0x8c6c ; __aeabi_memclr4 + 8
    0x00008c7c: 37080061 a..7 TBNZ w1,#1,0x8c88 ; __aeabi_memclr4 + 36
    0x00008c80: 37000081 ...7 TBNZ w1,#0,0x8c90 ; __aeabi_memclr4 + 44
    0x00008c84: d65f03c0 .._. RET
    0x00008c88: 7800241f .$.x STRH wzr,[x0],#2
    0x00008c8c: 3607ffc1 ...6 TBZ w1,#0,0x8c84 ; __aeabi_memclr4 + 32
    0x00008c90: 3900001f ...9 STRB wzr,[x0,#0]
    0x00008c94: d65f03c0 .._. RET
_memset_w
    0x00008c98: f100103f ?... CMP x1,#4
    0x00008c9c: 540000a3 ...T B.CC 0x8cb0 ; _memset_w + 24
    0x00008ca0: d1001021 !... SUB x1,x1,#4
    0x00008ca4: f1000c3f ?... CMP x1,#3
    0x00008ca8: b8004402 .D.. STR w2,[x0],#4
    0x00008cac: 54ffffa8 ...T B.HI 0x8ca0 ; _memset_w + 8
    0x00008cb0: 37080061 a..7 TBNZ w1,#1,0x8cbc ; _memset_w + 36
    0x00008cb4: 37000081 ...7 TBNZ w1,#0,0x8cc4 ; _memset_w + 44
    0x00008cb8: d65f03c0 .._. RET
    0x00008cbc: 78002402 .$.x STRH w2,[x0],#2
    0x00008cc0: 3607ffc1 ...6 TBZ w1,#0,0x8cb8 ; _memset_w + 32
    0x00008cc4: 39000002 ...9 STRB w2,[x0,#0]
    0x00008cc8: d65f03c0 .._. RET
The linker chooses library c_ou.l.
As we can see, the function is barely optimized: it uses at most 32-bit accesses on a 64-bit CPU.
No NEON registers are used.
Why is this function so bad?
I thought "highly optimized libraries" should look way better :(
Hi

My name is Stephen and I work at Arm.

When implementing a library, there is a trade-off to be made between code size, performance and architectural features. For example, do you favor code size over performance? Can unaligned accesses be used? What if the SIMD unit isn't enabled? What if the memset is to device memory?

In an ideal world we would provide multiple implementations for different contexts, as we did with some of the 32-bit Cortex processors. Unfortunately, we've not yet been able to put as much effort into performance tuning of the AArch64 libraries provided with Arm Compiler 6.

Arm's general effort in this area has gone into the Arm optimized routines open source project. These tend to favor high performance over all other concerns. Take a look at github.com/.../memset.S which uses unaligned accesses and Neon.

If you build your own memset() then the linker won't select the built-in library version at link time.
Hope this helps,
Stephen
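As a rough illustration of the last point in the reply above, here is a minimal C sketch of a user-supplied fill routine that uses 64-bit stores for the bulk of the buffer. The name my_memset is hypothetical and chosen only so the sketch cannot clash with the library symbol; to have the linker pick it up in place of the built-in version it would need to be named memset, as Stephen describes. It assumes normal (non-device) memory and is not tuned the way the Arm optimized routines are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical replacement fill routine (not Arm's implementation).
 * Head: byte stores until the pointer is 8-byte aligned.
 * Body: one 64-bit store per iteration.
 * Tail: byte stores for the remaining 0..7 bytes. */
void *my_memset(void *dst, int value, size_t n)
{
    assert(dst != NULL || n == 0);

    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;

    /* Head: align the destination to 8 bytes. */
    while (n > 0 && ((uintptr_t)p & 7u) != 0) {
        *p++ = byte;
        n--;
    }

    /* Replicate the byte into all eight lanes of a 64-bit word. */
    uint64_t pattern = 0x0101010101010101ULL * byte;

    /* Body: a fixed-size 8-byte memcpy avoids aliasing problems;
     * compilers typically lower it to a single 64-bit store. */
    while (n >= 8) {
        memcpy(p, &pattern, sizeof pattern);
        p += 8;
        n -= 8;
    }

    /* Tail: leftover bytes. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }

    return dst;
}
```

Whether this actually beats the library version on a given Cortex-A53 system would need to be measured; the assembly routines in the linked project go further by using unaligned accesses and Neon.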
Hello, Stephen.
From my experience with Arm Compiler 5, I considered it best in class, and I expected version 6 to be no worse in code quality.
My expectations were a little too high, I'm sorry.
Thanks a lot for the link to the routines; they seem great and will be very helpful to me!
Regards,
Vlad
Note that the compiler and optimization settings of your project are essentially irrelevant when it comes to library functions like memset(); that's all library code that is pre-built and provided as .a files.
When calling asm functions from C code, does the compiler automatically save/restore registers according to the calling convention?
For example, the MEMCPY routine from the link above uses registers x0-x17, which is a lot, and none of them are saved on the stack or restored...
Hi again

The compiler will implement calling conventions described in the "Procedure Call Standard for the Arm 64-bit Architecture (AArch64)". See github.com/.../aapcs64.rst

In particular, see section 6.1.1 General-purpose Registers, which shows the roles of the registers.

Stephen
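To make the register roles concrete, the following C sketch encodes the general-purpose register table from the AAPCS64 as a lookup. The enum and function names (reg_role, aapcs64_role, callee_must_preserve) are made up for this illustration and are not part of any Arm tool or header. It shows why a leaf routine that only touches x0-x17 has nothing to save: those registers are all argument, temporary or scratch registers that the caller already treats as clobbered by a call.

```c
#include <assert.h>
#include <stdbool.h>

/* Roles of x0..x30 as described in the AAPCS64 register table.
 * All names here are illustrative only. */
typedef enum {
    ROLE_ARG_RESULT,       /* x0-x7: arguments and return values   */
    ROLE_INDIRECT_RESULT,  /* x8: indirect result location         */
    ROLE_TEMPORARY,        /* x9-x15: caller-saved temporaries     */
    ROLE_IP_SCRATCH,       /* x16, x17: IP0/IP1 scratch registers  */
    ROLE_PLATFORM,         /* x18: platform register               */
    ROLE_CALLEE_SAVED,     /* x19-x28: callee-saved                */
    ROLE_FRAME_POINTER,    /* x29: frame pointer                   */
    ROLE_LINK_REGISTER     /* x30: link register                   */
} reg_role;

/* Map a register number (0..30) to its AAPCS64 role. */
reg_role aapcs64_role(int n)
{
    assert(n >= 0 && n <= 30);
    if (n <= 7)  return ROLE_ARG_RESULT;
    if (n == 8)  return ROLE_INDIRECT_RESULT;
    if (n <= 15) return ROLE_TEMPORARY;
    if (n <= 17) return ROLE_IP_SCRATCH;
    if (n == 18) return ROLE_PLATFORM;
    if (n <= 28) return ROLE_CALLEE_SAVED;
    if (n == 29) return ROLE_FRAME_POINTER;
    return ROLE_LINK_REGISTER;
}

/* True if a called function must hand the register back unchanged
 * (x19-x29). A routine using only x0-x17 therefore saves nothing. */
bool callee_must_preserve(int n)
{
    reg_role r = aapcs64_role(n);
    return r == ROLE_CALLEE_SAVED || r == ROLE_FRAME_POINTER;
}
```

Because the compiler follows the same standard on the calling side, it keeps any value it needs across the call in x19-x28 or on the stack, so the C caller and the assembly routine agree without extra save/restore code.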