Hello,
I am looking for a reliable way to benchmark asm code. More specifically, I want to compare the execution time of the same routine when it is written with the NEON instruction set and when it is written with the SVE2 instruction set.
For example, consider the following NEON code:
function PFX(blockcopy_sp_4x4_neon)
    lsl             x3, x3, #1
.rept 2
    ld1             {v0.8h}, [x2], x3
    ld1             {v1.8h}, [x2], x3
    xtn             v0.8b, v0.8h
    xtn             v1.8b, v1.8h
    st1             {v0.s}[0], [x0], x1
    st1             {v1.s}[0], [x0], x1
.endr
    ret
endfunc
The equivalent code using SVE2 can be:
function PFX(blockcopy_sp_4x4_sve2)
    ptrue           p0.h, vl4
.rept 4
    ld1h            {z0.h}, p0/z, [x2]
    st1b            {z0.h}, p0, [x0]
    add             x2, x2, x3, lsl #1
    add             x0, x0, x1
.endr
    ret
endfunc
How can I reliably determine which code is faster? Personally, I can identify the following ways:
1) Use hyperfine (https://github.com/sharkdp/hyperfine). With this tool I can run the binaries built from the asm code and compare their execution times. However, the results depend heavily on what else the machine is doing at the time: other running processes, the number of context switches, and so on. Moreover, in a big project it is hard to isolate and benchmark only the parts of the code that I am interested in.
2) Calculate the execution latency and throughput, using for example the tables in https://developer.arm.com/documentation/PJDOC-466751330-18256/latest or https://developer.arm.com/documentation/PJDOC-466751330-593177/latest, and compare these KPIs between the different asm versions. However, as far as I can tell this can only be done by hand: write down each instruction with its latency and throughput, add up the total number of cycles, and so on. If that is really the only option, it is a tedious and error-prone procedure.
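To make the isolation problem in 1) concrete, here is a minimal sketch of a harness that times a single routine in-process instead of timing a whole binary. Several things in it are my assumptions: the function signature is inferred from the register usage in the asm above (x0 = dst, x1 = dst stride in bytes, x2 = src, x3 = src stride in 16-bit elements), blockcopy_sp_4x4_c is only a plain C stand-in so the sketch builds on any machine, and the names bench_min_ns and now_ns are made up for illustration. It also relies on the GCC/Clang inline-asm extension for the compiler barrier.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumed signature, inferred from the asm: x0=dst, x1=dst stride (bytes),
   x2=src, x3=src stride (int16_t elements). */
typedef void (*blockcopy_sp_fn)(uint8_t *dst, intptr_t dst_stride,
                                const int16_t *src, intptr_t src_stride);

/* Plain C stand-in for the routine under test (hypothetical name). */
static void blockcopy_sp_4x4_c(uint8_t *dst, intptr_t dst_stride,
                               const int16_t *src, intptr_t src_stride)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            dst[x] = (uint8_t)src[x];   /* keep the low byte, like xtn/st1b */
        dst += dst_stride;
        src += src_stride;
    }
}

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Times `iters` back-to-back calls, repeats that `repeats` times and
   returns the fastest run in ns. Taking the minimum discards runs that
   were disturbed by context switches or other processes. */
static uint64_t bench_min_ns(blockcopy_sp_fn fn, int repeats, int iters)
{
    static int16_t src[4 * 4];
    static uint8_t dst[4 * 4];
    uint64_t best = UINT64_MAX;

    assert(repeats > 0 && iters > 0);
    for (int i = 0; i < 16; i++)
        src[i] = (int16_t)i;

    for (int r = 0; r < repeats; r++) {
        uint64_t t0 = now_ns();
        for (int i = 0; i < iters; i++) {
            fn(dst, 4, src, 4);
            /* Compiler barrier so the call is not optimized away. */
            __asm__ volatile("" ::: "memory");
        }
        uint64_t elapsed = now_ns() - t0;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

The idea would be to declare the NEON and SVE2 symbols as extern with the same signature, pass each one to bench_min_ns with identical repeats and iters, and compare the two minima. Pinning the process to one core (for example with taskset) should reduce the noise further, but it does not remove it.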
Is there any other way? Is there a tool that can do 2) automatically (for example Arm Development Studio)?
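For reference, this is roughly the bookkeeping I mean in 2), written as a small C sketch. The struct and function names are hypothetical, and the latency and throughput numbers in the table are placeholders that I made up for illustration. They are NOT taken from the Arm optimization guides. Only the instruction counts (two ld1, two xtn, two st1 per .rept iteration of the NEON version) come from the code above.

```c
#include <assert.h>
#include <stddef.h>

/* One row of a hand-made cost table (hypothetical layout). */
struct op_cost {
    const char *name;   /* label only */
    int count;          /* occurrences of this instruction in the block */
    int latency;        /* cycles until the result is available */
    double throughput;  /* instructions issued per cycle */
};

/* Cycles needed just to issue every instruction, assuming the different
   instruction kinds do not overlap at all. Real cores overlap them, so
   this is only a crude estimate. */
static double issue_cycles(const struct op_cost *ops, size_t n)
{
    double cycles = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(ops[i].throughput > 0.0);
        cycles += ops[i].count / ops[i].throughput;
    }
    return cycles;
}

/* Latency of one dependency chain that passes through each listed
   instruction kind exactly once (load -> narrow -> store here). */
static int chain_latency(const struct op_cost *ops, size_t n)
{
    int cycles = 0;
    for (size_t i = 0; i < n; i++)
        cycles += ops[i].latency;
    return cycles;
}

/* One .rept iteration of the NEON version.
   PLACEHOLDER latency/throughput values, not real data. */
static const struct op_cost neon_rept_body[] = {
    { "ld1", 2, 4, 2.0 },
    { "xtn", 2, 2, 2.0 },
    { "st1", 2, 1, 1.0 },
};
```

Even for a routine this small, filling in such a table by hand for every instruction and every target core is exactly the part that I find tedious and error-prone.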
Thank you in advance,
Akis