Benchmarking asm code

Hello,

I want a reliable way to benchmark asm code. More specifically, I want to compare the execution time of the same routine when it is written with the NEON instruction set and when it is written with the SVE2 instruction set.

For example, consider the following NEON code:

function PFX(blockcopy_sp_4x4_neon)
    lsl             x3, x3, #1
.rept 2
    ld1             {v0.8h}, [x2], x3
    ld1             {v1.8h}, [x2], x3
    xtn             v0.8b, v0.8h
    xtn             v1.8b, v1.8h
    st1             {v0.s}[0], [x0], x1
    st1             {v1.s}[0], [x0], x1
.endr
    ret
endfunc

The equivalent code using SVE2 can be:

function PFX(blockcopy_sp_4x4_sve2)
    ptrue           p0.h, vl4
.rept 4
    ld1h            {z0.h}, p0/z, [x2]
    st1b            {z0.h}, p0, [x0]
    add             x2, x2, x3, lsl #1
    add             x0, x0, x1
.endr
    ret
endfunc

How can I reliably determine which code is faster? So far I can think of two approaches:

1) Use hyperfine (https://github.com/sharkdp/hyperfine). With this tool, I can run the executables built from the asm code and compare their execution times. However, the results depend heavily on whether the machine is running other processes at the same time, how many context switches take place, and so on. Moreover, in a big project it is hard to isolate and benchmark only the parts of the code that I am interested in.

2) Calculate the execution latency and throughput using, for example, the tables in https://developer.arm.com/documentation/PJDOC-466751330-18256/latest or https://developer.arm.com/documentation/PJDOC-466751330-593177/latest and compare these KPIs between the different asm versions. However, I think this can only be done by hand: write down each instruction with its execution latency and throughput, calculate the total number of cycles, and so on. If it really has to be done by hand, it is a very laborious and error-prone procedure.
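
To make the isolation problem in 1) concrete, here is a minimal sketch of timing a single routine in-process instead of timing a whole executable. It is only an illustration under stated assumptions: the C signature is inferred from the register use in the listings above (x0 = dst, x1 = dst stride, x2 = src, x3 = src stride), `blockcopy_sp_4x4_c` is a hypothetical C stand-in for the asm kernels, and `bench_min_ns` is a made-up helper name. Taking the minimum over several batches is one common way to reduce the influence of other processes and context switches; it does not remove it.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* ASSUMPTION: signature inferred from the register use in the asm listings. */
typedef void (*blockcopy_fn)(uint8_t *dst, intptr_t dst_stride,
                             const int16_t *src, intptr_t src_stride);

/* Hypothetical C stand-in for the NEON/SVE2 kernels (strides in elements). */
static void blockcopy_sp_4x4_c(uint8_t *dst, intptr_t dst_stride,
                               const int16_t *src, intptr_t src_stride)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            dst[x] = (uint8_t)src[x];   /* narrow 16-bit to 8-bit */
        dst += dst_stride;
        src += src_stride;
    }
}

/* Wall-clock time in nanoseconds (C11 timespec_get). On the real target a
   monotonic clock or a cycle counter would probably be a better choice. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Runs `trials` batches of `iters` calls each and returns the smallest
   per-call time in nanoseconds seen in any batch. */
static double bench_min_ns(blockcopy_fn fn, int iters, int trials)
{
    static int16_t src[4 * 4];
    static uint8_t dst[4 * 4];
    volatile uint8_t sink = 0;
    double best = -1.0;

    assert(iters > 0 && trials > 0);
    for (int i = 0; i < 16; i++)
        src[i] = (int16_t)i;

    for (int t = 0; t < trials; t++) {
        uint64_t t0 = now_ns();
        for (int i = 0; i < iters; i++) {
            fn(dst, 4, src, 4);
            sink ^= dst[i & 15];        /* use the result so the work is kept */
        }
        uint64_t t1 = now_ns();
        double per_call = (double)(t1 - t0) / (double)iters;
        if (best < 0.0 || per_call < best)
            best = per_call;
    }
    (void)sink;
    return best;
}
```

With the asm kernels linked in under the same signature, something like `bench_min_ns(kernel, 1000000, 20)` could be called once per kernel and the two results compared. The iteration and trial counts here are arbitrary.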

Is there any other way? Is there a way for 2) to be done automatically by a tool (for example, Arm Development Studio)?
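
To show the kind of bookkeeping that 2) involves, here is a rough sketch of the hand tally written as code. Everything in it is an assumption for illustration: the struct and function names are made up, the two estimates are deliberately crude bounds (they ignore which pipelines the instructions use and which instructions actually depend on each other), and the latency and throughput values in the table are placeholders set to 1.0, not numbers taken from the Arm optimization guides. Only the instruction counts come from the SVE2 listing above (`ret` is left out).

```c
#include <stddef.h>

/* One row of the tally. `latency` is in cycles, `throughput` is in
   instructions per cycle, both to be read from the optimization guide. */
struct insn_cost {
    const char *mnemonic;
    int count;
    double latency;
    double throughput;
};

/* Pessimistic bound: every instruction waits for the previous one. */
static double serial_latency_cycles(const struct insn_cost *t, size_t n)
{
    double cycles = 0.0;
    for (size_t i = 0; i < n; i++)
        cycles += t[i].count * t[i].latency;
    return cycles;
}

/* Crude issue bound: no dependencies at all, each instruction type issues
   at its listed throughput, and all types share a single issue stream. */
static double issue_bound_cycles(const struct insn_cost *t, size_t n)
{
    double cycles = 0.0;
    for (size_t i = 0; i < n; i++)
        cycles += t[i].count / t[i].throughput;
    return cycles;
}

/* Counts taken from the SVE2 listing above.
   PLACEHOLDER latency/throughput values: replace with the guide's numbers
   for the target core before drawing any conclusion. */
static const struct insn_cost sve2_tally[] = {
    { "ptrue", 1, 1.0, 1.0 },
    { "ld1h",  4, 1.0, 1.0 },
    { "st1b",  4, 1.0, 1.0 },
    { "add",   8, 1.0, 1.0 },
};
static const size_t sve2_tally_len = sizeof sve2_tally / sizeof sve2_tally[0];
```

The real cost should lie somewhere between `issue_bound_cycles(sve2_tally, sve2_tally_len)` and `serial_latency_cycles(sve2_tally, sve2_tally_len)`. Getting a tighter figure needs the dependency chains and pipeline assignments, which is exactly the part that is tedious to do by hand.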

Thank you in advance,

Akis