Improve Performance of specific NEON functions using SVE/SVE2

Hello,

I have the following three functions that use the NEON instruction set:

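function pixel_avg2_w8_neon, export=1
1:
    subs    w5, w5, #2                  // w5 = rows left; two rows per iteration
    ld1     {v0.8b}, [x2], x3           // row n of first source, then advance by stride x3
    ld1     {v2.8b}, [x4], x3           // row n of second source
    urhadd  v0.8b, v0.8b, v2.8b         // rounding average: (a + b + 1) >> 1 per byte
    ld1     {v1.8b}, [x2], x3           // row n+1 of first source
    ld1     {v3.8b}, [x4], x3           // row n+1 of second source
    urhadd  v1.8b, v1.8b, v3.8b
    st1     {v0.8b}, [x0], x1           // store row n, then advance by stride x1
    st1     {v1.8b}, [x0], x1           // store row n+1
    b.gt    1b
    ret
endfunc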

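function pixel_avg2_w16_neon, export=1
1:
    subs    w5, w5, #2                  // w5 = rows left; two rows per iteration
    ld1     {v0.16b}, [x2], x3          // row n of first source (16 bytes)
    ld1     {v2.16b}, [x4], x3          // row n of second source
    urhadd  v0.16b, v0.16b, v2.16b      // rounding average: (a + b + 1) >> 1 per byte
    ld1     {v1.16b}, [x2], x3          // row n+1 of first source
    ld1     {v3.16b}, [x4], x3          // row n+1 of second source
    urhadd  v1.16b, v1.16b, v3.16b
    st1     {v0.16b}, [x0], x1          // store row n, then advance by stride x1
    st1     {v1.16b}, [x0], x1          // store row n+1
    b.gt    1b
    ret
endfunc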

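For reference, here is a scalar C sketch of what I understand the two averaging routines above to compute: a per-byte rounding average of two source blocks, written row by row. The register roles are read off the loads and stores (x0/x1 as destination and its stride, x2 and x4 as the two sources sharing stride x3, w5 as the row count). The name `pixel_avg2_ref` and its signature are made up for illustration and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the NEON averaging loops (illustration only).
 * width would be 8 for the w8 variant and 16 for the w16 variant. */
static void pixel_avg2_ref(uint8_t *dst, ptrdiff_t dst_stride,
                           const uint8_t *src1, const uint8_t *src2,
                           ptrdiff_t src_stride, int width, int height)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* Same arithmetic as urhadd: (a + b + 1) >> 1, no overflow in int. */
            dst[x] = (uint8_t)((src1[x] + src2[x] + 1) >> 1);
        }
        dst  += dst_stride;
        src1 += src_stride;   /* both sources share one stride, as x3 does */
        src2 += src_stride;
    }
}
```
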
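function pixel_sad_\h\()_neon, export=1
    ld1     {v1.16b}, [x2], x3          // row 0 of second block (x2, stride x3)
    ld1     {v0.16b}, [x0], x1          // row 0 of first block (x0, stride x1)
    ld1     {v3.16b}, [x2], x3          // row 1 of second block
    ld1     {v2.16b}, [x0], x1          // row 1 of first block
    uabdl   v16.8h, v0.8b, v1.8b        // |a - b| of the low 8 bytes, widened to 16 bits
    uabdl2  v17.8h, v0.16b, v1.16b      // same for the high 8 bytes
    uabal   v16.8h, v2.8b, v3.8b        // accumulate |a - b| of row 1, low half
    uabal2  v17.8h, v2.16b, v3.16b      // accumulate row 1, high half
.rept \h / 2 - 1                        // remaining row pairs
    ld1     {v1.16b}, [x2], x3
    ld1     {v0.16b}, [x0], x1
    ld1     {v3.16b}, [x2], x3
    ld1     {v2.16b}, [x0], x1
    uabal   v16.8h, v0.8b, v1.8b
    uabal2  v17.8h, v0.16b, v1.16b
    uabal   v16.8h, v2.8b, v3.8b
    uabal2  v17.8h, v2.16b, v3.16b
.endr
    add     v16.8h, v16.8h, v17.8h      // combine the two partial-sum vectors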

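As a scalar point of comparison, the SAD routine above accumulates absolute byte differences over rows that are 16 pixels wide. The C sketch below models that under two assumptions: the block width is 16 (inferred from the `.16b` loads), and the part of the function not shown here reduces the lanes to a single total. `pixel_sad_16xh_ref` is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the 16-wide SAD (illustration only).
 * Returns the sum of |pix1 - pix2| over 16 columns and height rows. */
static unsigned pixel_sad_16xh_ref(const uint8_t *pix1, ptrdiff_t stride1,
                                   const uint8_t *pix2, ptrdiff_t stride2,
                                   int height)
{
    assert(height > 0);
    unsigned sum = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix1[x] - pix2[x];          /* promoted to int, may be negative */
            sum += (unsigned)(d < 0 ? -d : d);  /* absolute difference */
        }
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```
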
I want to use the SVE/SVE2 instruction set to improve the performance of these functions. My testbed is an Alibaba Yitian 710 (vector length = 128 bits).

For the first two, I couldn't find a way to improve the performance. For the third, I wrote the following function:

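function pixel_sad_\h\()_sve, export=1
    ptrue   p0.h, vl8                       // enable the first 8 halfword lanes
    ld1b    {z1.h}, p0/z, [x2]              // 8 bytes of second block, row 0, zero-extended to 16 bits
    ld1b    {z4.h}, p0/z, [x2, #1, mul vl]  // next 8 bytes of the same row
    add     x2, x2, x3                      // advance second block by stride x3
    ld1b    {z3.h}, p0/z, [x2]              // second block, row 1, low half
    ld1b    {z6.h}, p0/z, [x2, #1, mul vl]  // second block, row 1, high half
    add     x2, x2, x3
    ld1b    {z0.h}, p0/z, [x0]              // first block, row 0, low half
    ld1b    {z5.h}, p0/z, [x0, #1, mul vl]  // first block, row 0, high half
    add     x0, x0, x1                      // advance first block by stride x1
    ld1b    {z2.h}, p0/z, [x0]              // first block, row 1, low half
    ld1b    {z7.h}, p0/z, [x0, #1, mul vl]  // first block, row 1, high half
    add     x0, x0, x1
    uabd    v16.8h, v0.8h, v1.8h            // NEON ops on the low 128 bits of the z registers
    uabd    v17.8h, v4.8h, v5.8h            // |a - b| of row 0, high half
    uaba    v16.8h, v2.8h, v3.8h            // accumulate |a - b| of row 1, low half
    uaba    v17.8h, v7.8h, v6.8h            // accumulate row 1, high half
.rept \h / 2 - 1                            // remaining row pairs
    ld1b    {z1.h}, p0/z, [x2]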

However, this degrades the performance instead of improving it.

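A rough way to compare the two SAD listings is to count vector loads per 16-pixel row. The sketch below is only back-of-the-envelope arithmetic, assuming the 128-bit vector length mentioned above; `loads_per_row` is a made-up helper, not part of any library. With 8-bit lanes (as in `ld1 {v0.16b}`) a row takes one load, while with bytes widened into 16-bit lanes (as in `ld1b {z1.h}`) it takes two, which matches the eight loads per pair of rows in the SVE version against four in the NEON version.

```c
#include <assert.h>

/* Hypothetical helper (illustration only): number of vector loads needed to
 * fetch one row of row_bytes pixels when each loaded byte occupies one lane
 * of lane_bits bits in a register that is vl_bits wide. */
static int loads_per_row(int vl_bits, int lane_bits, int row_bytes)
{
    assert(vl_bits > 0 && lane_bits > 0 && row_bytes > 0);
    int bytes_per_load = vl_bits / lane_bits;   /* one source byte per lane */
    return (row_bytes + bytes_per_load - 1) / bytes_per_load;  /* round up */
}
```
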
Can someone help me?

Thank you in advance,

Akis