Question about Arm performance related to register allocation

I'm currently testing code on an embedded board equipped with an Arm Cortex-A72 CPU, which is based on the Armv8 architecture. The following code is used to measure performance.

void kernel_func(unsigned char* input_data, unsigned char* output_data)
{
    int stride_size = 1;
    //assert(TEST_SIZE%byte_size == 0);
    for (int i = 0; i < 100000000; i += stride_size)
    {
        output_data[i] = input_data[i];
    }
    return;
}
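
The post does not show how the run time was measured. A minimal sketch of one way to time such a copy loop follows; it assumes `std::chrono::steady_clock` as the timer and keeps the fastest of several runs. The names `copy_bytes` and `time_copy_ms` and the `count` and `repeats` parameters are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Byte-wise copy with the same loop shape as kernel_func, but with the
// element count passed in (hypothetical parameter, not in the original).
void copy_bytes(const unsigned char* input_data, unsigned char* output_data,
                std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        output_data[i] = input_data[i];
    }
}

// Runs the copy `repeats` times and returns the fastest run in milliseconds.
// Assumption: steady_clock is an acceptable timer on the target board.
double time_copy_ms(const std::vector<unsigned char>& in,
                    std::vector<unsigned char>& out, int repeats)
{
    assert(in.size() == out.size());
    double best_ms = -1.0;
    for (int r = 0; r < repeats; ++r)
    {
        auto start = std::chrono::steady_clock::now();
        copy_bytes(in.data(), out.data(), in.size());
        auto stop = std::chrono::steady_clock::now();
        double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (best_ms < 0.0 || ms < best_ms)
        {
            best_ms = ms;
        }
    }
    return best_ms;
}
```

Calling `time_copy_ms` with two vectors of 100000000 bytes would mirror the loop bound used above. Taking the minimum over a few repeats is one way to reduce the influence of cold caches and scheduling noise on a single measurement.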

To test the above code, I wrote three versions of it, as shown below, and measured the performance of each.

void kernel_func_0(unsigned char* __restrict__ input_data, unsigned char* __restrict__ output_data)
{
    int stride_size = 1;
    for (int i = 0; i < 100000000; i += stride_size)
    {
        output_data[i] = input_data[i];
    }
    return;
}
void kernel_func_1(unsigned char* __restrict__ input_data, unsigned char* __restrict__ output_data)
{
    int stride_size = 16;
    for (int i = 0; i < 100000000; i += stride_size)
    {
        output_data[i+0 ] = input_data[i+0 ];
        output_data[i+1 ] = input_data[i+1 ];
        output_data[i+2 ] = input_data[i+2 ];
        output_data[i+3 ] = input_data[i+3 ];
        output_data[i+4 ] = input_data[i+4 ];
        output_data[i+5 ] = input_data[i+5 ];
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

The first version (kernel_func_0) takes around 8 ms, the second (kernel_func_1) 8 ms, and the third (kernel_func_2) around 11 ms.
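
To put these timings in perspective, the sketch below converts a run time into copy throughput and into average time per 16-byte block, using only the loop bound of 100000000 bytes from the code above. The helper names `throughput_gb_per_s` and `ns_per_16_bytes` are hypothetical, and 1 GB is taken as 1e9 bytes.

```cpp
#include <cassert>

// Bytes copied per call, taken from the loop bound in the post.
constexpr double kBytes = 100000000.0;

// Converts a run time in milliseconds to copy throughput in GB/s
// (1 GB = 1e9 bytes, counting only the bytes written).
constexpr double throughput_gb_per_s(double time_ms)
{
    return kBytes / (time_ms * 1e-3) / 1e9;
}

// Average time in nanoseconds spent per 16-byte block.
constexpr double ns_per_16_bytes(double time_ms)
{
    return (time_ms * 1e6) / (kBytes / 16.0);
}
```

With the reported figures this gives roughly 12.5 GB/s and 1.28 ns per 16-byte block for the 8 ms runs, versus roughly 9.09 GB/s and 1.76 ns per block for the 11 ms run, so the gap amounts to about 0.48 ns per 16 bytes copied.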

To identify the reason for the performance difference between the second and third versions, I compiled both to assembly code.

/*
 The assembly code of KERNEL_FUNC_1
*/
    .arch armv8.2-a+crc
    .file "kernel.cpp"
    .text
    .align 2
    .p2align 4,,11
    .global _Z11kernel_funcPhS_
    .type _Z11kernel_funcPhS_, %function
_Z11kernel_funcPhS_:
.LFB4340:
    .cfi_startproc
    mov x3, 57600
    mov x2, 0
    movk x3, 0x5f5, lsl 16
    .p2align 3,,7
.L2:
    ldr q0, [x0, x2]
    str q0, [x1, x2]
    add x2, x2, 16
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

/*
 The assembly code of KERNEL_FUNC_2
*/
    .arch armv8.2-a+crc
    .file "kernel.cpp"
    .text
    .align 2
    .p2align 4,,11
    .global _Z11kernel_funcPhS_
    .type _Z11kernel_funcPhS_, %function
_Z11kernel_funcPhS_:
.LFB4340:
    .cfi_startproc
    mov x3, 57600
    add x5, x0, 16
    add x4, x1, 16
    mov x2, 0
    movk x3, 0x5f5, lsl 16
    .p2align 3,,7
.L2:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Most of the run time of both assembly listings is spent in the loop at the ".L2:" label. However, I believe there should be no performance difference between the two, because if I modify the second assembly code as follows (KERNEL_FUNC_1_MOD), it appears to perform the same operations as the third assembly code (KERNEL_FUNC_2).
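
The two addressing patterns being compared can be sketched in C++ as follows. This is only an illustration of the address arithmetic, not a reconstruction of the elided loop bodies: the second function assumes, from the visible `add x5, x0, 16` and `add x4, x1, 16` in the KERNEL_FUNC_2 prologue, that the loop uses a second pair of base pointers offset by 16 bytes. The names `Chunk`, `load16`, `store16`, `copy_single_offset` and `copy_two_bases` are hypothetical, and a compiler is free to emit different instructions for either function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// 16-byte chunk standing in for one q register.
struct Chunk { unsigned char bytes[16]; };

static Chunk load16(const unsigned char* p)
{
    Chunk c;
    std::memcpy(c.bytes, p, 16);
    return c;
}

static void store16(unsigned char* p, const Chunk& c)
{
    std::memcpy(p, c.bytes, 16);
}

// Pattern in the style of KERNEL_FUNC_1_MOD: one base per buffer, with the
// shared offset advanced by 16 between the two loads.
void copy_single_offset(const unsigned char* in, unsigned char* out,
                        std::size_t count)
{
    assert(count % 32 == 0);
    for (std::size_t off = 0; off < count; )
    {
        Chunk a = load16(in + off);
        std::size_t next = off + 16;   // like "add x2, x2, 16"
        Chunk b = load16(in + next);
        store16(out + off, a);
        store16(out + next, b);
        off = next + 16;
    }
}

// Assumed pattern for KERNEL_FUNC_2: two bases per buffer (base and base+16)
// sharing one offset that advances by 32 per iteration.
void copy_two_bases(const unsigned char* in, unsigned char* out,
                    std::size_t count)
{
    assert(count % 32 == 0);
    const unsigned char* in_hi = in + 16;   // like "add x5, x0, 16"
    unsigned char* out_hi = out + 16;       // like "add x4, x1, 16"
    for (std::size_t off = 0; off < count; off += 32)
    {
        Chunk a = load16(in + off);
        Chunk b = load16(in_hi + off);
        store16(out + off, a);
        store16(out_hi + off, b);
    }
}
```

Both functions touch exactly the same addresses in the same order; they differ only in how each address is formed, which is the difference the question is about.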

/*
 The assembly code of KERNEL_FUNC_1_MOD
*/
    .arch armv8.2-a+crc
    .file "kernel.cpp"
    .text
    .align 2
    .p2align 4,,11
    .global _Z11kernel_funcPhS_
    .type _Z11kernel_funcPhS_, %function
_Z11kernel_funcPhS_:
.LFB4340:
    .cfi_startproc
    mov x3, 57600
    mov x2, 0
    movk x3, 0x5f5, lsl 16
    .p2align 3,,7
.L2:
    ldr q0, [x0, x2]
    add x2, x2, 16
    ldr q1, [x0, x2]
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

When I run the two versions, KERNEL_FUNC_1_MOD takes 8 ms and KERNEL_FUNC_2 takes 11 ms. I find it difficult to understand why. It is hard to see how the run time can differ by about 3 ms solely because of whether the load addresses are held in the same register or not.
