OSv guest encountering EC - "Unknown Reason" sync exception (ESR = 0x2000000) on Raspberry PI 4B host with KVM on

Hi,

I am one of the OSv unikernel developers and have been stuck for a long time trying to figure out the problem described here: github.com/.../1100. I am looking for suggestions on how to debug it further.

In essence, occasionally (in 20-30% of the runs) an OSv guest running on QEMU with KVM on a Raspberry Pi 4B host encounters an "Unknown Reason" (ESR_EL1 = 0x2000000) sync exception when running the same test application (it does not seem to be specific to any given application). It seems that the longer the application runs, the more frequently this exception happens. Here is the state of the registers from the guest gdb debug session, captured when the exception was encountered and after saving the exception frame on the stack (sorted for convenience):

(gdb) info reg
ACTLR_EL1 0x0 0
ACTLR_EL2 0x0 0
ACTLR_EL3 0x0 0
AFSR0_EL1 0x0 0
AFSR1_EL1 0x0 0
AIDR 0x0 0
AMAIR0 0x0 0
CLIDR 0x0 0
CONTEXTIDR_EL1 0x0 0
CPACR 0x300000 3145728
CSSELR 0x2 2
CTR_EL0 0x0 0
DACR32_EL2 0x1de7ec7edbadc0de 2154950976315703518
DBGBCR 0x0 0
DBGBVR 0x0 0
DBGWCR 0x0 0
DBGWVR 0x0 0
ELR_EL1 0x1000000311c0 17592186245568
ESR_EL1 0x2000000 33554432
FAR_EL1 0x1000000321b8 17592186249656

OSv, being a unikernel, runs its kernel and application in the same memory space at EL1. Here is the exception vector table:

.macro vector_entry label idx
/* every entry is at 2^7 bytes distance */
.align 7
b \label
.endm
.global exception_vectors
.type exception_vectors, @function
.align 12
exception_vectors:
/* Current Exception level with SP_EL0 : unused */
vector_entry entry_invalid 0 // Synchronous
vector_entry entry_invalid 1 // IRQ or vIRQ
vector_entry entry_invalid 2 // FIQ or vFIQ
vector_entry entry_invalid 3 // SError or vSError
/* Current Exception level with SP_ELx : only actually used */
vector_entry entry_sync 4
vector_entry entry_irq 5
vector_entry entry_fiq 6
vector_entry entry_serror 7
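As a quick sanity check on the table layout, the slot offsets can be computed from the `.align 7` spacing. This is a minimal sketch, not OSv code: `vector_offset` and `VECTOR_SLOT_BYTES` are names made up for illustration, and only a few of the entries from the listing are shown.

```python
# Each vector slot is 2**7 = 128 (0x80) bytes, matching ".align 7" in the listing.
VECTOR_SLOT_BYTES = 1 << 7

def vector_offset(idx: int) -> int:
    """Byte offset of vector entry `idx` from the table base (VBAR_EL1)."""
    if not 0 <= idx < 16:
        raise ValueError("an AArch64 vector table has 16 entries (0..15)")
    return idx * VECTOR_SLOT_BYTES

# A few of the entries from the listing: index -> label
entries = {0: "entry_invalid", 4: "entry_sync", 5: "entry_irq",
           6: "entry_fiq", 7: "entry_serror"}

for idx, label in entries.items():
    print(f"{label:14s} idx={idx} offset={vector_offset(idx):#05x}")
```

With this spacing, `entry_sync` (index 4) sits at offset 0x200, which is the slot for synchronous exceptions taken from the current exception level with SP_ELx.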

The most difficult issue is that I do not know why we occasionally encounter this exception. The ARMv8 documentation gives many possible reasons (developer.arm.com/.../esr_el1), but none of the ones I tried to investigate seems to match what I am seeing.
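For reference, here is how the reported syndrome value breaks down into its fields. This is a small sketch using the architectural ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0); the helper name `decode_esr` is hypothetical and not part of OSv.

```python
def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into EC (bits 31:26), IL (bit 25) and ISS (bits 24:0)."""
    return {
        "EC": (esr >> 26) & 0x3F,   # exception class
        "IL": (esr >> 25) & 0x1,    # instruction length bit
        "ISS": esr & 0x1FFFFFF,     # instruction-specific syndrome
    }

fields = decode_esr(0x2000000)
print(fields)  # {'EC': 0, 'IL': 1, 'ISS': 0}
```

So 0x2000000 is just the IL bit: EC is 0 (the "Unknown reason" class) and ISS is 0, which means the syndrome itself carries no further detail about the cause.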

One possible reason is an unallocated instruction or an instruction accessing a disallowed system register (such as MSR or MRS), but that does not seem to be the case. When the application hits the Unknown Reason exception, it does so at various places in the code, and there does not seem to be anything consistent about it, except that it is always in the app code and never in the kernel (see the addresses above 0x0000100000000000). Here are some examples of instructions at ELR_EL1 (ESR_EL1 is always 0x2000000):

Address, Instruction
---------------------------------------
0x0000100000031000 d63f0020 blr x1
0x0000100000031140 f9404c01 ldr x1, [x0, #152]
0x00001000000311c0 94000182 bl 17c8 <*ABS*@plt+0x7b8>
0x0000100000031600 b4000040 cbz x0, 1608 <*ABS*@plt+0x5f8>
0x0000100000031640 f00000e0 adrp x0, 20000 <*ABS*@plt+0x1eff0>
0x0000100000031200 2a1503e0 mov w0, w21

As you can see, some instructions involve memory access but some do not (last example), and they all seem to be valid instructions.
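One thing that can be checked mechanically on the six sample addresses above is their alignment. The sketch below assumes a 64-byte line size, which is the cache line length on Cortex-A72; `ELR_SAMPLES` and `CACHE_LINE_BYTES` are illustrative names. Whether the alignment means anything is not established here; it is only a property of these samples.

```python
# ELR_EL1 values copied from the table above.
ELR_SAMPLES = [
    0x0000100000031000,
    0x0000100000031140,
    0x00001000000311C0,
    0x0000100000031600,
    0x0000100000031640,
    0x0000100000031200,
]

CACHE_LINE_BYTES = 64  # assumption: Cortex-A72 cache line length

for addr in ELR_SAMPLES:
    print(f"{addr:#018x}  offset within 64-byte line: {addr % CACHE_LINE_BYTES}")

all_aligned = all(addr % CACHE_LINE_BYTES == 0 for addr in ELR_SAMPLES)
print("all samples 64-byte aligned:", all_aligned)
```

All six samples are multiples of 64 (but not all of 128), so comparing further captures against the same check may be worthwhile.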

The existing OSv sync exception handler looks like this:

121 .global entry_invalid_from_sync
122 .type entry_invalid_from_sync, @function
123 entry_invalid_from_sync:
124 mrs x20, elr_el1 // Exception Link Register -> X20
125 mrs x21, spsr_el1 // Saved PSTATE -> X21
126 mrs x22, esr_el1 // Exception Syndrome Register -> X22
127
128 ubfm x23, x22, #ESR_EC_BEG, #ESR_EC_END // Exception Class -> X23
129 ubfm x24, x22, #ESR_ISS_BEG, #ESR_ISS_END // Instruction-Specific Syndrome -> X24
130
131 1: wfi
132 b 1b
133
134 .global entry_sync
135 .type entry_sync, @function
136 entry_sync:
137 push_state_to_exception_frame
138 mrs x1, esr_el1
139 ubfm x2, x1, #ESR_EC_BEG, #ESR_EC_END // Exception Class -> X2
140 ubfm x3, x1, #ESR_FLT_BEG, #ESR_FLT_END // FLT -> X3
141 cmp x2, #ESR_EC_DATA_ABORT

which makes OSv hang when it encounters the Unknown Reason exception (see lines 131 and 132).

Now, I discovered that if I change the entry_invalid_from_sync routine to ignore the exception, pop the frame, and let processing continue (as in the modified code below), the app and kernel continue and trigger the exact same exception at the exact same address (same ELR_EL1 value) over and over again (~5,000 to ~20,000,000 times) UNTIL it eventually stops happening, the app completes successfully, and OSv terminates. As I wrote at the very beginning of this post, the same app never encounters the EC=0 exception in 70-80% of the runs.

.global entry_invalid_from_sync
.type entry_invalid_from_sync, @function
entry_invalid_from_sync:
mov x0, sp // save exception_frame to x0
bl entry_sync_invalid_handler
pop_state_from_exception_frame
eret


Any ideas on what could be wrong or on how to debug this problem further? I cannot see anything interesting in the host kern.log or in the QEMU log (qemu.log).

Please note that I have never been able to reproduce this issue with the same code (kernel and app) in emulated mode (TCG), so it seems to happen only with KVM acceleration.

Below is some information about specifics of the host, qemu, etc:

The host is a Raspberry Pi 4B with 4GB of RAM, running Ubuntu 20.04.1 and booting from SSD:

qemu-system-aarch64 --version
QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.8)
Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
uname -a
Linux ubuntu 5.4.0-1022-raspi #25-Ubuntu SMP PREEMPT Thu Oct 15 13:31:49 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000


Here is also the OSv guest memory layout of the kernel, application, and app stacks just in case.

0x0000000040090000 0x00000000407ac000 [KERNEL]
0x0000100000000000 0x0000100000001000 [4.0 kB] flags=fmF perm=rx offset=0x00000000 path=/libvdso.so
0x000010000001f000 0x0000100000020000 [4.0 kB] flags=fmF perm=r offset=0x0000f000 path=/libvdso.so
0x0000100000020000 0x0000100000021000 [4.0 kB] flags=fmF perm=rw offset=0x00010000 path=/libvdso.so
0x0000100000030000 0x0000100000033000 [12.0 kB] flags=fmF perm=rx offset=0x00000000 path=/tests/tst-tls.so
0x000010000004f000 0x0000100000050000 [4.0 kB] flags=fmF perm=r offset=0x0000f000 path=/tests/tst-tls.so
0x0000100000050000 0x0000100000051000 [4.0 kB] flags=fmF perm=rw offset=0x00010000 path=/tests/tst-tls.so
0x0000100000060000 0x0000100000061000 [4.0 kB] flags=fmF perm=rx offset=0x00000000 path=/tests/libtls.so
0x000010000007f000 0x0000100000080000 [4.0 kB] flags=fmF perm=r offset=0x0000f000 path=/tests/libtls.so
0x0000100000080000 0x0000100000081000 [4.0 kB] flags=fmF perm=rw offset=0x00010000 path=/tests/libtls.so
0x0000100000090000 0x00001000000a3000 [76.0 kB] flags=fmF perm=rx offset=0x00000000 path=/usr/lib/libgcc_s.so.1
0x00001000000bf000 0x00001000000c0000 [4.0 kB] flags=fmF perm=r offset=0x0001f000 path=/usr/lib/libgcc_s.so.1
0x00001000000c0000 0x00001000000c1000 [4.0 kB] flags=fmF perm=rw offset=0x00020000 path=/usr/lib/libgcc_s.so.1
0x0000200000000000 0x0000200000001000 [4.0 kB] flags=p perm=none
0x0000200000001000 0x0000200000002000 [4.0 kB] flags=p perm=none
0x0000200000002000 0x0000200000101000 [1020.0 kB] flags=p perm=rw
0x0000200000101000 0x0000200000102000 [4.0 kB] flags=p perm=none
0x0000200000102000 0x0000200000201000 [1020.0 kB] flags=p perm=rw
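To relate a faulting ELR_EL1 value to this layout, the address can be looked up against the executable mappings. This is a minimal sketch with the four `perm=rx` ranges copied from the listing above; `RX_MAPPINGS` and `find_mapping` are hypothetical names, not OSv or gdb APIs.

```python
# (start, end, file offset, path) for the perm=rx mappings in the listing.
RX_MAPPINGS = [
    (0x0000100000000000, 0x0000100000001000, 0x0, "/libvdso.so"),
    (0x0000100000030000, 0x0000100000033000, 0x0, "/tests/tst-tls.so"),
    (0x0000100000060000, 0x0000100000061000, 0x0, "/tests/libtls.so"),
    (0x0000100000090000, 0x00001000000A3000, 0x0, "/usr/lib/libgcc_s.so.1"),
]

def find_mapping(addr: int):
    """Return (path, offset into the file) for addr, or None if no rx mapping contains it."""
    for start, end, file_offset, path in RX_MAPPINGS:
        if start <= addr < end:
            return path, addr - start + file_offset
    return None

path, offset = find_mapping(0x1000000311C0)  # ELR_EL1 from the register dump
print(path, hex(offset))  # /tests/tst-tls.so 0x11c0
```

The ELR_EL1 value from the register dump resolves to offset 0x11c0 in /tests/tst-tls.so, and the other sample addresses from the instruction table fall in the same mapping.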

My regards,
Waldemar Kozaczuk

PS. I have posted a similar question on QEMU ARM forum, but since then I have verified I can reproduce exact same issue when running OSv on Firecracker hypervisor (https://firecracker-microvm.github.io/) on the same host. Firecracker uses KVM acceleration exclusively. This makes me think that this issue probably is not caused by a bug in QEMU.
