This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

OSv guest encountering EC - "Unknown Reason" sync exception (ESR = 0x2000000) on Raspberry PI 4B host with KVM on

Hi,

I am one of the OSv unikernel developers and I have been stuck trying to figure out the problem described here - github.com/.../1100 - for a long time now to no avail. And I am looking for any more suggestions on how to debug it further.

In essence, occasionally (20-30% of the time) OSv guest running on QEMU with KVM on Raspberry PI 4B host encounters "Unknown Reason" (ESR_EL1 = 0x2000000) sync exception when running the same test application (does not seem to be specific to any given application). It seems that the longer the application runs the more frequently this exception happens. Here is the state of the registers from the guest gdb debug session captured when exception encountered and after saving exception frame on the stack (sorted for convenience):

OSv, being unikernel runs its kernel and application in the same memory space at EL1. And here is the exceptions vectors table:

The most difficult issue is that I do not know why we occasionally encounter this exception. The ARMV8 documentation gives many reasons (developer.arm.com/.../esr_el1 but none of the ones I tried to investigate seems to match what I am seeing.

One of the reasons could be unallocated instruction or instruction using not-allowed system register (like MSR, MRS) but that does not seem to be the case. The same application seems to encounter the Uknown Reason exception (if it does) at various places in the code and there does not seem to be anything consistent about it, except it is always the app code and never the kernel part (see addresses above 0x0000100000000000). Here are some examples of instructions per ELR_EL1 (ESR_EL1 is always 0x2000000):

As you can see some instructions involve memory access but some do not (last example) and they all seem to be valid instructions.

The existing OSv sync exception handler looks like that:


which makes OSv hang when encountering the Uknown Reason exception (see line 131,132).

Now I discovered that if I change the entry_invalid_from_sync routine to ignore the exception, pop the frame and let processing continue (like in the modified code below), the app and kernel continues and triggers the exact same exception again in the exact same address (same ELR_EL1 value) over and over again (~5,000 to ~20,000,000 times) UNTIL it eventually ceases to happen and the app successfully completes and OSv terminates. As I wrote at the very beginning of this email, the same app never encounters the EC=0 exception in 70-80% of the runs.


Any ideas on what could be wrong or on how to further debug this problem? I can not see anything interesting in the host kern.log nor in QEMU log (qemu.log).

Please note that I have never been able to reproduce this issue with the same code (kernel and app) in the emulated mode (TCG) so it always seems to happen with KVM acceleration only.

Below is some information about specifics of the host, qemu, etc:

The host is Ubuntu 20.04.1 running Raspberry PI 4B with 4GB of RAM booting from SSD:


Here is also the OSv guest memory layout of the kernel, application, and app stacks just in case.

My regards,
Waldemar Kozaczuk

PS. I have posted a similar question on QEMU ARM forum, but since then I have verified I can reproduce exact same issue when running OSv on Firecracker hypervisor (https://firecracker-microvm.github.io/) on the same host. Firecracker uses KVM acceleration exclusively. This makes me think that this issue probably is not caused by a bug in QEMU.

0