Hi,
I am one of the OSv unikernel developers and I have been stuck trying to figure out the problem described here - github.com/.../1100 - for a long time now to no avail. And I am looking for any more suggestions on how to debug it further.
In essence, occasionally (20-30% of the time) OSv guest running on QEMU with KVM on Raspberry PI 4B host encounters "Unknown Reason" (ESR_EL1 = 0x2000000) sync exception when running the same test application (does not seem to be specific to any given application). It seems that the longer the application runs the more frequently this exception happens. Here is the state of the registers from the guest gdb debug session captured when exception encountered and after saving exception frame on the stack (sorted for convenience):
(gdb) info reg ACTLR_EL1 0x0 0 ACTLR_EL2 0x0 0 ACTLR_EL3 0x0 0 AFSR0_EL1 0x0 0 AFSR1_EL1 0x0 0 AIDR 0x0 0 AMAIR0 0x0 0 CLIDR 0x0 0 CONTEXTIDR_EL1 0x0 0 CPACR 0x300000 3145728 CSSELR 0x2 2 CTR_EL0 0x0 0 DACR32_EL2 0x1de7ec7edbadc0de 2154950976315703518 DBGBCR 0x0 0 DBGBVR 0x0 0 DBGWCR 0x0 0 DBGWVR 0x0 0 ELR_EL1 0x1000000311c0 17592186245568 ESR_EL1 0x2000000 33554432 FAR_EL1 0x1000000321b8 17592186249656 FPCR 0x0 0 FPEXC32_EL2 0x700 1792 FPSR 0x14 20 ID_AA64AFR0_EL1 0x0 0 ID_AA64AFR1_EL1 0x0 0 ID_AA64AFR2_EL1_RESERVED 0x0 0 ID_AA64AFR3_EL1_RESERVED 0x0 0 ID_AA64DFR0_EL1 0x0 0 ID_AA64DFR1_EL1 0x0 0 ID_AA64DFR2_EL1_RESERVED 0x0 0 ID_AA64DFR3_EL1_RESERVED 0x0 0 ID_AA64ISAR0_EL1 0x10000 65536 ID_AA64ISAR1_EL1 0x0 0 ID_AA64ISAR2_EL1_RESERVED 0x0 0 ID_AA64ISAR3_EL1_RESERVED 0x0 0 ID_AA64ISAR4_EL1_RESERVED 0x0 0 ID_AA64ISAR5_EL1_RESERVED 0x0 0 ID_AA64ISAR6_EL1_RESERVED 0x0 0 ID_AA64ISAR7_EL1_RESERVED 0x0 0 ID_AA64MMFR0_EL1 0x1124 4388 ID_AA64MMFR1_EL1 0x0 0 ID_AA64MMFR2_EL1_RESERVED 0x0 0 ID_AA64MMFR3_EL1_RESERVED 0x0 0 ID_AA64MMFR4_EL1_RESERVED 0x0 0 ID_AA64MMFR5_EL1_RESERVED 0x0 0 ID_AA64MMFR6_EL1_RESERVED 0x0 0 ID_AA64MMFR7_EL1_RESERVED 0x0 0 ID_AA64PFR1_EL1 0x0 0 ID_AA64PFR2_EL1_RESERVED 0x0 0 ID_AA64PFR3_EL1_RESERVED 0x0 0 ID_AA64PFR5_EL1_RESERVED 0x0 0 ID_AA64PFR6_EL1_RESERVED 0x0 0 ID_AA64PFR7_EL1_RESERVED 0x0 0 ID_AA64ZFR0_EL1 0x0 0 ID_AFR0 0x0 0 ID_DFR0 0x0 0 ID_ISAR0 0x2101110 34607376 ID_ISAR1 0x13112111 319889681 ID_ISAR2 0x21232042 555950146 ID_ISAR3 0x1112131 17899825 ID_ISAR4 0x11142 69954 ID_ISAR5 0x10001 65537 ID_ISAR6 0x0 0 ID_MMFR0 0x0 0 ID_MMFR1 0x0 0 ID_MMFR2 0x0 0 ID_MMFR3 0x0 0 ID_MMFR4 0x0 0 ID_PFR0 0x0 0 IFSR32_EL2 0x1de7ec7edbadc0de 2154950976315703518 MAIR_EL1 0xff440c0400 1096358298624 MAIR_EL3 0x0 0 MDCCSR_EL0 0x0 0 MDCR_EL3 0x0 0 MDRAR_EL1 0x0 0 MDSCR_EL1 0x0 0 MVFR0_EL1 0x10110222 269550114 MVFR1_EL1 0x12111111 303108369 MVFR2_EL1 0x43 67 MVFR3_EL1_RESERVED 0x0 0 MVFR4_EL1_RESERVED 0x0 0 MVFR5_EL1_RESERVED 0x0 0 MVFR6_EL1_RESERVED 0x0 0 MVFR7_EL1_RESERVED 0x0 0 OSLSR_EL1 0xa 10 PAR_EL1 0x1d000c7edbadc0de 2089683966090920158 PMCCFILTR_EL0 0x0 0 PMCCNTR_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMCEID0_EL0 0x20001 131073 PMCEID1_EL0 0x0 0 PMCNTENCLR_EL0 0x5badc0c0 1538113728 PMCNTENSET_EL0 0x5badc0c0 1538113728 PMCR_EL0 0x4102302c 1090662444 PMEVCNTR0_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMEVCNTR1_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMEVCNTR2_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMEVCNTR3_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMEVTYPER0_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMEVTYPER1_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMEVTYPER2_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMEVTYPER3_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMINTENCLR_EL1 0x5badc0c0 1538113728 PMINTENSET_EL1 0x5badc0c0 1538113728 PMOVSCLR_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMOVSSET_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMSELR_EL0 0x1de7ec7edbadc0de 2154950976315703518 PMUSERENR_EL0 0x0 0 REVIDR_EL1 0x0 0 RVBAR_EL1 0x0 0 SCTLR 0x30d0199d 818944413 SPSR_ABT 0x0 0 SPSR_EL1 0x80000345 2147484485 SPSR_FIQ 0x0 0 SPSR_IRQ 0x0 0 SPSR_UND 0x0 0 SP_EL0 0x0 0 SP_EL1 0x200000200d60 35184374189408 TCR_EL1 0x15b5103510 93232051472 TPIDRRO_EL0 0x1de7ec7edbadc0de 2154950976315703518 TPIDR_EL0 0xffff900040f78000 -123144212348928 TPIDR_EL1 0x1de7ec7edbadc0de 2154950976315703518 TTBR0_EL1 0x407bf000 1081864192 TTBR1_EL1 0x407b5000 1081823232 VBAR 0x40207000 1075867648 cpsr 0x800003c5 -2147482683 fpcr 0x0 0 fpsr 0x14 20 pc 0x402077a0 0x402077a0 <entry_invalid_from_sync> sp 0x200000200d60 0x200000200d60 x0 0x1 1 x1 0x2000000 33554432 x10 0xffffa00040ca9e00 -105552029245952 x11 0xffffa00040e1a600 -105552027736576 x12 0x100000032048 17592186249288 x13 0x1 1 x14 0x4100000000 279172874240 x15 0x0 0 x16 0x100000050000 17592186372096 x17 0x402b0240 1076560448 x18 0x0 0 x19 0x200000200ed0 35184374189776 x2 0x0 0 x20 0xffff900040f78000 -123144212348928 x21 0x1 1 x22 0x1 1 x23 0xffff900040a2f090 -123144217890672 x24 0x12ca90abac 80707890092 x25 0x0 0 x26 0xffffa00040f79200 -105552026299904 x27 0xffff8000407cd739 -140736406431943 x28 0xffffa00040f79278 -105552026299784 x29 0x200000200e80 35184374189696 x3 0x0 0 x30 0x1000000311b8 17592186245560 x4 0x1000000321e2 17592186249698 x5 0x200000200ee2 35184374189794 x6 0x76 118 x7 0x34 52 x8 0x34 52 x9 0x1 1
OSv, being unikernel runs its kernel and application in the same memory space at EL1. And here is the exceptions vectors table:
.macro vector_entry label idx /* every entry is at 2^7 bits distance */ .align 7 b \label .endm .global exception_vectors .type exception_vectors, @function .align 12 exception_vectors: /* Current Exception level with SP_EL0 : unused */ vector_entry entry_invalid 0 // Synchronous vector_entry entry_invalid 1 // IRQ or vIRQ vector_entry entry_invalid 2 // FIQ or vFIQ vector_entry entry_invalid 3 // SError or vSError /* Current Exception level with SP_ELx : only actually used */ vector_entry entry_sync 4 vector_entry entry_irq 5 vector_entry entry_fiq 6 vector_entry entry_serror 7 /* Lower Exception level in AArch64 : unused since we don't go to EL0 */ vector_entry entry_invalid 8 vector_entry entry_invalid 9 vector_entry entry_invalid 10 vector_entry entry_invalid 11 /* Lower Exception level in AArch32 : no El0, no AArch32 */ vector_entry entry_invalid 12 vector_entry entry_invalid 13 vector_entry entry_invalid 14 vector_entry entry_invalid 15
The most difficult issue is that I do not know why we occasionally encounter this exception. The ARMV8 documentation gives many reasons (developer.arm.com/.../esr_el1 but none of the ones I tried to investigate seems to match what I am seeing.
One of the reasons could be unallocated instruction or instruction using not-allowed system register (like MSR, MRS) but that does not seem to be the case. The same application seems to encounter the Uknown Reason exception (if it does) at various places in the code and there does not seem to be anything consistent about it, except it is always the app code and never the kernel part (see addresses above 0x0000100000000000). Here are some examples of instructions per ELR_EL1 (ESR_EL1 is always 0x2000000):
Address, Instruction --------------------------------------- 0x0000100000031000 d63f0020 blr x1 0x0000100000031140 f9404c01 ldr x1, [x0, #152] 0x00001000000311c0 94000182 bl 17c8 <*ABS*@plt+0x7b8> 0x0000100000031600 b4000040 cbz x0, 1608 <*ABS*@plt+0x5f8> 0x0000100000031640 f00000e0 adrp x0, 20000 <*ABS*@plt+0x1eff0> 0x0000100000031200 2a1503e0 mov w0, w21
As you can see some instructions involve memory access but some do not (last example) and they all seem to be valid instructions.
The existing OSv sync exception handler looks like that:
121 .global entry_invalid_from_sync 122 .type entry_invalid_from_sync, @function 123 entry_invalid_from_sync: 124 mrs x20, elr_el1 // Exception Link Register -> X20 125 mrs x21, spsr_el1 // Saved PSTATE -> X21 126 mrs x22, esr_el1 // Exception Syndrome Register -> X22 127 128 ubfm x23, x22, #ESR_EC_BEG, #ESR_EC_END // Exception Class -> X23 129 ubfm x24, x22, #ESR_ISS_BEG, #ESR_ISS_END // Instruction-Specific Syndrome -> X24 130 131 1: wfi 132 b 1b 133 134 .global entry_sync 135 .type entry_sync, @function 136 entry_sync: 137 push_state_to_exception_frame 138 mrs x1, esr_el1 139 ubfm x2, x1, #ESR_EC_BEG, #ESR_EC_END // Exception Class -> X2 140 ubfm x3, x1, #ESR_FLT_BEG, #ESR_FLT_END // FLT -> X3 141 cmp x2, #ESR_EC_DATA_ABORT 142 b.eq handle_mem_abort 143 cmp x2, #ESR_EC_INSN_ABORT 144 b.ne entry_invalid_from_sync 145 handle_mem_abort: 146 cbz x3, entry_invalid_from_mem_abort 147 cmp x3, #3 148 b.hi entry_invalid_from_mem_abort 149 150 mov x0, sp // save exception_frame to x0 151 bl page_fault 152 pop_state_from_exception_frame 153 eret
Now I discovered that if I change the entry_invalid_from_sync routine to ignore the exception, pop the frame and let processing continue (like in the modified code below), the app and kernel continues and triggers the exact same exception again in the exact same address (same ELR_EL1 value) over and over again (~5,000 to ~20,000,000 times) UNTIL it eventually ceases to happen and the app successfully completes and OSv terminates. As I wrote at the very beginning of this email, the same app never encounters the EC=0 exception in 70-80% of the runs.
121 .global entry_invalid_from_sync 122 .type entry_invalid_from_sync, @function 123 entry_invalid_from_sync: 124 mov x0, sp // save exception_frame to x0 125 bl entry_sync_invalid_handler 126 pop_state_from_exception_frame 127 eret
Any ideas on what could be wrong or on how to further debug this problem? I can not see anything interesting in the host kern.log nor in QEMU log (qemu.log).
Please note that I have never been able to reproduce this issue with the same code (kernel and app) in the emulated mode (TCG) so it always seems to happen with KVM acceleration only.
Below is some information about specifics of the host, qemu, etc:
The host is Ubuntu 20.04.1 running Raspberry PI 4B with 4GB of RAM booting from SSD:
qemu-system-aarch64 --version QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.8) Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers uname -a Linux ubuntu 5.4.0-1022-raspi #25-Ubuntu SMP PREEMPT Thu Oct 15 13:31:49 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux lscpu Architecture: aarch64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 1 Vendor ID: ARM Model: 3 Model name: Cortex-A72 Stepping: r0p3 CPU max MHz: 1500.0000 CPU min MHz: 600.0000 BogoMIPS: 108.00 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Vulnerable Vulnerability Spectre v1: Mitigation; __user pointer sanitization Vulnerability Spectre v2: Vulnerable Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fp asimd evtstrm crc32 cpuid The qemu command line: qemu-system-aarch64 \ -m 128M \ -smp 1 \ --nographic \ -gdb tcp::1234,server,nowait \ -kernel /home/wkozaczuk/projects/osv/build/last/loader.img \ -append "--disable_rofs_cache /tests/tst-tls.so" \ -machine virt \ -device virtio-blk-pci,id=blk0,drive=hd0,scsi=off \ -drive file=/home/wkozaczuk/projects/osv/build/release/tests_disk.img,if=none,id=hd0,cache=none,aio=native \ -netdev user,id=un0,net=192.168.122.0/24,host=192.168.122.1 \ -device virtio-net-pci,netdev=un0 \ -device virtio-rng-pci \ -enable-kvm \ -cpu host \ -d all \ -D ./qemu.log
Here is also the OSv guest memory layout of the kernel, application, and app stacks just in case.
0x0000000040090000 0x00000000407ac000 [KERNEL] 0x0000100000000000 0x0000100000001000 [4.0 kB] flags=fmF perm=rx offset=0x00000000 path=/libvdso.so 0x000010000001f000 0x0000100000020000 [4.0 kB] flags=fmF perm=r offset=0x0000f000 path=/libvdso.so 0x0000100000020000 0x0000100000021000 [4.0 kB] flags=fmF perm=rw offset=0x00010000 path=/libvdso.so 0x0000100000030000 0x0000100000033000 [12.0 kB] flags=fmF perm=rx offset=0x00000000 path=/tests/tst-tls.so 0x000010000004f000 0x0000100000050000 [4.0 kB] flags=fmF perm=r offset=0x0000f000 path=/tests/tst-tls.so 0x0000100000050000 0x0000100000051000 [4.0 kB] flags=fmF perm=rw offset=0x00010000 path=/tests/tst-tls.so 0x0000100000060000 0x0000100000061000 [4.0 kB] flags=fmF perm=rx offset=0x00000000 path=/tests/libtls.so 0x000010000007f000 0x0000100000080000 [4.0 kB] flags=fmF perm=r offset=0x0000f000 path=/tests/libtls.so 0x0000100000080000 0x0000100000081000 [4.0 kB] flags=fmF perm=rw offset=0x00010000 path=/tests/libtls.so 0x0000100000090000 0x00001000000a3000 [76.0 kB] flags=fmF perm=rx offset=0x00000000 path=/usr/lib/libgcc_s.so.1 0x00001000000bf000 0x00001000000c0000 [4.0 kB] flags=fmF perm=r offset=0x0001f000 path=/usr/lib/libgcc_s.so.1 0x00001000000c0000 0x00001000000c1000 [4.0 kB] flags=fmF perm=rw offset=0x00020000 path=/usr/lib/libgcc_s.so.1 0x0000200000000000 0x0000200000001000 [4.0 kB] flags=p perm=none 0x0000200000001000 0x0000200000002000 [4.0 kB] flags=p perm=none 0x0000200000002000 0x0000200000101000 [1020.0 kB] flags=p perm=rw 0x0000200000101000 0x0000200000102000 [4.0 kB] flags=p perm=none 0x0000200000102000 0x0000200000201000 [1020.0 kB] flags=p perm=rw
My regards,Waldemar Kozaczuk
PS. I have posted a similar question on QEMU ARM forum, but since then I have verified I can reproduce exact same issue when running OSv on Firecracker hypervisor (https://firecracker-microvm.github.io/) on the same host. Firecracker uses KVM acceleration exclusively. This makes me think that this issue probably is not caused by a bug in QEMU.
Any time you modify memory and the caches can't see the modification you should invalidate the region. Or flush the data cache. The snooping does not happen between instruction and data cache.The GOT is data and the core uses data accesses, so no flushing needed.
Thanks for your reply!
Could you please clarify a bit your 1st sentence? Is it strictly about memory where code is loaded from executed or data as well? And when you say caches, you mean instruction caches? I am also confused by the second sentence - why data cache, did you mean instruction cache?
In the last sentence, you mention "data accesses". Why flushing of data caches where GOT data might be present is not required when coded accessing this data is being executed?
Could you also please confirm this statement/question I made:
"Also generally, is the "self-modifying code" scenario more generic than it sounds and includes any scenario when any code is loaded and mapped by other code even the very 1st time and not necessarily re-loaded (aka modified) like in the case of JIT-like scenarios? Does it mean that effectively dynamic linker falls into the category of "self-modifying code"?"
Thanks,
Waldek