Dear all,
I am interested in a scenario where I want to host two guest OSes on top of a bare-metal hypervisor on an ARM mobile platform. The total memory available on the platform is 4 GB, and I want to expose 2 GB of contiguous RAM exclusively to each guest OS. Could you please guide me through my two concerns below:
1- If I change the FDT (device tree) of each guest OS to describe only 2 GB of contiguous memory (see the device-tree sketch after these questions), can I be assured that the guest kernel will only access these 2 GB, and further that it is not even aware of the existence of the other 2 GB of memory on the platform?
2- I would prefer that each guest OS manage its memory directly, without intervention from the host hypervisor (in other words, the guest physical addresses reflect the actual machine physical addresses). In such a scenario, can I take a Xen-on-ARM-like hypervisor and simply disable the second stage translation in the Xen hypervisor code? Would that work, or is there actually a better way to do it? Please share your experience.
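For reference, here is the kind of memory node I intend to put in each guest's device tree; a minimal sketch where the base addresses (0x80000000 for guest 0, 0x100000000 for guest 1) are just placeholders for illustration, not my platform's actual memory map:

    /* Guest 0's device tree: 2 GB of contiguous RAM
     * (assuming #address-cells = <2> and #size-cells = <2>) */
    memory@80000000 {
        device_type = "memory";
        reg = <0x0 0x80000000 0x0 0x80000000>;   /* base 0x80000000, size 2 GB */
    };

    /* Guest 1's device tree: the next 2 GB */
    memory@100000000 {
        device_type = "memory";
        reg = <0x1 0x00000000 0x0 0x80000000>;   /* base 0x100000000, size 2 GB */
    };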
Best wishes.
Bare-metal hypervisors are type 1 hypervisors, and ARM provides very good support for them in ARMv7 (with the Virtualization Extensions) and later. The way it deals with address translation is to have two stages: the hypervisor sets up its own stage, and the result looks like bare metal as far as the guest OS is concerned. Addresses are first translated by the guest's tables and then by the hypervisor's. One could set up the hypervisor tables so that each guest had a view which looked like part of the real memory, with guest physical addresses equal to the machine physical addresses. However, it sounds like you want to remove the hypervisor tables altogether and still have full protection between the guest OSes.
I guess that might just be possible if the hypervisor protects the areas a guest OS uses for its translation tables and checks any changes. Alternatively it could use paravirtualization, where a guest has to call the hypervisor every time it wants to do anything like that. Either way it sounds like a lot more work. I suppose there is some saving in address translation, but not all that much, as the stages are cached separately. Is that really your intention?
Thanks Daith,
Yes, my intention is to remove the hypervisor translation tables so that the output of the first stage translation (the intermediate physical address, IPA) is directly equivalent to the machine physical address, i.e. there is no hypervisor second stage translation. Furthermore, each of the two guest OSes would see and access only its own portion of the physical memory, by virtue of the modified device tree. My motivation is not only performance (though I believe performance is one of the benefits).
I am only interested in running unmodified kernels, without any kind of paravirtualization.
What exactly do you mean by the stages being cached separately? Is that TLB caching of the translations? In that case, wouldn't performance depend on the workload's memory access patterns, e.g. the regularity and locality of memory accesses?
Please let me know (ARM community) what exactly you think about my initial two questions, as they are fundamental to my project. My initial experimentation and slight code modifications seem somewhat encouraging, but I am not very sure.
Thank you so much.
One of the advantages of stage 2 translation is that it prevents a guest (accidentally or maliciously) from accessing the resources of other guests or the hypervisor. Without it, you are relying on the two guests being well behaved. I think it is fair to say that it is unlikely an OS would map in addresses it believed had nothing at them, but it is certainly not impossible. The question becomes: why are you using virtualization to host two guests in the first place? If, for example, it is for sandboxing, then relying on them being well behaved doesn't seem like a great idea.
Second stage translation does not need to be a big overhead, and it doesn't stop you from having flat-mapped addresses (i.e. IPA == PA). You said you want to give each guest 2 GB of contiguous RAM. The translation table format gives the option of 1 GB blocks, so that's just two entries per guest.
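To make that concrete, here is a minimal sketch of what the stage 2 level 1 table could look like for one guest. It assumes AArch64 (VMSAv8-64) stage 2 descriptors with a 4 KB granule and a made-up GUEST_RAM_BASE; VTCR/VTTBR programming, device mappings and cache maintenance are left out:

    #include <stdint.h>

    /* Stage 2 level 1 block descriptor fields (VMSAv8-64, 4 KB granule):
     * bits [1:0] = 0b01 selects a 1 GB block at level 1. */
    #define S2_BLOCK       (1ULL << 0)    /* valid, block (bit 1 clear)       */
    #define S2_MEMATTR_WB  (0xFULL << 2)  /* MemAttr: Normal, write-back      */
    #define S2_AP_RW       (3ULL << 6)    /* S2AP: read/write                 */
    #define S2_SH_INNER    (3ULL << 8)    /* inner shareable                  */
    #define S2_AF          (1ULL << 10)   /* access flag                      */
    #define S2_ATTRS       (S2_BLOCK | S2_MEMATTR_WB | S2_AP_RW | \
                            S2_SH_INNER | S2_AF)

    #define GUEST_RAM_BASE 0x80000000ULL  /* assumed base of this guest's RAM */
    #define GIB            (1ULL << 30)

    /* Level 1 covers IPA bits [38:30]: 512 entries of 1 GB each. */
    static uint64_t s2_l1[512] __attribute__((aligned(4096)));

    static void build_flat_stage2(void)
    {
        /* Two 1 GB block entries, IPA == PA, exactly as described above. */
        for (int i = 0; i < 2; i++) {
            uint64_t pa = GUEST_RAM_BASE + (uint64_t)i * GIB;
            s2_l1[pa / GIB] = pa | S2_ATTRS;
        }
        /* Every other entry stays zero (invalid), so any access outside the
         * guest's 2 GB takes a stage 2 fault instead of reaching the other
         * guest's memory. */
    }

So the guest still sees its RAM at the real physical addresses, and you keep the hardware protection for free.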
The overheads of the extra translation are very small: the second stage will be cached, and the whole translation will be cached in the TLB after the first access. What you are talking about is a very difficult thing to do, which is the reason the hardware support was put in. To ensure the guest OSes do not access other areas, you'd need to make all translation tables inaccessible to the guest OS and interpret every access to them in the hypervisor. By contrast, with stage 2 the TLB does not even have to be flushed when switching between guest OSes, as the VMID can be used to identify each one.
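To illustrate the VMID point, switching guests is then little more than repointing VTTBR_EL2. A sketch assuming AArch64, 16-bit VMIDs (VTCR_EL2.VS = 1) and a hypothetical struct guest, with the rest of the world switch (GPRs, timers, GIC state) omitted:

    #include <stdint.h>

    /* Hypothetical per-guest state; just what the address-space switch needs. */
    struct guest {
        uint64_t s2_table_pa; /* physical address of the guest's stage 2 tables */
        uint64_t vmid;        /* tags this guest's entries in the TLB           */
    };

    static void switch_stage2(const struct guest *g)
    {
        /* VTTBR_EL2: VMID in bits [63:48] (with 16-bit VMIDs enabled),
         * stage 2 table base in the low bits. */
        uint64_t vttbr = (g->vmid << 48) | g->s2_table_pa;
        __asm__ volatile("msr vttbr_el2, %0" : : "r"(vttbr));
        __asm__ volatile("isb");
        /* No TLB flush: stale entries tagged with the other guest's VMID
         * simply never match after the switch. */
    }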
Thanks daith and Martin,
@daith: yes, I agree that the overhead of second stage translation is very small, and that's why the hardware second stage page-table walk was introduced (let's set aside exactly how small). Further, you mentioned that my goal is a very difficult thing to do; why is that, in your opinion? What are the challenges I will be facing?
Please let me state briefly that my goal is to host two OSes with the hypervisor almost completely disengaged. The scheduled OS is itself responsible for its own memory management, interrupt handling and all other privileged operations. My bare-metal hypervisor does nothing but periodically schedule one of the two OSes based on a timer interrupt; any kernel access to this timer is trapped to the hypervisor, and so on.
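In case it helps, this is roughly how I arranged the timer trapping; a sketch assuming AArch64 and the Generic Timer with HCR_EL2.E2H = 0 (on ARMv7 the equivalent control is CNTHCTL via CP15):

    #include <stdint.h>

    /* CNTHCTL_EL2 bits (when HCR_EL2.E2H == 0):
     *   EL1PCTEN (bit 0): EL1 may read the physical counter.
     *   EL1PCEN  (bit 1): EL1 may access the physical timer registers.
     * Clearing both makes guest accesses to the physical timer/counter trap
     * to EL2, so the hypervisor keeps its scheduling timer to itself. */
    static inline void trap_guest_physical_timer(void)
    {
        uint64_t cnthctl;
        __asm__ volatile("mrs %0, cnthctl_el2" : "=r"(cnthctl));
        cnthctl &= ~((1ULL << 0) | (1ULL << 1)); /* clear EL1PCTEN, EL1PCEN */
        __asm__ volatile("msr cnthctl_el2, %0" : : "r"(cnthctl));
        __asm__ volatile("isb");
    }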
Martin, your idea of having flat-mapped addresses (i.e. IPA == PA) together with the hypervisor translation is very interesting. Could you please elaborate on how to integrate it into my design to achieve the above-mentioned design goal?
If the hypervisor did not provide a second level of translation, it would have to inspect and monitor all accesses to the guest OS's page tables. Otherwise the guest OS could just write in the physical address of an area it should not access, or try accessing some I/O device it should not touch. This would probably mean checking the page tables whenever the guest OS loads a translation table base, and protecting those tables from reads and writes. The hypervisor would then need to catch the fault when any instruction accessed a table, and emulate the instruction after checking that it was admissible - i.e. for a write, that it only allowed access to areas the guest OS was supposed to reach, and for a read, that the guest got back what it expected rather than noticing any changes the hypervisor had made to disable access.
The second level of translation is invisible to the guest OS. It provides a virtual environment for the OS, so even though the guest can set up and change its own page tables, what it thinks of as a physical address is in fact checked and translated by the second level set up by the hypervisor.
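To give a feel for what "catch and emulate" means in practice, here is a very rough sketch of the kind of handler you would end up writing; every name in it (struct s2_fault, pte_output_address, guest_pa_allowed, complete_emulated_store, inject_fault) is made up for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical fault record filled in by the low-level trap entry code. */
    struct s2_fault {
        uint64_t table_pa;      /* the write-protected page table being written */
        uint64_t faulting_insn; /* the store instruction that trapped           */
    };

    uint64_t pte_output_address(uint64_t pte);            /* hypothetical decoder */
    bool     guest_pa_allowed(int guest_id, uint64_t pa); /* hypothetical policy  */
    void     complete_emulated_store(int guest_id, uint64_t new_pte,
                                     const struct s2_fault *f);  /* hypothetical  */
    void     inject_fault(int guest_id);                  /* hypothetical         */

    /* Called on every trapped write to a guest page table. 'new_pte' is the
     * value the guest tried to store, recovered by decoding the instruction.
     * Every such write costs a trap, a decode, a policy check and an emulated
     * store - work the stage 2 walk otherwise does in hardware. */
    void handle_pagetable_write(int guest_id, uint64_t new_pte,
                                const struct s2_fault *f)
    {
        if (!guest_pa_allowed(guest_id, pte_output_address(new_pte))) {
            inject_fault(guest_id); /* guest tried to map memory outside its region */
            return;
        }
        complete_emulated_store(guest_id, new_pte, f);
    }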
By the way, here are some figures on overheads:
Xen on ARM - How fast is it really?
As you can see, the double translation isn't the problem; the overheads come from virtualizing interrupts and I/O.