I am currently working on creating a shared virtual address space on Linux arm64 on a Xilinx Zynq UltraScale+ board. Eventually it should be possible to share pointers/addresses between the Cortex-A53s and the FPGA, using the built-in ARM SMMU-500 (IOMMU) and the Cache Coherent Interconnect (CCI), without any user action being necessary.
To do so, I took the drivers/iommu/arm-smmu.c IOMMU driver from Linux kernel v4.14.0 and modified it to remove most of the virtualization abstractions, adding a separate custom kernel module and ioctls. With this, each process gets its own individual SMMU context bank holding its own AArch64 page table. It is already possible to successfully read and write data from the FPGA through virtual addresses using a separate page table, by manually mapping the allocated pages to the same virtual addresses again.
It would be more convenient if the MMU and SMMU could share the same page table, thus avoiding the unnecessary setup of a second, redundant page table. To do so, I made the following changes:
- Configure the SMMU to use a 39-bit VA size and a 40-bit PA size (4 KB page size)
- Take the PGD pointer from the current task's task_struct (current->mm->pgd) and write it into the page table base register of the corresponding SMMU context bank
All other SMMU hardware configurations are the same as in arm-smmu.c.
However, this leads to non-deterministic behavior. A test case where the FPGA reads multiple values and writes them back to a different location through the shared virtual addresses only works sometimes. I implemented the test program to intentionally pause after setting up and initializing all necessary data structures, just before the FPGA is instructed to transfer the values. The longer the pause, the more values are transferred correctly; after a pause of about 10 seconds the test always succeeds. To me this looks like a cache issue: the updated page table entries (PTEs) written by Linux are still sitting in the CPU caches, so the SMMU reads stale ones, which results in no transfers (but also no translation error/fault in the SMMU). So either I have to clean/flush the cache at the correct location in the Linux source code, or change some SMMU flags (context bank, stream-to-context register, ...), MMU flags, or the memory/shareability attributes of the PTEs.
I already discovered that the MAIR was set up differently for the SMMU. I changed the SMMU code to match the MMU's MAIR and to use the correct memory attribute index where necessary, but that didn't help. Furthermore, I also checked chapter 1.5.2, "Differences between ARM architecture and SMMU translation schemes", in the ARM SMMU v2 architecture specification.
What confuses me the most is that manually built page tables work correctly, while the Linux-generated ones cause the non-deterministic behavior described above.
Any information or hints on how to correctly set up the SMMU to use/share page tables generated by Linux, or on how to change the way Linux configures its page tables so that they work properly with the SMMU, would be greatly appreciated.
Thank you so much!
I'll start by saying that I'm not a Linux expert and don't know how its SMMU driver works. But a couple of thoughts...

Caching: Have you set the cacheability of the translation table walks (in SMMU_CBn_TCR) correctly?

Demand paging: One problem I can foresee is Linux demand-allocating the memory to user space. On the processor, hitting an unallocated or paged-out page causes the kernel to page it in, a process that appears transparent to user space. On the SMMU side, you can configure the SMMU to return an error or to hold the transaction and send an interrupt. If it works after you've run for some time, it's possible the pages have simply been allocated by that point.
I think paging could cause you other problems with this approach. Based on my admittedly limited experience of drivers, memory used by an external master is usually pinned (not pageable), as external masters tend to be less tolerant of paging. This approach would require either pinning all of the user-space task's memory or having the SMMU driver deal with paging.
What you describe is called Shared Virtual Addressing (SVA) or Shared Virtual Memory (SVM).
See for example https://lwn.net/Articles/747230/