Dear all,
I am working on an ARMv8 server with two GPU cards in it. Recently I needed to test PCIe peer-to-peer communication between the two GPU cards, but the throughput was only 4 GB/s.
After digging into the GPU's kernel-mode driver, I found that it uses the dma_map_resource() API to map the peer device's MMIO space. The ARM IOMMU DMA layer then hardcodes the IOMMU_MMIO prot in the subsequent DMA mapping:
static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
		size_t size, enum dma_data_direction dir, unsigned long attrs)
{
	return __iommu_dma_map(dev, phys, size,
			dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
			dma_get_mask(dev));
}
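For context, a peer-to-peer capable driver typically resolves the peer's BAR address and maps it along these lines (a minimal sketch; the function and variable names are illustrative, not the actual GPU driver code):

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/*
 * Minimal sketch: map a peer PCI device's BAR so that 'dev' can DMA to it.
 * Names here are illustrative only.
 */
static dma_addr_t map_peer_bar(struct device *dev, struct pci_dev *peer,
			       int bar)
{
	phys_addr_t phys = pci_resource_start(peer, bar);
	size_t size = pci_resource_len(peer, bar);

	/*
	 * On arm64 with an IOMMU, this ends up in iommu_dma_map_resource()
	 * above, which unconditionally ORs in IOMMU_MMIO.
	 */
	return dma_map_resource(dev, phys, size, DMA_BIDIRECTIONAL, 0);
}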
That ultimately sets the ARM_LPAE_PTE_MEMATTR_DEV attribute (or, on the stage-1 path, the Device memory attribute index) in the PTE, which hurts the performance of PCIe peer-to-peer transactions:
	/*
	 * Note that this logic is structured to accommodate Mali LPAE
	 * having stage-1-like attributes but stage-2-like permissions.
	 */
	if (data->iop.fmt == ARM_64_LPAE_S2 ||
	    data->iop.fmt == ARM_32_LPAE_S2) {
		if (prot & IOMMU_MMIO)
			pte |= ARM_LPAE_PTE_MEMATTR_DEV;
		else if (prot & IOMMU_CACHE)
			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
		else
			pte |= ARM_LPAE_PTE_MEMATTR_NC;
	} else {
		if (prot & IOMMU_MMIO)
			pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
		else if (prot & IOMMU_CACHE)
			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
	}
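For reference, on the stage-1 path the attribute index selects an entry in the MAIR register; as far as I can tell, the relevant encodings in drivers/iommu/io-pgtable-arm.c (v5.10) are:

#define ARM_LPAE_MAIR_ATTR_DEVICE	0x04	/* Device-nGnRE */
#define ARM_LPAE_MAIR_ATTR_NC		0x44	/* Normal Non-cacheable */
#define ARM_LPAE_MAIR_ATTR_WBRWA	0xff	/* Normal Write-back RWA */
#define ARM_LPAE_MAIR_ATTR_IDX_NC	0
#define ARM_LPAE_MAIR_ATTR_IDX_CACHE	1
#define ARM_LPAE_MAIR_ATTR_IDX_DEV	2

Device-nGnRE memory disallows gathering (write merging) and reordering; my guess is that this prevents the peer-to-peer writes from being combined into full-size PCIe transactions, which would explain the low throughput.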
As an experiment, I removed the IOMMU_MMIO prot from iommu_dma_map_resource() and recompiled the Linux kernel; the throughput then reached up to 28 GB/s.
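The experimental change was essentially the following (shown only to illustrate what I tested, not as a proposed fix):

static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
		size_t size, enum dma_data_direction dir, unsigned long attrs)
{
	/* Experimental: IOMMU_MMIO no longer ORed into the prot. */
	return __iommu_dma_map(dev, phys, size,
			dma_info_to_prot(dir, false, attrs),
			dma_get_mask(dev));
}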
Is there an elegant way to solve this issue without modifying the Linux kernel, e.g., a substitute for the dma_map_resource() API?
Thank you!
Linux kernel version: 5.10
PCIe Gen4 x16
www.spinics.net/.../msg4944197.html