I read the SMMU architecture specification at https://developer.arm.com/documentation/ihi0070/latest/ but couldn't find any details about how it handles huge pages.
Let's say I wanted to speed up an accelerator on my custom SoC by requesting (and successfully getting) a 256MB page, since I know I'm going to be accessing that range a lot. Does the SMMU have any optimizations built in to take advantage of that? My thinking is that it should only need one translation entry to cover the whole 256MB page, but I'm not sure how that works in relation to the supported translation granules (4KB, 16KB and 64KB).
At the architecture level, it's not really that different to the story on the CPU.
The SMMU architecture supports the same granule, block and page sizes as the CPU architecture (to allow sharing of translation tables). There is also the contiguous hint, which lets software tell hardware that an aligned group of 'n' translations is contiguous in physical as well as virtual address space.
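To make that concrete, here is a worked breakdown of a 256MB region under each granule (my numbers, taken from the block and contiguous-span sizes in the Armv8-A VMSA, not from this thread):

- 4KB granule: 128 x 2MB level-2 blocks. The contiguous hint groups 16 entries, so you get 8 hinted spans of 32MB each; a TLB that honours the hint needs only 8 entries for the whole range.
- 16KB granule: 8 x 32MB level-2 blocks. The level-2 contiguous hint groups 32 entries (1GB), so it can't be used for only 8 blocks.
- 64KB granule: the only block size is 512MB, which is too big, so you'd fall back to 4096 x 64KB pages, with the hint grouping 32 pages into 128 spans of 2MB.

So a single entry covering all 256MB isn't architecturally expressible with these granules, but the 4KB-granule case with the hint gets you down to 8 hinted spans.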
For the CPU or SMMU, software would ideally map the allocated memory using the largest block/page size available, possibly also using the contiguous hint.
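To illustrate what "doing that" looks like, here is a minimal C sketch (mine, not from any Arm code) that fills stage-1 level-2 block descriptors for a 256MB range with the 4KB granule and sets the contiguous hint on each entry. The bit positions follow the Armv8-A long-descriptor format; the function name, table layout and attribute index are placeholders for illustration.

#include <stdint.h>
#include <stddef.h>

/* Armv8 long-descriptor stage 1, 4KB granule, level-2 block (2MB).
 * Bit positions are from the Armv8-A VMSA; everything else here
 * (names, attribute index, alignment policy) is made up. */
#define DESC_VALID      (UINT64_C(1) << 0)   /* descriptor is valid          */
#define DESC_BLOCK      (UINT64_C(0) << 1)   /* bit[1]=0 at level 2 => block */
#define DESC_ATTRIDX(i) ((uint64_t)(i) << 2) /* MAIR attribute index         */
#define DESC_AP_RW      (UINT64_C(0) << 6)   /* privileged read/write        */
#define DESC_SH_INNER   (UINT64_C(3) << 8)   /* inner shareable              */
#define DESC_AF         (UINT64_C(1) << 10)  /* access flag                  */
#define DESC_CONTIG     (UINT64_C(1) << 52)  /* contiguous hint              */

#define BLOCK_2MB       (UINT64_C(1) << 21)

/* Map a 256MB range as 128 x 2MB blocks, contiguous hint set on each.
 * l2_table points at the level-2 table covering 'va'. Both va and pa
 * are assumed 256MB-aligned, so every group of 16 hinted entries is
 * aligned to its 32MB span size, as the architecture requires, and all
 * 128 entries land in one level-2 table. */
static void map_256mb(uint64_t *l2_table, uint64_t va, uint64_t pa)
{
    size_t first = (va >> 21) & 0x1FF;        /* level-2 index of va */
    for (size_t i = 0; i < 128; i++) {
        l2_table[first + i] = (pa + i * BLOCK_2MB)
                            | DESC_VALID | DESC_BLOCK
                            | DESC_ATTRIDX(0) | DESC_AP_RW
                            | DESC_SH_INNER | DESC_AF
                            | DESC_CONTIG;
    }
}

One architectural point worth noting: all entries in a hinted group must have identical attributes and contiguous, group-aligned output addresses, which is why the sketch applies the same attribute bits to every descriptor.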
Now, what a given CPU or SMMU implementation actually does in response is micro-architectural and can vary between implementations.
Gotcha. Perhaps I posted this in the wrong forum. I wanted to know what ARM's MMU-700 does with large pages and whether it has optimizations to save TBU TLB entries when encountering them.
I'm not familiar with the MMU-700, but looking at the TRM, it says:
"Optimization enables storage of all architecturally‐defined page and block sizes, including contiguous page and block entries, as a single entry in the TBU and TCU TLBs (WCs)"
https://developer.arm.com/documentation/101542/0102/Overview-of-MMU-700/Features?lang=en
So it seems the answer is: as long as software does the right thing with its mappings and contiguous bits, the MMU-700 can take advantage of it.
Awesome, thanks!