AArch64 TLB maintenance requirements

Hello all,
I want to improve VM operation in the AArch64 port of FreeBSD, but I am stuck on the following problem.
The FreeBSD VM subsystem can map various *kernel* objects using superpage (higher-order) mappings. But during the lifecycle of these objects (because of COW and the like), we must be able to break these superpage mappings back into normal (lower-order) pages. Unfortunately, these objects can contain vital kernel data (kernel stacks of other threads, etc.), so we must be able to do this atomically, without the standard break-before-make approach: in an SMP environment it is impossible to temporarily unmap these objects, and any attempt at serialization is counterproductive.

Let me give you an exact example:
Assume that we are talking about stage 1 translation only, with a 4kB translation granule and the contiguous bit not used. I have a 2MB level 2 block mapping and I want to break it into an equivalent (by size, type and attributes) page mapping. So the system prepares a fully populated level 3 page table with equivalent page table entries, then atomically swaps the level 2 block entry with the appropriate page table pointer entry and flushes the TLB.
This approach looks safe to me: if a given PE already has the block mapping cached in its TLB, it uses it for any address within the 2MB block; if not, it does a table walk and uses the new table entry. Also, this cannot confuse any page table walk already running on this or another PE. There is nothing here that could lead to the multiple-TLB-entries undefined behavior.
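The preparation step described above can be sketched as plain descriptor arithmetic. This is a minimal sketch, assuming the 4 KB granule VMSAv8-64 descriptor layout; the function names and the simplified attribute mask are illustrative assumptions, not FreeBSD pmap code, and the sketch does not answer whether the final swap is allowed without break-before-make.

```c
#include <assert.h>
#include <stdint.h>

#define L3_ENTRIES        512u                   /* 4 KB granule: 512 entries per table */
#define PAGE_SIZE_4K      0x1000ull
#define DESC_TYPE_MASK    0x3ull
#define DESC_BLOCK        0x1ull                 /* level 1/2 block descriptor */
#define DESC_TABLE        0x3ull                 /* level 0-2 table descriptor */
#define DESC_PAGE         0x3ull                 /* level 3 page descriptor */
#define L2_BLOCK_OA_MASK  0x0000ffffffe00000ull  /* output address bits [47:21] */
#define TABLE_ADDR_MASK   0x0000fffffffff000ull  /* next-level table address bits [47:12] */
/* Assumption: simplified mask, lower attributes [11:2] plus upper bits [63:50]. */
#define ATTR_MASK         (0xfffc000000000000ull | 0x0000000000000ffcull)

/*
 * Fill a level 3 table so that it maps the same 2 MB range, with the same
 * attributes, as the given level 2 block descriptor.
 * Returns 0 on success, -1 if the input is not a block descriptor.
 */
int l2_block_to_l3_table(uint64_t l2_block, uint64_t l3[L3_ENTRIES])
{
    if ((l2_block & DESC_TYPE_MASK) != DESC_BLOCK)
        return -1;

    uint64_t base  = l2_block & L2_BLOCK_OA_MASK;
    uint64_t attrs = l2_block & ATTR_MASK;

    for (uint32_t i = 0; i < L3_ENTRIES; i++)
        l3[i] = (base + (uint64_t)i * PAGE_SIZE_4K) | attrs | DESC_PAGE;
    return 0;
}

/* Build the level 2 table descriptor that points at the level 3 table. */
uint64_t make_l2_table_desc(uint64_t l3_phys)
{
    return (l3_phys & TABLE_ADDR_MASK) | DESC_TABLE;
}
```

In a kernel, the level 3 table would have to be made visible (with a barrier) before the single 64-bit store that installs the value returned by `make_l2_table_desc()`.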

But “D4.10.1 General TLB maintenance requirements” of the AArch64 ARM (Architecture Reference Manual) confuses me. The only (loosely) related part of this chapter is:
----------------------------------------------------------------------
Using break-before-make when updating translation table entries:
To avoid possibly creating multiple TLB entries for the same address … the architecture requires the use of a break-before-make sequence when changing translation table entries whenever multiple threads of execution can use the same translation tables and the change to the translation table entries involves any of:
…
- A change to the size of block used by the translation system. This applies both:
— When changing from a smaller size to a larger size, for example by replacing a table mapping with a block mapping in a stage 2 translation table.
— When changing from a larger size to a smaller size, for example by replacing a block mapping with a table mapping in a stage 2 translation table.
----------------------------------------------------------------------

There are some confusing items.
- What exactly is the “size of block used by the translation system”? The size of the memory mapped by a given entry/table? Or the size of the entry/table itself?
- Why do both paragraphs explicitly mention “stage 2”? Does this mean that they apply to stage 2 only? I’m pretty sure that they also apply to stage 1. Moreover, replacing a table mapping with a block mapping requires break-before-make in any case, even if the block sizes are equal.

So, please, can anybody confirm that the proposed approach (replacing a block mapping with an equivalent table mapping) without break-before-make is safe from an architectural point of view? Or, if break-before-make is required by the AArch64 ARM, can you give me an example of a failing path?

Many thanks,
Michal Meloun

42Bastian Schick over 6 years ago

First of all: I have only just started to dig into the AArch64 MMU architecture, so I might be totally wrong.

Michal Meloun said:
Assume that we are talking about stage 1 translation only, with a 4kB translation granule and the contiguous bit not used. I have a 2MB level 2 block mapping and I want to break it into an equivalent (by size, type and attributes) page mapping. So the system prepares a fully populated level 3 page table with equivalent page table entries, then atomically swaps the level 2 block entry with the appropriate page table pointer entry and flushes the TLB.

This sounds safe if the mapping is used on only one PE. But if you have threads on different PEs which use the same (process-wide) mapping, one PE may use the fine-grained mapping while another uses the coarse one.

So a TLB flush on one PE will affect different TLB entries on different PEs.

IMHO, you first need to invalidate the TLB entry (which will propagate to the other PEs, AFAIK), and then switch the block to the level 3 pointer. I see no problem with this _if_ you use _one_ page table per process (no matter on which PE). If you have local page tables, then ... no idea.

a.surati over 6 years ago

Would break-before-make (i.e. temporary unmapping), with the other PEs spinning on inter processor interrupts while the demotion/promotion is being carried out, work?
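For reference, the break-before-make ordering asked about here can be sketched as below. This is a host-runnable sketch under stated assumptions: `struct bbm_ops`, `bbm_update()` and the trace hooks are hypothetical names, and on real hardware the hooks would wrap the barrier and TLB invalidation instructions (for example DSB and a broadcast TLBI). Between the "break" and the "make", any access to the range from another PE faults, which is why the question above has the other PEs held in an IPI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hooks; a kernel would implement these with DSB/TLBI/ISB. */
struct bbm_ops {
    void (*barrier)(void *ctx);                      /* e.g. DSB ISH */
    void (*tlb_invalidate)(void *ctx, uint64_t va);  /* e.g. broadcast TLBI, then DSB */
    void *ctx;
};

/* Replace *entry with new_desc using the break-before-make ordering. */
void bbm_update(volatile uint64_t *entry, uint64_t new_desc, uint64_t va,
                const struct bbm_ops *ops)
{
    *entry = 0;                         /* break: install an invalid descriptor */
    ops->barrier(ops->ctx);             /* invalid entry visible to table walkers */
    ops->tlb_invalidate(ops->ctx, va);  /* drop cached copies of the old mapping */
    *entry = new_desc;                  /* make: install the new descriptor */
    ops->barrier(ops->ctx);             /* new entry visible before later accesses */
}

/* Trace hooks, used only to show the ordering on a host machine. */
struct bbm_trace {
    const volatile uint64_t *entry;  /* entry being watched */
    uint64_t seen_at_tlbi;           /* entry value when the TLBI hook ran */
    char steps[8];                   /* 'B' = barrier, 'T' = TLB invalidate */
    int n;
};

void trace_barrier(void *ctx)
{
    struct bbm_trace *t = ctx;
    if (t->n < 7)
        t->steps[t->n++] = 'B';
}

void trace_tlbi(void *ctx, uint64_t va)
{
    struct bbm_trace *t = ctx;
    (void)va;                        /* address is unused by the trace */
    t->seen_at_tlbi = *t->entry;     /* must already be the invalid descriptor */
    if (t->n < 7)
        t->steps[t->n++] = 'T';
}
```

The point of the ordering is that the TLB invalidation runs while the entry is invalid, so no PE can re-cache the old mapping before the new one is installed.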

42Bastian Schick over 6 years ago in reply to a.surati

I see no reason why not. I think it is important to invalidate the TLB before setting the new PTE.

Michal Meloun over 6 years ago in reply to 42Bastian Schick

No, I am afraid that this is not the right approach. A flushed TLB entry can be loaded back immediately after the flush, as a result of a speculative access on any PE or a regular access by another PE.

Michal Meloun over 6 years ago in reply to a.surati

Of course, I can do this. But the cost of a synchronous IPI is very high, and I’m pretty sure that it simply negates the benefit of section mapping (lower pressure on the TLB).

Michal Meloun over 6 years ago

I have read D4.10.1 again and again, but I can't put all these pieces together :(
Simply:
- What is meant by the phrase “size of block used by the translation system”?
- Does the architecture require a break-before-make sequence for pure page demotion? If yes, why?
Again, I fully understand that break-before-make is required for promotion, but I don’t see a single reason why it should also be necessary for demotion.

a.surati over 6 years ago in reply to Michal Meloun

As per the manual, break-before-make (BBM) is required during demotion (which they refer to as the "change of block size from large to small"). Some more info here.

Based on my understanding of the manual, it seems that the facilities provided by the Contiguous bit, and by the block descriptors, can be implemented through a single, common mechanism of storing a single TLB entry which covers a contiguous VA-PA mapping range.

Then, the size of this range == size of the block.

By resetting the Contiguous bit, or by replacing a block descriptor with a table descriptor, we change the block size from large to small.

By setting the Contiguous bit, or by replacing a table descriptor with a block descriptor, we change the block size from small to large.
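The reading above, where "size of the block" means the range a single TLB entry may cover, can be made concrete with a little arithmetic. This is a minimal sketch, assuming a 4 KB granule (where the Contiguous bit groups 16 adjacent entries) and considering only levels 1 to 3; the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CONTIG_ENTRIES_4K 16u  /* 4 KB granule: Contiguous bit spans 16 adjacent entries */

/*
 * Bytes mapped by one block/page descriptor at the given level (4 KB granule):
 * level 1 = 1 GB, level 2 = 2 MB, level 3 = 4 KB.
 * Returns 0 for levels this sketch does not cover.
 */
uint64_t desc_span(int level)
{
    if (level < 1 || level > 3)
        return 0;
    return 1ull << (12 + 9 * (3 - level));
}

/* Largest range a single TLB entry may cover for such a descriptor. */
uint64_t tlb_entry_span(int level, bool contiguous)
{
    uint64_t span = desc_span(level);
    return contiguous ? span * CONTIG_ENTRIES_4K : span;
}
```

Under this reading, demoting a level 2 block to level 3 pages changes the value from 2 MB to 4 KB (or to 64 KB if the new pages set the Contiguous bit), and either way it counts as a change of block size.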

From Armv8.1 onwards, the hardware can (optionally) update the Access and Dirty flags of a translation table entry. Not following BBM could cause the PEs to disagree about the type of the descriptor held. The "Using break-before-make..." subsection says the hardware is allowed to fail in its management of these flags if BBM is not followed. This looks like a reason not to skip BBM even for an equivalent demotion.

The varying levels of TTRem support seem to relax the requirement of needing BBM when modifying the size of the block alone.

Edit: About BBM in Linux.

Edit2: More discussion about BBM: here, here, here, here, here.

Edit3: Arm's (Mark R) response to uboot's lack of BBM in its split_block.

Michal Meloun over 6 years ago in reply to a.surati

First, sorry for the delayed response. Last weekend was slightly hectic for me.

a.surati said:
Then, the size of this range == size of the block.

Yeah, I see, THANKS!!! This is exactly what I wanted. It's so obvious and yet I could not understand it :(

So yes, the “size of block used by the translation system” is the size of the mapping provided by a single TLB entry (which can differ from the size of the mapping provided by a block or table entry because of contiguous blocks).
Thank you so much again, I really needed this hint.

a.surati said:
From Armv8.1 onwards, the hardware can (optionally) update the Access and Dirty flags of a translation table entry. Not following BBM could cause the PEs to disagree about the type of the descriptor held. The "Using break-before-make..." subsection says the hardware is allowed to fail in its management of these flags if BBM is not followed. This looks like a reason not to skip BBM even for an equivalent demotion.

At this time we don’t use it. Moreover, in FreeBSD, all kernel pages are mapped as RO + accessed or RW + dirty + accessed. So I think we can safely assume that these bits will never be set/reset by HW.
I cannot call the combination of hardware updates and contiguous blocks anything other than a “real devil”. And believe me, I will always stay as far as possible from any system that uses both of these features at once :)

If I may narrow my original question: assume that neither the original block nor the new page translation entries are part of a contiguous block, and that HW updates are off.

How can my very limited case of “demotion” of a block mapping without BBM lead to the multiple TLB match problem?

If the given block mapping is already fetched into the TLB, then this TLB entry covers the full affected range and the TTR has no reason to do another page table walk. If it is not, or if it has been evicted (for any reason: speculation, explicit flush from another PE, …), then store/load ordering ensures that the TTR cannot get incomplete data or fetch the old block mapping back.
This is very important for me; we have used the same mechanism on ARMv7 for years without any issue (yes, this is not an argument, I know).

All the Linux issues from your links are related to manipulation of contiguous blocks, so there BBM is the only valid solution; that is clear.

a.surati over 6 years ago in reply to Michal Meloun

I can't think of an example to give.

Your point of view and experience are indeed shared by others: for instance, one of the links admits to assuming that there is (very) little risk of a TLB conflict when splitting/demoting a block entry into an equivalent table entry. It also says that the assumption is not endorsed by the manual/architecture, and that it is complicated further if the block sizes differ between stage 1 and stage 2 translations.

UBoot's armv8 split_block function runs under the same assumption.

-------

Since the architecture asks for BBM during demotion, it does not forbid an implementation which breaks when the requirement is violated (admittedly, only for a short duration until the TLB invalidation arrives).

It seems that the consideration of the violation is noteworthy when the hardware is permitted to /write/ to the entries in memory (which then races with the software's intentions).

Michal Meloun over 6 years ago in reply to a.surati

You're right.
The architecture simply requires BBM during demotion, and it is not reasonable to deny or ignore that, even though my mind wants the opposite result and even though all my testing passed :)
FreeBSD is not my own hobby project; there is no room for violating architecture rules.
Anyway, many thanks for your effort and help, and I apologize for my stubbornness.