AArch64 TLB maintenance requirements

Hello all,
I want to improve VM operation in the AArch64 port of FreeBSD, but I am stuck on the following problem.
The FreeBSD VM subsystem is capable of mapping various *kernel* objects using superpage (higher-order) mappings. But during the lifecycle of these objects (because of COW and so on), we must be able to break these superpage mappings back into normal (lower-order) pages. Unfortunately, these objects can contain vital kernel data (kernel stacks of other threads, etc.), so we must be able to do this operation atomically, without the standard break-before-make approach: in an SMP environment it is impossible to temporarily unmap these objects, and any attempt to use some sort of serialization is counterproductive.

Let me give you an exact example:
Assume that we are talking about stage 1 translation only, with a 4kB translation granule, and that the contiguous bit is not used. I have a 2MB level 2 block mapping and I want to break it into an equivalent (by size, type and attributes) page mapping. So the system prepares a fully populated level 3 page table with equivalent page table entries, then atomically swaps the level 2 block entry with the appropriate page table pointer entry and flushes the TLB.
The above approach looks safe to me: if a given PE already has the block mapping cached in its TLB, then it uses it for any address within the 2MB block; if not, then it does a table walk and uses the new table entry. Also, this cannot confuse any page table walks already running on this or another PE. There is nothing here that can lead to the multiple-TLB-entries undefined behavior.
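The sequence described above can be sketched in C roughly as follows. This is only a host-side illustration, not FreeBSD pmap code: the names `fill_l3_from_block`, `demote_without_bbm` and `tlb_flush_stub` are hypothetical, the attribute mask assumes the basic ARMv8.0 stage 1 descriptor layout with a 4kB granule, and the TLB flush is a counting stub standing in for the real barrier and TLBI sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define L3_ENTRIES      512u                    /* entries per level 3 table */
#define L3_PAGE_SIZE    4096ull                 /* 4kB granule */
#define DESC_TYPE_MASK  0x3ull
#define DESC_BLOCK      0x1ull                  /* level 2 block descriptor */
#define DESC_TABLE      0x3ull                  /* level 2 table descriptor */
#define DESC_PAGE       0x3ull                  /* level 3 page descriptor */
/* Assumed layout: upper attributes [63:52], lower attributes [11:2]. */
#define ATTR_MASK       ((0xfffull << 52) | (0x3ffull << 2))
#define L2_BLOCK_OA     0x0000ffffffe00000ull   /* output address [47:21] */
#define TABLE_PA_MASK   0x0000fffffffff000ull   /* next-level table [47:12] */

/* Hypothetical stand-in for the real flush (barriers + TLBI on hardware). */
static unsigned tlb_flush_calls;
static void tlb_flush_stub(void) { tlb_flush_calls++; }

/* Build 512 page entries equivalent to one 2MB block entry. */
static void fill_l3_from_block(uint64_t l2_block, uint64_t l3[L3_ENTRIES])
{
    assert((l2_block & DESC_TYPE_MASK) == DESC_BLOCK);
    uint64_t attrs = l2_block & ATTR_MASK;
    uint64_t base = l2_block & L2_BLOCK_OA;

    for (unsigned i = 0; i < L3_ENTRIES; i++)
        l3[i] = (base + i * L3_PAGE_SIZE) | attrs | DESC_PAGE;
}

/*
 * Swap the block entry for a table pointer in one store, then flush.
 * The level 3 table must be fully written before this is called; the
 * release store models that ordering here, while a real kernel would
 * need the appropriate hardware barrier.
 */
static void demote_without_bbm(_Atomic uint64_t *l2_slot, uint64_t l3_table_pa)
{
    atomic_store_explicit(l2_slot, (l3_table_pa & TABLE_PA_MASK) | DESC_TABLE,
        memory_order_release);
    tlb_flush_stub();
}
```

Note that the level 2 slot never holds an invalid descriptor in this sketch: it goes directly from a valid block to a valid table, which is the whole point of the question.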

But “D4.10.1 General TLB maintenance requirements” of the AArch64 ARM confuses me. The only (loosely) related part of this chapter is:
----------------------------------------------------------------------
Using break-before-make when updating translation table entries:
To avoid possibly creating multiple TLB entries for the same address … the architecture requires the use of a break-before-make sequence when changing translation table entries whenever multiple threads of execution can use the same translation tables and the change to the translation table entries involves any of:

- A change to the size of block used by the translation system. This applies both:
— When changing from a smaller size to a larger size, for example by replacing a table mapping with a block mapping in a stage 2 translation table.
— When changing from a larger size to a smaller size, for example by replacing a block mapping with a table mapping in a stage 2 translation table.
----------------------------------------------------------------------

There are some confusing items.
- What exactly is the “size of block used by the translation system”? The size of the memory mapped by a given entry/table? Or the size of the entry/table itself?
- Why do both paragraphs explicitly mention “stage 2”? Does this mean that they are applicable to stage 2 only? I’m pretty sure that they also apply to stage 1. Moreover, replacing a table mapping with a block mapping requires break-before-make in any case, even if the block sizes are equal.

So, please, can anybody confirm that the proposed approach (replacing a block mapping with an equivalent table mapping) without break-before-make is safe from an architectural point of view? Or, if the break-before-make approach is required by the AArch64 ARM, can you give me an example of a failing path?

Many thanks,
Michal Meloun

Parents
  • First, sorry for the delayed response. Last weekend was slightly hectic for me.

    Then, the size of this range == size of the block.

    Yeah, I see, THANKS!!! This is exactly what I wanted. It's so obvious and yet I could not understand it :(

    So yes, “size of block used by the translation system” is the size of the mapping provided by a single TLB entry (which can differ from the size of the mapping provided by a block or table entry because of contiguous blocks).
    Thank you so much again, I really needed this hint.

    Since armv8.1 onwards, the hardware can (optionally) update the Access and the Dirty flags of the translation table entry. Not following BBM could possibly cause disagreement between the PEs as to the type of the desc held. The "Using break-before-make..." subsection says the hardware is allowed to fail in its management of such flags (if BBM is not followed). This looks like a reason to not avoid BBM even for an equivalent demotion.

    At this time we don’t use it. Moreover, in FreeBSD, all kernel pages are mapped as RO + accessed or RW + dirty + accessed. So I think we can safely assume that these bits will never be set/reset by HW.
    I can do nothing but call the combination of hardware update and contiguous block a “real devil”. And believe me, I will always stay as far as possible from any system that uses these properties at once :)

    If I can narrow my original question: assume that neither the original block entry nor the new page translation entries are part of a contiguous block, and that HW updates are off.

    How can my very limited case of “demotion” of a block mapping without BBM lead to the multiple TLB match problem?


    If the given block mapping is already fetched into the TLB, then this TLB entry covers the full affected range and the TTR has no reason to do another page table walk. If it is not, or if it has been evicted (for any reason: speculation, explicit flush from another PE, …), then store/load ordering ensures that the given TTR cannot get incomplete data or fetch the old block mapping back.
    This is very important for me; we have used the same mechanism on ARMv7 for years without any issue (yes, this is not an argument, I know).

    All Linux issues from your links are related to manipulation of contiguous blocks, so there BBM is the only valid solution, that is clear.

Children
  • I can't think of an example to give.

    Your point of view and experience is indeed shared by others: for instance, one of the links admits of an assumption that there's (very) little risk of a TLB conflict when splitting/demoting a block entry into an equivalent table entry. It also says that the assumption is not endorsed by the manual/architecture, and that it is complicated further if the block sizes differ between stage1 and stage2 translations.

    U-Boot's armv8 split_block function runs under the same assumption.

    -------

    Since the architecture asks for BBM during demotion, it does not forbid an implementation which breaks when the requirement is violated (admittedly, only for a short duration until the tlb invalidation arrives).

    It seems that the consideration of the violation is noteworthy when the hardware is permitted to /write/ to the entries in memory (which then races with the software's intentions). 
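
    For comparison, the break-before-make ordering that the architecture asks for can be sketched as below. This is a host-side model only: `demote_with_bbm`, `tlb_invalidate_block_stub` and the step log are hypothetical names introduced for illustration, and the stub stands in for the real barrier and TLBI sequence that a kernel would issue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DESC_INVALID    0x0ull                  /* bit 0 clear: invalid entry */
#define DESC_TABLE      0x3ull                  /* level 2 table descriptor */
#define TABLE_PA_MASK   0x0000fffffffff000ull   /* next-level table [47:12] */

/* Small log so the ordering of the three steps can be checked. */
enum step { STEP_BREAK, STEP_INVALIDATE, STEP_MAKE };
static enum step step_log[8];
static unsigned step_count;

static void log_step(enum step s)
{
    if (step_count < 8)
        step_log[step_count++] = s;
}

/* Hypothetical stand-in for the hardware TLB invalidation of the block. */
static void tlb_invalidate_block_stub(uint64_t va)
{
    (void)va;
    log_step(STEP_INVALIDATE);
}

static void demote_with_bbm(_Atomic uint64_t *l2_slot, uint64_t va,
    uint64_t l3_table_pa)
{
    /* 1. Break: replace the old block entry with an invalid entry. */
    atomic_store_explicit(l2_slot, DESC_INVALID, memory_order_release);
    log_step(STEP_BREAK);

    /* 2. Invalidate any cached copy of the old block mapping. */
    tlb_invalidate_block_stub(va);

    /* 3. Make: install the pointer to the prepared level 3 table. */
    atomic_store_explicit(l2_slot, (l3_table_pa & TABLE_PA_MASK) | DESC_TABLE,
        memory_order_release);
    log_step(STEP_MAKE);
}
```

    Between steps 1 and 3 any access to the 2MB range faults, which is exactly the window the original question wants to avoid for kernel stacks and similar data.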

  • You're right.
    The architecture simply requires BBM during demotion, and it is not reasonable to deny or ignore it. Even though my mind wants the opposite result, and even though all my testing passed :)
    FreeBSD is not my own hobby project; there is no room for violating architecture rules.
    Anyway, many thanks for your effort and help, and I apologize for my stubbornness.