Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
From: Ryan Roberts
Date: Mon Apr 15 2024 - 11:53:47 EST
On 15/04/2024 16:22, David Hildenbrand wrote:
> On 15.04.24 17:17, Ryan Roberts wrote:
>> On 15/04/2024 15:58, David Hildenbrand wrote:
>>> On 15.04.24 16:34, Ryan Roberts wrote:
>>>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under
>>>>>>>>>> PTL" and "lockless walkers that never take the PTL".
>>>>>>>>>>
>>>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing
>>>>>>>>>> is arch-specific. That's not how arm64 does it (the HW broadcasts
>>>>>>>>>> the TLBIs). But you make that clear further down.
>>>>>>>>>
>>>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of
>>>>>>>>> page tables such that they can be walked safely. The TLB flush IPI
>>>>>>>>> is arch-specific and indeed to sync against PTE invalidation
>>>>>>>>> (before generic GUP-fast).
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>>>
>>>>>>>>>>> I think what has to happen is:
>>>>>>>>>>>
>>>>>>>>>>> (1) ptep_get_lockless() must return the same value as ptep_get()
>>>>>>>>>>> as long as there are no races. No removal/addition of
>>>>>>>>>>> access/dirty bits etc.
>>>>>>>>>>
>>>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL
>>>>>>>>>>> can handle serious "garbage PTEs". This is our page fault
>>>>>>>>>>> handler.
>>>>>>>>>>
>>>>>>>>>> This isn't really a property of a ptep_get_lockless(); it's a
>>>>>>>>>> statement about a class of users. I agree with the statement.
>>>>>>>>>
>>>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(),
>>>>>>>>> such as page fault handlers. Well, mostly "not GUP".
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL
>>>>>>>>>>> cannot handle arbitrary garbage PTEs. This is GUP-fast. Two
>>>>>>>>>>> options:
>>>>>>>>>>>
>>>>>>>>>>> (3a) ptep_get_lockless() can atomically read the PTE: We re-check
>>>>>>>>>>> later if the atomically-read PTE is still unchanged (without
>>>>>>>>>>> PTL). No IPI for TLB flushes required. This is the common case.
>>>>>>>>>>> HW might concurrently set access/dirty bits, so we can race with
>>>>>>>>>>> that. But we don't read garbage.
>>>>>>>>>>
>>>>>>>>>> Today's arm64 ptep_get() cannot guarantee that the access/dirty
>>>>>>>>>> bits are consistent for contpte ptes. That's the bit that
>>>>>>>>>> complicates the current ptep_get_lockless() implementation.
>>>>>>>>>>
>>>>>>>>>> But the point I was trying to make is that GUP-fast does not
>>>>>>>>>> actually care about *all* the fields being consistent (e.g.
>>>>>>>>>> access/dirty). So we could spec ptep_get_lockless() to say that
>>>>>>>>>> "all fields in the returned pte are guaranteed to be
>>>>>>>>>> self-consistent except for access and dirty information, which
>>>>>>>>>> may be inconsistent if a racing modification occurred".
>>>>>>>>>
>>>>>>>>> We *might* have KVM in the future want to check that a PTE is
>>>>>>>>> dirty, such that we can only allow dirty PTEs to be writable in a
>>>>>>>>> secondary MMU. That's not there yet, but is one thing I was
>>>>>>>>> discussing on the list recently. Buried in:
>>>>>>>>>
>>>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@xxxxxxxxxx
>>>>>>>>>
>>>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>>>>
>>>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This could mean that the access/dirty state *does* change for a
>>>>>>>>>> given page while GUP-fast is walking it, but GUP-fast *doesn't*
>>>>>>>>>> detect that change. I *think* that failing to detect this is
>>>>>>>>>> benign.
>>>>>>>>>
>>>>>>>>> I mean, HW could just set the dirty/access bit immediately after
>>>>>>>>> the check. So if HW concurrently sets the bit and we don't observe
>>>>>>>>> that change when we recheck, I think that would be perfectly fine.
>>>>>>>>
>>>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>>>>> soft-dirty or uffd-wp).
>>>>>>>>
>>>>>>>> But if you don't want to change the ptep_get_lockless() spec to
>>>>>>>> explicitly allow this (because you have the KVM use case where
>>>>>>>> false-positive dirty is problematic), then I think we are stuck
>>>>>>>> with ptep_get_lockless() as implemented for arm64 today.
>>>>>>>
>>>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>>>> ptep_get_lockless() returns a false-positive dirty bit, the PTE
>>>>>>> recheck would be able to catch that.
>>>>>>>
>>>>>>> Would that be possible?
>>>>>>
>>>>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>>>>
>>>>>>
>>>>>> Let's imagine for this example, a contpte block is 4 PTEs. Let's say
>>>>>> PTEs 0, 1, 2 and 3 initially contpte-map order-2 mTHP, FolioA. The
>>>>>> dirty state is stored in PTE0 for the contpte block, and it is dirty.
>>>>>>
>>>>>> Now let's say there are 2 racing threads:
>>>>>>
>>>>>> - ThreadA is doing a GUP-fast for PTE3
>>>>>> - ThreadB is remapping order-0 FolioB at PTE0
>>>>>>
>>>>>> (ptep_get_lockless() below is actually arm64's ptep_get() for the
>>>>>> sake of the example - today's arm64 ptep_get_lockless() can handle
>>>>>> the below correctly).
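
(Where arm64's contpte-aware ptep_get() is, simplified from
arch/arm64/include/asm/pgtable.h, roughly:)

        pte_t ptep_get(pte_t *ptep)
        {
                pte_t pte = __ptep_get(ptep);   /* a single READ_ONCE() */

                if (!pte_valid_cont(pte))
                        return pte;

                /*
                 * Contpte block: gather the access/dirty bits by reading
                 * every PTE in the block - these are the "(for a/d)" reads
                 * shown in the diagrams below.
                 */
                return contpte_ptep_get(ptep, pte);
        }
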
>>>>>>
>>>>>> ThreadA                        ThreadB
>>>>>> =======                        =======
>>>>>>
>>>>>> gup_pte_range()
>>>>>>   pte1 = ptep_get_lockless(PTE3)
>>>>>>     READ_ONCE(PTE3)
>>>>>>                                mmap(PTE0)
>>>>>>                                  clear_pte(PTE0)
>>>>>>                                    unfold(PTE0 - PTE3)
>>>>>>                                      WRITE_ONCE(PTE0, 0)
>>>>>>                                      WRITE_ONCE(PTE1, 0)
>>>>>>                                      WRITE_ONCE(PTE2, 0)
>>>>>>     READ_ONCE(PTE0) (for a/d)  << CLEAN!!
>>>>>>     READ_ONCE(PTE1) (for a/d)
>>>>>>     READ_ONCE(PTE2) (for a/d)
>>>>>>     READ_ONCE(PTE3) (for a/d)
>>>>>>   <do speculative work with pte1 content>
>>>>>>   pte2 = ptep_get_lockless(PTE3)
>>>>>>     READ_ONCE(PTE3)
>>>>>>     READ_ONCE(PTE0) (for a/d)
>>>>>>     READ_ONCE(PTE1) (for a/d)
>>>>>>     READ_ONCE(PTE2) (for a/d)
>>>>>>     READ_ONCE(PTE3) (for a/d)
>>>>>>   true = pte_same(pte1, pte2)
>>>>>>                                      WRITE_ONCE(PTE3, 0)
>>>>>>                                      TLBI
>>>>>>                                      WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>>                                      WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>>                                      WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>>                                      WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>>                                  WRITE_ONCE(PTE0, 0)
>>>>>>                                set_pte_at(PTE0, <new>)
>>>>>>
>>>>>> This example shows how a *false-negative* can be returned for the
>>>>>> dirty state, which isn't detected by the check.
>>>>>>
>>>>>> I've been unable to come up with an example where a *false-positive*
>>>>>> can be returned for dirty state without the second
>>>>>> ptep_get_lockless() noticing. In this second example, let's assume
>>>>>> everything is the same except FolioA is initially clean:
>>>>>>
>>>>>> ThreadA                        ThreadB
>>>>>> =======                        =======
>>>>>>
>>>>>> gup_pte_range()
>>>>>>   pte1 = ptep_get_lockless(PTE3)
>>>>>>     READ_ONCE(PTE3)
>>>>>>                                mmap(PTE0)
>>>>>>                                  clear_pte(PTE0)
>>>>>>                                    unfold(PTE0 - PTE3)
>>>>>>                                      WRITE_ONCE(PTE0, 0)
>>>>>>                                      WRITE_ONCE(PTE1, 0)
>>>>>>                                      WRITE_ONCE(PTE2, 0)
>>>>>>                                      WRITE_ONCE(PTE3, 0)
>>>>>>                                      TLBI
>>>>>>                                      WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>>                                      WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>>                                      WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>>                                      WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>>                                  WRITE_ONCE(PTE0, 0)
>>>>>>                                set_pte_at(PTE0, <new>)
>>>>>>                                write to FolioB - HW sets PTE0's dirty
>>>>>>     READ_ONCE(PTE0) (for a/d)  << DIRTY!!
>>>>>>     READ_ONCE(PTE1) (for a/d)
>>>>>>     READ_ONCE(PTE2) (for a/d)
>>>>>>     READ_ONCE(PTE3) (for a/d)
>>>>>>   <do speculative work with pte1 content>
>>>>>>   pte2 = ptep_get_lockless(PTE3)
>>>>>>     READ_ONCE(PTE3)            << BUT THIS IS FOR FolioB
>>>>>>     READ_ONCE(PTE0) (for a/d)
>>>>>>     READ_ONCE(PTE1) (for a/d)
>>>>>>     READ_ONCE(PTE2) (for a/d)
>>>>>>     READ_ONCE(PTE3) (for a/d)
>>>>>>   false = pte_same(pte1, pte2) << So this fails
>>>>>>
>>>>>> The only way I can see false-positive not being caught in the second
>>>>>> example is if ThreadB subsequently remaps the original folio, so you
>>>>>> have an ABA scenario. But these lockless table walkers are already
>>>>>> susceptible to that.
>>>>>>
>>>>>> I think all the same arguments can be extended to the access bit.
>>>>>>
>>>>>>
>>>>>> For me this is all getting rather subtle and difficult to reason
>>>>>> about, and even harder to spec in a comprehensible way. The best I
>>>>>> could come up with is:
>>>>>>
>>>>>> "All fields in the returned pte are guarranteed to be self-consistent except
>>>>>> for
>>>>>> access and dirty information, which may be inconsistent if a racing
>>>>>> modification
>>>>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>>>> Only
>>>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>>>
>>>>>> which is starting to sound bonkers. Personally I think we are better
>>>>>> off at this point just keeping today's arm64 ptep_get_lockless().
>>>>>
>>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>>>> would sync against GUP-fast?
>>>>
>>>> No, the broadcast is in HW. There is no IPI.
>>>
>>> Okay ...
>>>
>>> I agree that the semantics are a bit weird, but if we could get rid of
>>> ptep_get_lockless() on arm64, that would also be nice.
>>>
>>>
>>> Something I've been thinking of ... just to share what I've had in mind. The two
>>> types of users we currently have are:
>>>
>>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>>
>>> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>>>
>>> What if we had the following instead:
>>>
>>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>>
>>> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without PTL.
>>>
>>> And on arm64 let
>>>
>>> (1) ptep_get_lockless() be ptep_get()
>>>
>>> (2) ptep_get_gup_fast() be __ptep_get().
>>>
>>> That would mean that (2) would not care if another cont-pte is dirty,
>>> because we don't collect access+dirty bits. That way, we avoid any
>>> races with concurrent unfolding etc. The only "problematic" thing is
>>> that pte_mkdirty() -> set_ptes() would have to set all cont-PTEs dirty,
>>> even if any of these already is dirty.
>>
>> I don't think the "problematic" thing is actually a problem; set_ptes()
>> will always set the dirty bit to the same value for all ptes it covers
>> (and if you do set_ptes() on a partial contpte block, it will be
>> unfolded first). Although I suspect I've misunderstood what you meant
>> there...
>
> It's more code like the following that I am concerned about.
>
> if (pte_dirty()) {
>         /* Great, nothing to do */
> } else {
>         pte_mkdirty();
>         set_ptes();
>         ...
> }
OK I see, so you're worried about unnecessary unfolding that the
false-negative dirty reporting could cause? I think the best solution
there would be for the core to use the
clear_young_dirty_ptes(CYDP_CLEAR_DIRTY) API that Lance adds in his series
at [1]. That would avoid any unfolding and just dirty all contpte block(s)
touched by the range.
[1] https://lore.kernel.org/linux-mm/20240413002219.71246-1-ioworker0@xxxxxxxxx/
>
>>
>> The potential problem I see with this is that the Arm ARM doesn't specify which
>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>> randomly and this could spuriously increase your check failure rate. In reality
>> I believe most implementations will update the PTE for the address that caused
>> the TLB to be populated. But in some cases, you could have eviction (due to
>> pressure or explicit invalidation) followed by re-population due to faulting on
>> a different page of the contpte block. In this case you would see this type of
>> problem too.
>>
>> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
>> potentially false-negatives for access and dirty? Just with a much higher chance
>> of getting a false-negative. How is this helping?
>
> You are performing an atomic read like GUP-fast wants you to. So there are no
> races to worry about like on other architectures: HW might *set* the dirty bit
> concurrently, but that's just fine.
But you can still see false-negatives for access and dirty...
>
> The whole races you describe with concurrent folding/unfolding/ ... are irrelevant.
And I think I convinced myself that you will only see false-negatives with
today's arm64 ptep_get(). But an order of magnitude fewer than with your
proposal: assuming 16 ptes per contpte block with the a/d bits held in
just one of them, __ptep_get() returns false-negative a/d whenever the
target pte isn't the one holding the bits.
>
> To me that sounds ... much simpler ;) But again, just something I've been
> thinking about.
OK so this approach upgrades my "I'm fairly sure we never see false-positives"
to "we definitely never see false-positives". But it certainly increases the
quantity of false-negatives.
>
> The reuse of ptep_get_lockless() outside GUP code might not have been
> the wisest choice.
>
If you want to go down the ptep_get_gup_fast() route, you've still got to be
able to spec it, and I think it will land pretty close to my most recent stab at
respec'ing ptep_get_lockless() a couple of replies up on this thread.
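
(i.e. rendered as a kernel-doc comment, something like this - my wording:)

/*
 * ptep_get_lockless() - snapshot a PTE without holding the PTL.
 *
 * All fields in the returned pte are guaranteed to be self-consistent
 * except for access and dirty information, which may be inconsistent if a
 * racing modification occurred. False-positive access/dirty information
 * is guaranteed not to be possible if 2 calls are made and both ptes are
 * the same; only false-negative access/dirty information is possible in
 * that scenario.
 */
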
Where would your proposal leave the KVM use case? If you call it
ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
be left with ptep_get()...
Sorry this thread is getting so long. Just to summarise, I think there are
currently 3 solutions on the table:
- ptep_get_lockless() remains as is
- ptep_get_lockless() wraps ptep_get()
- ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)
Based on discussion so far, that's also the order of my preference.
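
For concreteness, on the arm64 side the three options would map roughly as
below (a sketch only - ptep_get_gup_fast() is just the name proposed in
this thread, nothing like it exists today):

        /* 1: keep what we have: loop until a consistent block is read */
        #define ptep_get_lockless(ptep)   contpte_ptep_get_lockless(ptep)

        /* 2: gather contpte a/d bits; false-negatives possible */
        #define ptep_get_lockless(ptep)   ptep_get(ptep)

        /* 3: a single READ_ONCE(); never gathers contpte a/d bits */
        #define ptep_get_gup_fast(ptep)   __ptep_get(ptep)
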
Perhaps it's useful to enumerate why we dislike the current
ptep_get_lockless()? I think it's just the potential for looping in the
face of concurrent modifications?
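
(For reference, the looping in question, heavily simplified from today's
contpte_ptep_get_lockless() in arch/arm64/mm/contpte.c -
pte_block_consistent() below is a stand-in for the real per-pte checks:)

        pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
        {
                pte_t *ptep = contpte_align_down(orig_ptep);
                pte_t orig_pte, pte;
                int i;

        retry:
                orig_pte = __ptep_get(orig_ptep);
                if (!pte_valid_cont(orig_pte))
                        return orig_pte;

                for (i = 0; i < CONT_PTES; i++) {
                        pte = __ptep_get(ptep + i);

                        /*
                         * A racing fold/unfold/invalidate makes the block
                         * self-inconsistent; any a/d bits gathered so far
                         * can't be trusted, so start the whole read again.
                         */
                        if (!pte_block_consistent(orig_pte, pte, i))
                                goto retry;

                        if (pte_dirty(pte))
                                orig_pte = pte_mkdirty(orig_pte);
                        if (pte_young(pte))
                                orig_pte = pte_mkyoung(orig_pte);
                }

                return orig_pte;
        }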