Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
From: Ryan Roberts
Date: Mon Apr 15 2024 - 11:17:52 EST
On 15/04/2024 15:58, David Hildenbrand wrote:
> On 15.04.24 16:34, Ryan Roberts wrote:
>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>>>> "lockless walkers that never take the PTL".
>>>>>>>>
>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the
>>>>>>>> TLBIs). But
>>>>>>>> you make that clear further down.
>>>>>>>
>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page
>>>>>>> tables
>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>
>>>>>>>>> I think what has to happen is:
>>>>>>>>>
>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>>>>> there
>>>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>>>
>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>>>
>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>>>> about a
>>>>>>>> class of users. I agree with the statement.
>>>>>>>
>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>>>>> fault handlers. Well, mostly "not GUP".
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>>>
>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if
>>>>>>>>> the
>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB
>>>>>>>>> flushes
>>>>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>>>>> bits,
>>>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>>>
>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>>>> ptep_get_lockless() implementation.
>>>>>>>>
>>>>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>>>>> about
>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>>>> guarranteed
>>>>>>>> to be self-consistent except for access and dirty information, which may be
>>>>>>>> inconsistent if a racing modification occured".
>>>>>>>
>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such
>>>>>>> that
>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>>>> there
>>>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>>>
>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@xxxxxxxxxx
>>>>>>>
>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>>
>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>
>>>>>>>>
>>>>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>>>>> while
>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
>>>>>>>> *think*
>>>>>>>> that failing to detect this is benign.
>>>>>>>
>>>>>>> I mean, HW could just set the dirty/access bit immediately after the
>>>>>>> check. So
>>>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>>>> recheck, I think that would be perfectly fine.
>>>>>>
>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>>> soft-dirty or uffd-wp).
>>>>>>
>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>>>> allow
>>>>>> this (because you have the KVM use case where false-positive dirty is
>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as
>>>>>> implemented
>>>>>> for arm64 today.
>>>>>
>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>>>> would be able to catch that.
>>>>>
>>>>> Would that be possible?
>>>>
>>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>>
>>>>
>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
>>>> stored in
>>>> PTE0 for the contpte block, and it is dirty.
>>>>
>>>> Now let's say there are 2 racing threads:
>>>>
>>>> - ThreadA is doing a GUP-fast for PTE3
>>>> - ThreadB is remapping order-0 FolioB at PTE0
>>>>
>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>>>
>>>> ThreadA ThreadB
>>>> ======= =======
>>>>
>>>> gup_pte_range()
>>>> pte1 = ptep_get_lockless(PTE3)
>>>> READ_ONCE(PTE3)
>>>> mmap(PTE0)
>>>> clear_pte(PTE0)
>>>> unfold(PTE0 - PTE3)
>>>> WRITE_ONCE(PTE0, 0)
>>>> WRITE_ONCE(PTE1, 0)
>>>> WRITE_ONCE(PTE2, 0)
>>>> READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>> READ_ONCE(PTE1) (for a/d)
>>>> READ_ONCE(PTE2) (for a/d)
>>>> READ_ONCE(PTE3) (for a/d)
>>>> <do speculative work with pte1 content>
>>>> pte2 = ptep_get_lockless(PTE3)
>>>> READ_ONCE(PTE3)
>>>> READ_ONCE(PTE0) (for a/d)
>>>> READ_ONCE(PTE1) (for a/d)
>>>> READ_ONCE(PTE2) (for a/d)
>>>> READ_ONCE(PTE3) (for a/d)
>>>> true = pte_same(pte1, pte2)
>>>> WRITE_ONCE(PTE3, 0)
>>>> TLBI
>>>> WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>> WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>> WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>> WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>> WRITE_ONCE(PTE0, 0)
>>>> set_pte_at(PTE0, <new>)
>>>>
>>>> This example shows how a *false-negative* can be returned for the dirty state,
>>>> which isn't detected by the check.
>>>>
>>>> I've been unable to come up with an example where a *false-positive* can be
>>>> returned for dirty state without the second ptep_get_lockless() noticing. In
>>>> this second example, let's assume everything is the same execpt FolioA is
>>>> initially clean:
>>>>
>>>> ThreadA ThreadB
>>>> ======= =======
>>>>
>>>> gup_pte_range()
>>>> pte1 = ptep_get_lockless(PTE3)
>>>> READ_ONCE(PTE3)
>>>> mmap(PTE0)
>>>> clear_pte(PTE0)
>>>> unfold(PTE0 - PTE3)
>>>> WRITE_ONCE(PTE0, 0)
>>>> WRITE_ONCE(PTE1, 0)
>>>> WRITE_ONCE(PTE2, 0)
>>>> WRITE_ONCE(PTE3, 0)
>>>> TLBI
>>>> WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>> WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>> WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>> WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>> WRITE_ONCE(PTE0, 0)
>>>> set_pte_at(PTE0, <new>)
>>>> write to FolioB - HW sets PTE0's dirty
>>>> READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>> READ_ONCE(PTE1) (for a/d)
>>>> READ_ONCE(PTE2) (for a/d)
>>>> READ_ONCE(PTE3) (for a/d)
>>>> <do speculative work with pte1 content>
>>>> pte2 = ptep_get_lockless(PTE3)
>>>> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB
>>>> READ_ONCE(PTE0) (for a/d)
>>>> READ_ONCE(PTE1) (for a/d)
>>>> READ_ONCE(PTE2) (for a/d)
>>>> READ_ONCE(PTE3) (for a/d)
>>>> false = pte_same(pte1, pte2) << So this fails
>>>>
>>>> The only way I can see false-positive not being caught in the second example is
>>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
>>>> But these lockless table walkers are already suseptible to that.
>>>>
>>>> I think all the same arguments can be extended to the access bit.
>>>>
>>>>
>>>> For me this is all getting rather subtle and difficult to reason about and even
>>>> harder to spec in a comprehensible way. The best I could come up with is:
>>>>
>>>> "All fields in the returned pte are guarranteed to be self-consistent except
>>>> for
>>>> access and dirty information, which may be inconsistent if a racing
>>>> modification
>>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>> Only
>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>
>>>> which is starting to sound bonkers. Personally I think we are better off at
>>>> this
>>>> point, just keeping today's arm64 ptep_get_lockless().
>>>
>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>> would sync against GUP-fast?
>>
>> No, the broadcast is in HW. There is no IPI.
>
> Okay ...
>
> I agree that the semantics are a bit weird, but if we could get rid of
> ptep_get_lockless() on arm64, that would also be nice.
>
>
> Something I've been thinking of ... just to share what I've had in mind. The two
> types of users we currently have are:
>
> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>
> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>
> What if we had the following instead:
>
> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>
> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
> PTL.
>
> And on arm64 let
>
> (1) ptep_get_lockless() be ptep_get()
>
> (2) ptep_get_gup_fast() be __ptep_get().
>
> That would mean, that (2) would not care if another cont-pte is dirty, because
> we don't collect access+dirty bits. That way, we avoid any races with concurrent
> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes()
> would have to set all cont-PTEs dirty, even if any of these already is dirty.
I don't think the "problematic" thing is actually a problem; set_ptes() will
always set the dirty bit to the same value for all ptes it covers (and if you do
set_ptes() on a partial contpte block, it will be unfolded first). Although I
suspect I've misunderstood what you meant there...
The potential problem I see with this is that the Arm ARM doesn't specify which
PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
randomly and this could spuriously increase your check failure rate. In reality
I believe most implementations will update the PTE for the address that caused
the TLB to be populated. But in some cases, you could have eviction (due to
pressure or explicit invalidation) followed by re-population due to faulting on
a different page of the contpte block. In this case you would see this type of
problem too.
But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
potentially false-negatives for access and dirty? Just with a much higher chance
of getting a false-negative. How is this helping?