Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

From: Jann Horn
Date: Mon Oct 24 2022 - 17:48:15 EST


On Mon, Oct 24, 2022 at 10:01 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote:
> > On 10/22/22 04:14, Peter Zijlstra wrote:
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
> > >
> > > #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> > > /*
> > > - * WARNING: only to be used in the get_user_pages_fast() implementation.
> > > - *
> > > - * With get_user_pages_fast(), we walk down the pagetables without taking any
> > > - * locks. For this we would like to load the pointers atomically, but sometimes
> > > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What
> > > - * we do have is the guarantee that a PTE will only either go from not present
> > > - * to present, or present to not present or both -- it will not switch to a
> > > - * completely different present page without a TLB flush in between; something
> > > - * that we are blocking by holding interrupts off.
> > > + * For walking the pagetables without holding any locks. Some architectures
> > > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> > > + * instructions. We are guaranteed that a PTE will only either go from not
> > > + * present to present, or present to not present -- it will not switch to a
> > > + * completely different present page without a TLB flush inbetween; which we
> > > + * are blocking by holding interrupts off.
> >
> >
> > This is getting interesting. My latest understanding of this story is
> > that both the "before" and "after" versions of that comment are
> > incorrect! Because, as Jann Horn noticed recently [1], there might not
> > be any IPIs involved in a TLB flush, if x86 is running under a
> > hypervisor, and that breaks the chain of reasoning here.
>
> That mail doesn't really include enough detail. The way x86 HV TLB
> flushing is supposed to work is by making use of
> MMU_GATHER_RCU_TABLE_FREE. Specifically, something like:
>
>
> vCPU0 vCPU1
>
> tlb_gather_mmut(&tlb, mm);
>
> ....
>
> local_irq_disable();
> ... starts page-table walk ...
>
> <schedules out; sets KVM_VCPU_PREEMPTED>
>
> tlb_finish_mmu(&tlb)
> ...
> kvm_flush_tlb_multi()
> if (state & KVM_VCPU_PREEMPTED)
> if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB))
> __cpumask_clear_cpu(cpu, flushmask);
>
>
> tlb_remove_table_sync_one() / call_rcu()
>
>
> <schedules back in>
>
> ... continues page-table walk ...
> local_irq_enable();
>
> If mmu gather is forced into tlb_remove_talbe_sync_one() (by memory
> pressure), then you've got your IPI back, otherwise it does call_rcu()
> and RCU itself will need vCPU0 to enable IRQs in order to make progress.
>
> Either way around, the actual freeing of the pages is delayed until the
> page-table walk is finished.
>
> What am I missing?

Unless I'm completely misunderstanding what's going on here, the whole
"remove_table" thing only happens when you "remove a table", meaning
you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
logic.