Re: [PATCH v2 3/4] mm/memory: split non-tlb flushing part from zap_page_range_single()

From: Lorenzo Stoakes
Date: Wed Apr 09 2025 - 06:41:15 EST


On Tue, Apr 08, 2025 at 01:12:48PM -0700, SeongJae Park wrote:
> On Tue, 8 Apr 2025 14:29:41 +0100 Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote:
>
> > On Fri, Apr 04, 2025 at 02:06:59PM -0700, SeongJae Park wrote:
> > > Some zap_page_range_single() callers, such as [process_]madvise() with
> > > MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
> > > zap_page_range_single() flushes the tlb on each invocation. For such
> > > batched tlb flushing usage, split out the body of zap_page_range_single()
> > > except for the mmu_gather object initialization and the flushing of
> > > gathered tlb entries.
> > >
> > > To avoid hugetlb page allocation failures from concurrent page faults,
> > > the tlb flush should be done before hugetlb faults are unlocked, though.
> > > For the hugetlb vma case, do the flush and the unlock inside the
> > > split-out function, in that order. Refer to commit 2820b0f09be9
> > > ("hugetlbfs: close race between MADV_DONTNEED and page fault") for more
> > > details about the concurrent faults' page allocation failure problem.
> >
> > Good lord, I really hate how we do hugetlb.
> >
> > >
> > > Signed-off-by: SeongJae Park <sj@xxxxxxxxxx>
> > > ---
> > > mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++----------
> > > 1 file changed, 39 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8669b2c981a5..8c9bbb1a008c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1989,36 +1989,65 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > > mmu_notifier_invalidate_range_end(&range);
> > > }
> > >
> > > -/**
> > > - * zap_page_range_single - remove user pages in a given range
> > > +/*
> > > + * notify_unmap_single_vma - remove user pages in a given range
> > > + * @tlb: pointer to the caller's struct mmu_gather
> > > * @vma: vm_area_struct holding the applicable pages
> > > - * @address: starting address of pages to zap
> > > - * @size: number of bytes to zap
> > > + * @address: starting address of pages to remove
> > > + * @size: number of bytes to remove
> > > * @details: details of shared cache invalidation
> > > *
> > > - * The range must fit into one VMA.
> > > + * @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for
> > > + * hugetlb, @tlb is flushed and re-initialized by this function.
> > > */
> > > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > +static void notify_unmap_single_vma(struct mmu_gather *tlb,
> > > + struct vm_area_struct *vma, unsigned long address,
> > > unsigned long size, struct zap_details *details)
> >
> > Don't love this name. It seems to imply that the primary thing here is the MMU
> > notification bit.
> >
> > This is like unmap_vmas() but for a single VMA - that is, it contains the
> > logic unmap_vmas() does for the mmu notifier stuff and hugetlb stuff (vom
> > in mouth), deferring the heavy lifting to unmap_single_vma().
> >
> > I think it might be better to just go with the brainless
> > '__zap_page_range_single()' here, honestly. Then we at least reduce the
> > mess of confusion in the function naming.
> >
> > Of course you intend to un-static this shortly... so maybe that's not so
> > great.
> >
> > zap_page_range_single_batched()?
>
> Sounds good to me. I will rename the function accordingly, unless we get
> other opinions.
>
> >
> > > {
> > > const unsigned long end = address + size;
> > > struct mmu_notifier_range range;
> > > - struct mmu_gather tlb;
> > > +
> > > + VM_WARN_ON_ONCE(!tlb);
> >
> > Maybe pedantic, but we probably want to ensure not only that tlb is set but
> > also that tlb_gather_mmu() has been performed upon it. Then again, that's
> > probably a bit much given this is a static function called only from
> > zap_page_range_single().
> >
> > Hm, actually you intend to un-static this shortly, right? So I guess in
> > that case we do want some kind of check like this, perhaps abstracting this
> > bit of tlb_flush_mmu_tlbonly():
> >
> > if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
> > tlb->cleared_puds || tlb->cleared_p4ds))
> > return;
>
> This code checks whether there are tlb entries to flush.
>
> In this function, a valid 'tlb' can either have such entries or not. So this
> code could be a good reference, but couldn't be used as is. I think most
> fields of mmu_gather can be arbitrarily updated depending on the use case,
> so making a check that is both simple and complete wouldn't be easy.
>
> I think tlb->mm shouldn't change after initialization, though. So
> (tlb->mm == vma->vm_mm) could be a good enough additional check.

Yeah, this is not _that_ big a deal, just maybe a nice-to-have, but I
appreciate that it's sort of blurry.

Don't feel obliged to add the separated-out check; mm == vma->vm_mm suffices
(though again, that is also not a big deal! :)

>
> >
> > Into a separate is_tlb_flushable() helper function or something. Then this
> > warning can become:
> >
> > /* tlb should be initialised for a new gather operation. */
> > VM_WARN_ON_ONCE(!tlb || is_tlb_flushable(tlb));
>
> If you agree (tlb->mm == vma->vm_mm) is a sufficient check, this could be:
>
> VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm)

Sure, that's fine.
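
(For posterity, if we ever did want the fuller check, the factored-out
helper I was imagining is only a few lines - a rough sketch only, with
is_tlb_flushable() being a name made up here:

	/*
	 * Sketch: true if the gather has accumulated anything that
	 * tlb_flush_mmu_tlbonly() would actually flush.
	 */
	static inline bool is_tlb_flushable(struct mmu_gather *tlb)
	{
		return tlb->freed_tables || tlb->cleared_ptes ||
		       tlb->cleared_pmds || tlb->cleared_puds ||
		       tlb->cleared_p4ds;
	}

	/* tlb should be freshly initialised for a new gather operation. */
	VM_WARN_ON_ONCE(!tlb || is_tlb_flushable(tlb));

Though as you point out, in batched usage the gather can legitimately
carry entries from a previous VMA, so the tlb->mm check is the right
call.)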

>
> >
> > >
> > > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> > > address, end);
> > > hugetlb_zap_begin(vma, &range.start, &range.end);
> >
> > Is it a problem that we invoke this function now _after_ tlb_gather_mmu()
> > has begun?
>
> I think it is not a problem, since I can find no requirement on the
> ordering. Please let me know if I'm missing something.
>
> >
> > > - tlb_gather_mmu(&tlb, vma->vm_mm);
> > > update_hiwater_rss(vma->vm_mm);
> > > mmu_notifier_invalidate_range_start(&range);
> > > /*
> > > * unmap 'address-end' not 'range.start-range.end' as range
> > > * could have been expanded for hugetlb pmd sharing.
> > > */
> >
> > Oh GOD I HATE THAT WE HANDLE HUGETLB THIS WAY!!!
> >
> > Anything where you have to open-code concerns about how a specific use case
> > uses something like this is just asking to be broken.
> >
> > Obviously this is nothing to do with your series and is just a grumble to
> > the sky, but still! *shakes fist at cloud*
> >
> > > - unmap_single_vma(&tlb, vma, address, end, details, false);
> > > + unmap_single_vma(tlb, vma, address, end, details, false);
> > > mmu_notifier_invalidate_range_end(&range);
> > > + if (is_vm_hugetlb_page(vma)) {
> > > + /*
> > > + * flush tlb and free resources before hugetlb_zap_end(), to
> > > + * avoid concurrent page faults' allocation failure
> >
> > Nit: add a full stop (or for those residing in North America 'period' :>)
> > at the end of this sentence. This is very bikesheddy, I know, and I can
> > only apologise.
>
> Thank you for kindly suggesting this; I will update as you recommended in
> the next spin.
>
> >
> > > + */
> > > + tlb_finish_mmu(tlb);
> > > + hugetlb_zap_end(vma, details);
> > > + tlb_gather_mmu(tlb, vma->vm_mm);
> > > + }
> >
> > OK, so as far as I can tell, after this change, we are still ensuring that
> > the tlb is flushed _prior_ to the invocation of hugetlb_zap_end(); only
> > now, in order to later export this function, we need to do it here
> > instead... I mean this is horrid, but it's sort of unavoidable really.
> >
> > So I guess this just undoes the optimisation in the case of hugetlb, which
> > probably doesn't really matter all that much at huge page size
> > anyway... and plus it's hugetlb so.
>
> Yes, that's the intended behavior.
>
> >
> > Yeah fine, just kind of horrid.
>
> I agree this is not clean, and thank you for understanding the situation.
> Hopefully we will find a chance to revisit this for cleanup later.

Sure!
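
(And just so we're aligned on the end state: the batched usage I expect
from the later patch is roughly the below - a sketch only, using the
zap_page_range_single_batched() name agreed above, with the caller-side
range clamping hand-waved:

	struct mmu_gather tlb;
	VMA_ITERATOR(vmi, mm, start);
	struct vm_area_struct *vma;

	tlb_gather_mmu(&tlb, mm);
	for_each_vma_range(vmi, vma, end) {
		/* clamp [addr, addr + size) to this vma - elided */
		zap_page_range_single_batched(&tlb, vma, addr, size, NULL);
	}
	/*
	 * A single flush for the whole batch; hugetlb VMAs have already
	 * been flushed early inside the helper, as above.
	 */
	tlb_finish_mmu(&tlb);

So non-hugetlb VMAs get one flush across the entire operation, which is
the point of the series.)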

>
> >
> > > +}
> > > +
> > > +/**
> > > + * zap_page_range_single - remove user pages in a given range
> > > + * @vma: vm_area_struct holding the applicable pages
> > > + * @address: starting address of pages to zap
> > > + * @size: number of bytes to zap
> > > + * @details: details of shared cache invalidation
> > > + *
> > > + * The range must fit into one VMA.
> > > + */
> > > +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > + unsigned long size, struct zap_details *details)
> > > +{
> > > + struct mmu_gather tlb;
> > > +
> > > + tlb_gather_mmu(&tlb, vma->vm_mm);
> > > + notify_unmap_single_vma(&tlb, vma, address, size, details);
> > > tlb_finish_mmu(&tlb);
> > > - hugetlb_zap_end(vma, details);
> > > }
> > >
> > > /**
> > > --
> > > 2.39.5
>
> Thank you for taking the time to review this. If you have no other
> opinions, I will make the next spin as mentioned above.

No problem, yes, all fine for a respin!

>
>
> Thanks,
> SJ