Re: [PATCH] mm: split thp synchronously on MADV_DONTNEED

From: Yang Shi
Date: Mon Nov 29 2021 - 17:41:23 EST


On Fri, Nov 26, 2021 at 1:17 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> >>
> >> Thanks for making me rerun this and yes indeed I had a very silly bug in the
> >> benchmark code (i.e. madvise the same page for the whole loop) and this is
> >> indeed several times slower than without the patch (sorry David for misleading
> >> you).
>
> No worries, BUGs happen :)
>
> >>
> >> To better understand what is happening, I profiled the benchmark:
> >>
> >> - 31.27% 0.01% dontneed [kernel.kallsyms] [k] zap_page_range_sync
> >> - 31.27% zap_page_range_sync
> >> - 30.25% split_local_deferred_list
> >> - 30.16% split_huge_page_to_list
> >> - 21.05% try_to_migrate
> >> + rmap_walk_anon
> >> - 7.47% remove_migration_ptes
> >> + 7.34% rmap_walk_locked
> >> + 1.02% zap_page_range_details
> >
> > Makes sense, thanks for verifying it, Shakeel. I forgot it'll also walk
> > itself.
> >
> > I believe this effect will be exaggerated when the mapping is shared,
> > e.g. shmem file thp between processes. What's worse is that when one process
> > DONTNEED one 4k page, all the rest mms will need to split the huge pmd without
> > even being noticed, so that's a direct suffer from perf degrade.
>
> Would this really apply to MADV_DONTNEED on shmem, and would deferred
> splitting apply on shmem? I'm constantly confused about shmem vs. anon,
> but I would have assumed that shmem is fd-based and we wouldn't end up
> in rmap_walk_anon. For shmem, the pagecache would contain the THP which
> would stick around and deferred splits don't even apply.

The deferred split is anon THP only, it doesn't apply to shmem THP.

For the rmap walk, there are two ramp walks for anon THP, but just one
for shmem THP. Both needs one rmap walk to unmap the THP before doing
actual split, but anon THP needs another rmap walk to reinstall PTEs
for still mapped pages, however this is not needed for shmem pages
since they could be reached via page cache.

>
> But again, I'm constantly confused so I'd love to be enlighted.
>
> >
> >>
> >> The overhead is not due to copying page flags but rather due to two rmap walks.
> >> I don't think this much overhead is justified for current users of MADV_DONTNEED
> >> and munmap. I have to rethink the approach.
>
> Most probably not.
>
> >
> > Some side notes: I digged out the old MADV_COLLAPSE proposal right after I
> > thought about MADV_SPLIT (or any of its variance):
> >
> > https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@xxxxxxxxxx/
> >
> > My memory was that there's some issue to be solved so that was blocked, however
> > when I read the thread it sounds like the list was mostly reaching a consensus
> > on considering MADV_COLLAPSE being beneficial. Still copying DavidR in case I
> > missed something important.
> >
> > If we think MADV_COLLAPSE can help to implement an userspace (and more
> > importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of
> > kcompactd, perhaps.
> >
> > That's probably a bit off topic of this specific discussion on the specific use
> > case, but so far it seems all reasonable and discussable.
>
> User space can trigger a split manually using some MADV hackery. But it
> can only be used for the use case here, where we actually want to zap a
> page.
>
> 1. MADV_FREE a single 4k page in the range. This will split the PMD->PTE
> and the compound page.
> 2. MADV_DONTNEED either the complete range or the single 4k page.
>
> --
> Thanks,
>
> David / dhildenb
>