Re: [PATCH RFC v2 1/2] mm: Don't skip swap entry even if zap_details specified

From: Peter Xu
Date: Wed Jan 12 2022 - 08:26:17 EST


On Wed, Jan 12, 2022 at 09:18:09PM +0800, Peter Xu wrote:
> On Sat, Jan 08, 2022 at 05:19:04PM -0800, Hugh Dickins wrote:
> > On Mon, 15 Nov 2021, Peter Xu wrote:
> >
> > > This check has existed since the 1st git commit of the Linux repository, but
> > > at that time there was no page migration yet, so I think it was okay.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> > helps with the history. Yes, the check was okay back then,
> > but a lot of changes have come in since: I'll tell more of those below.
>
> Thanks for looking at this. By the way, the link is greatly helpful. It's
> always good to be able to read into the history.
>
> >
> > You are absolutely right to clean this up and fix the bugs that
> > have crept in, but I think the patch itself is not quite right yet.
>
> Do you mean the pmd path for checking the mapping? If so I agree, and I'll add
> that in v2 (even if, as you pointed out and analyzed below, that shouldn't be a
> real bug, iiuc).
>
> Let me know if I missed anything else..
>
> >
> > >
> > > With page migration enabled, it should logically be possible that we zap some
> > > shmem pages during migration.
> >
> > Yes.
> >
> > > When that happens, IIUC the old code could get
> > > the MM_SHMEMPAGES RSS counter wrong, because we will zap the ptes without
> > > decreasing the counters for the migrating entries. I have no unit test to
> > > prove it, as I don't know an easy way to trigger this condition, though.
> >
> > In the no-details case, yes, it does look like that. I ought to try
> > and reproduce that, but responding to mails seems more important.
>
> Please let me know if you know how to reproduce it, since I don't have a real
> reproducer yet.
>
> What I can do, though, is apply the below patch to current linux master:
>
> =============
> diff --git a/mm/memory.c b/mm/memory.c
> index 8f1de811a1dc..51fe02a22ea9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1383,8 +1383,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  		}
>
>  		/* If details->check_mapping, we leave swap entries. */
> -		if (unlikely(details))
> +		if (unlikely(details)) {
> +			WARN_ON_ONCE(is_migration_entry(entry));
>  			continue;
> +		}
>
>  		if (!non_swap_entry(entry))
>  			rss[MM_SWAPENTS]--;
> =============
>
> Then I can easily trigger it with the help of the attached test program
> (zap-migrate.c).

(Attaching for real; also copying Matthew)

>
> IMHO it won't really reproduce, because it seems all relevant shmem operations
> (e.g. hole punch, unmap..) take the page lock, so they won't really race with
> migration (which needs the page lock too), as mentioned in the previous cover
> letters when I was still working on the old versions of this. But it's possible
> that I missed something.
>
> The WARN_ON_ONCE() can trigger only because of the fast path in hole punching,
> where we have the pre-unmap without the lock:
>
> 		if ((u64)unmap_end > (u64)unmap_start)
> 			unmap_mapping_range(mapping, unmap_start,
> 					    1 + unmap_end - unmap_start, 0);
> 		shmem_truncate_range(inode, offset, offset + len - 1);
>
> But that will be fixed up right afterwards in shmem_truncate_range(), afaict,
> which does take the lock. Hence we have a small window where we can at least
> trigger the warning above, even if it won't really mess up the accounting in
> the end.
>
> >
> > >
> > > Besides, the optimization itself is already confusing to me in a few ways:
> >
> > It is confusing and unnecessary and wrong, I agree.
> >
> > >
> > > - The wording "skip swap entries" is confusing, because we're not skipping all
> > > swap entries - we handle device private/exclusive pages before that.
> >
> > I'm entirely ignorant of device pages, so cannot comment on your 2/2,
> > but of course it's good if you can bring the cases closer together.
> >
> > >
> > > - The skip behavior is enabled as long as a zap_details pointer is passed in.
> > > It's very hard for a new zap caller to figure that out, because it's unclear
> > > why we should skip swap entries when we have zap_details specified.
> >
> > History below will clarify that.
> >
> > >
> > > - With modern systems, especially performance critical use cases, swap
> > > entries should be rare, so I doubt the usefulness of this optimization
> > > since it should be on a slow path anyway.
> > >
> > > - It is not aligned with what we do with huge pmd swap entries, where in
> > > zap_huge_pmd() we'll do the accounting unconditionally.
> >
> > The patch below does not align with what's done in zap_huge_pmd() either;
> > but I think zap_huge_pmd() is okay without "details" because its only swap
> > entries are migration entries, and we do not use huge pages when COWing
> > from file huge pages.
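
(Side note for reference: the pmd level indeed doesn't consult details at all
there.  Roughly, from memory -- this is a paraphrase of the non-present
(migration entry) branch of zap_huge_pmd(), not a verbatim quote:

	swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
	struct page *page = pfn_swap_entry_to_page(entry);

	/* No zap_details check here; the accounting is unconditional */
	add_mm_counter(tlb->mm, mm_counter(page), -HPAGE_PMD_NR);

so the pte level has been the odd one out.)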
> >
> > >
> > > This patch drops that trick, so we handle swap ptes coherently. Meanwhile we
> > > should do the same mapping check upon migration entries too.
> > >
> > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > > ---
> > > mm/memory.c | 6 ++----
> > > 1 file changed, 2 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f1de811a1dc..e454f3c6aeb9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > >  			continue;
> > >  		}
> > >
> > > -		/* If details->check_mapping, we leave swap entries. */
> > > -		if (unlikely(details))
> > > -			continue;
> > > -
> >
> > Good.
> >
> > >  		if (!non_swap_entry(entry))
> > >  			rss[MM_SWAPENTS]--;
> > >  		else if (is_migration_entry(entry)) {
> > >  			struct page *page;
> > >
> > >  			page = pfn_swap_entry_to_page(entry);
> > > +			if (unlikely(zap_skip_check_mapping(details, page)))
> > > +				continue;
> >
> > Good; though I don't think it's worth an "unlikely", and I say again,
>
> Sure, I'll drop this "unlikely". Meanwhile, shall we drop the rest of the
> "unlikely"s around this check too?
>
> > more forcefully, that "zap_skip_check_mapping" is a terrible name.
> >
> > David suggested naming its inversion should_zap_page(), yes, that's
> > much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> > but should_zap_page() for the inversion sounds good to me.
>
> Ah sure, I'll rename it to should_zap_page() in a new patch before this.
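
(For the record, what I have in mind is just the trivial inversion of the
current helper -- a draft sketch only:

	/* Draft: invert zap_skip_check_mapping() so call sites read naturally */
	static inline bool should_zap_page(struct zap_details *details,
					   struct page *page)
	{
		return !zap_skip_check_mapping(details, page);
	}

so the hunk above would become "if (!should_zap_page(details, page)) continue;",
without the "unlikely".)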
>
> >
> > And I'm pleased to see in one of Matthew's diffs that he intends to
> > bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> > from include/linux/mm.h.
>
> Yeah, it's only used in memory.c. I put it there initially because I wanted
> zap_details users to be able to see what that check_mapping is all about, but
> maybe that's not necessary.
>
> If you or Matthew could provide a link to that patch, I'll be happy to pick
> that up too with this series. Or I can also do nothing assuming Matthew will
> handle it elsewhere.
>
> >
> > >  			rss[mm_counter(page)]--;
> > >  		}
> > >  		if (unlikely(!free_swap_and_cache(entry)))
> > > --
> > > 2.32.0
> >
> > History. zap_details came in 2.6.6, and it was mostly to manage
> > truncation on the non-linear vmas we had at that time (remap_file_pages
> > syscall packing non-sequential offsets into a single vma, with pte_file
> > entries), where index could not be deduced from position of pte in vma:
> > truncation range had to be passed down in zap_details; and an madvise
> > needed it too, so it could not be private to mm/memory.c then.
> >
> > But at the same time, I added the even_cows argument to
> > unmap_mapping_range(), to distinguish truncation (which POSIX requires
> > to unmap even COWed pages) from invalidation (for page cache coherence,
> > which shouldn't touch private COWs). However, there appear to be no
> > users of that in 2.6.6, though I wouldn't have added that complication
> > just for the fun of confusing everyone: best guess would be that there
> > was parallel development, and the use for !even_cows got removed in
> > the very same release that it was being added.
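
(Side note: if I remember the current code right, the flag still ends up as
exactly that mapping check today -- a paraphrase of unmap_mapping_pages(),
not a verbatim quote, and the field name may differ:

	struct zap_details details = { };

	/* !even_cows arms the mapping filter, so private COWs survive */
	details.check_mapping = even_cows ? NULL : mapping;

which is the check_mapping that the "we leave swap entries" comment refers to.)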
> >
> > (PageAnon was brand new in 2.6.6: maybe it could have been used instead
> > of comparing details->check_mapping, or maybe there's some other case
> > I've forgotten which actually needs the exact mapping check.)
> >
> > Eventually a few drivers came to use unmap_shared_mapping_range(),
> > the !even_cows caller; but more to the point, hole-punching came in,
> > and I felt very strongly that hole-punching on a file had no right
> > to delete private user data. So then !even_cows became useful.
> >
> > IIRC, I've seen Linus say he *detests* zap_details. It had much better
> > justification in the days of non-linear, and I was sad to add
> > single_page to it quite recently; but hope that can go away later
> > (when try_to_unmap_one() is properly extended to THPs).
> >
> > Now, here's what may clear up a lot of the confusion.
> > The 2.6.7 zap_pte_range() got a line at its head
> > 	if (details && !details->check_mapping && !details->nonlinear_vma)
> > 		details = NULL;
> > which paired with the
> > 	/*
> > 	 * If details->check_mapping, we leave swap entries;
> > 	 * if details->nonlinear_vma, we leave file entries.
> > 	 */
> > 	if (unlikely(details))
> > 		continue;
> > lower down. I haven't followed up the actual commits, but ChangeLog-2.6.7
> > implies that 2.6.6 had a "details = NULL;" placed elsewhere but buggily.
> > In 2.6.12 it moved from zap_pte_range() to unmap_page_range().
> > It was intended, not so much to optimize, as to simplify the flow;
> > but in fact has caused all this confusion.
> >
> > When Kirill discontinued non-linear mapping support in 4.0 (no tears
> > were shed and) the nonlinear_vma comment above got deleted, leaving
> > just the then more puzzling check_mapping comment.
> >
> > Then in 4.6 the "details = NULL;" seems to have got deleted as part of
> > aac453635549 ("mm, oom: introduce oom reaper"), which added some new
> > details (ignore_dirty and check_swap_entries); which got reverted in
> > 4.11, but apparently the "details = NULL;" not restored.
> >
> > My stamina is suddenly exhausted, without actually pinpointing a commit
> > for "Fixes:" in your eventual cleanup. Sorry, I've had enough!
>
> Yeah, digging out all these trivial details is a pain in most cases; thanks
> for already digging most of it out of the mist.
>
> That's really what I hope this patch can help with: not only because the uffd
> work will rely on it, but also because we should resolve this early (we do use
> wordings like "technical debt" sometimes; I think this is the same thing in a
> different form), while the above "history.git" is still functional so we can
> still reference it.
>
> With your help and the history.git I can try a better commit message, because
> obviously some of the content needs amending (it's not a pure optimization at
> all), but I assume the patch content will be mostly the same, with the tweaks
> applied.
>
> As stated, so far I don't know of any real reproducer, so maybe it's not a
> real issue in any production system? Maybe that'll make it a bit easier,
> because then we don't strictly require a Fixes tag (which could be another
> hard question to answer besides the issue itself).
>
> Thanks,
>
> --
> Peter Xu

--
Peter Xu
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
#include <stdio.h>
#include <numaif.h>
#include <unistd.h>
#include <sys/mman.h>
#include <assert.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>

#define BSIZE (1UL << 28) /* 256MB */
#define DELAY 30 /* Something bigger than 10 */
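/*
 * Rough idea of the reproducer (see the mail above): the main thread keeps
 * calling move_pages() to migrate each shmem page to the other NUMA node,
 * while the evictor thread concurrently punches holes over the same pages
 * with fallocate(FALLOC_FL_PUNCH_HOLE), so that the zap from the hole punch
 * can race with the migration entries.  check() then dumps Pss_Shmem from
 * smaps_rollup, which should end up as 0 kB if the accounting stays right.
 */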

int page_size, npages, *status, *nodes;
pthread_t thread;
char *buffer;
void **pages;
sem_t sem;
char str[4096];
int shmem_fd;

void *evictor_thread(void *data)
{
	int i, ret;

	printf("evictor created\n");

	for (i = 0; i < npages; i++) {
		sem_wait(&sem);
		usleep(random() % DELAY);
		/* Punch a hole over page i while it may be under migration */
		ret = fallocate(shmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				i * page_size, page_size);
		assert(ret == 0);
	}

	printf("evictor quit\n");

	return NULL;
}

void do_init(void)
{
	int i, ret;

	srand(time(NULL));

	sem_init(&sem, 0, 0);
	page_size = getpagesize();
	npages = BSIZE / page_size;

	shmem_fd = memfd_create("zap-migrate", 0);
	assert(shmem_fd >= 0);

	ret = ftruncate(shmem_fd, BSIZE);
	assert(ret == 0);

	buffer = mmap(NULL, BSIZE, PROT_READ | PROT_WRITE,
		      MAP_SHARED, shmem_fd, 0);
	assert(buffer != MAP_FAILED);

	pages = calloc(npages, sizeof(void *));
	status = calloc(npages, sizeof(int));
	nodes = calloc(npages, sizeof(int));
	assert(pages && status && nodes);

	for (i = 0; i < npages; i++) {
		*(buffer + page_size * i) = i;
	}

	ret = pthread_create(&thread, NULL, evictor_thread, NULL);
	assert(ret == 0);
}

void do_quit(void)
{
	free(pages);
	free(status);
	free(nodes);
	munmap(buffer, BSIZE);
	close(shmem_fd);
}

void move_all_to_node(int node)
{
	long ret;
	int i;

	for (i = 0; i < npages; i++) {
		pages[i] = buffer + page_size * i;
		nodes[i] = node;
		status[i] = 0;
	}

	ret = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
	assert(ret == 0);

	printf("%s node=%d\n", __func__, node);
}

void move_each_to_node(int node)
{
	long ret;
	int i;

	for (i = 0; i < npages; i++) {
		pages[0] = buffer + page_size * i;
		nodes[0] = node;
		status[0] = -1;

		/* Kick the evictor */
		sem_post(&sem);
		usleep(10);
		ret = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
		assert(ret == 0);
	}

	printf("%s node=%d\n", __func__, node);
}

void check(void)
{
	pid_t pid = getpid();
	int fd, ret;
	char *p, *end;

	sprintf(str, "/proc/%d/smaps_rollup", pid);
	printf("Dump file: %s\n==========\n", str);
	fd = open(str, O_RDONLY);
	assert(fd >= 0);

	while (1) {
		ret = read(fd, str, sizeof(str)-1);
		assert(ret >= 0);
		if (ret == 0)
			break;
	}

	p = strstr(str, "Pss_Shmem:");
	end = strchr(p, '\n');
	*end = 0;
	printf("%s\n", p);

	close(fd);
}

int main(void)
{
	do_init();
	move_all_to_node(0);
	move_each_to_node(1);
	/* Wait for the evictor to finish punching all the holes */
	pthread_join(thread, NULL);
	/* Should be "0 kB" */
	check();
	do_quit();

	return 0;
}
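
/*
 * Build note (my assumption, not part of the original attachment): this
 * needs libnuma for move_pages() and pthreads, e.g.
 *
 *   gcc zap-migrate.c -o zap-migrate -lnuma -lpthread
 *
 * and a box with at least two NUMA nodes, since pages are migrated between
 * node 0 and node 1.
 */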