Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

From: Naoya Horiguchi
Date: Thu Feb 07 2019 - 21:34:53 EST


On Thu, Feb 07, 2019 at 10:50:55AM -0800, Mike Kravetz wrote:
> On 1/30/19 1:14 PM, Mike Kravetz wrote:
> > Files can be created and mapped in an explicitly mounted hugetlbfs
> > filesystem. If pages in such files are migrated, the filesystem
> > usage will not be decremented for the associated pages. This can
> > result in mmap or page allocation failures as it appears there are
> > fewer pages in the filesystem than there should be.
>
> Does anyone have a little time to take a look at this?
>
> While migration of hugetlb pages 'should' not be a common issue, we
> have seen it happen via soft memory errors/page poisoning in production
> environments. Didn't see a leak in that case as it was with pages in a
> Sys V shared mem segment. However, our DB code is starting to make use
> of files in explicitly mounted hugetlbfs filesystems. Therefore, we are
> more likely to hit this bug in the field.

Hi Mike,

Thank you for finding and reporting the problem.
# Sorry for my late response.

>
> >
> > For example, a test program which hole punches, faults and migrates
> > pages in such a file (1G in size) will eventually fail because it
> > can not allocate a page. Reported counts and usage at time of failure:
> >
> > node0
> > 537 free_hugepages
> > 1024 nr_hugepages
> > 0 surplus_hugepages
> > node1
> > 1000 free_hugepages
> > 1024 nr_hugepages
> > 0 surplus_hugepages
> >
> > Filesystem Size Used Avail Use% Mounted on
> > nodev 4.0G 4.0G 0 100% /var/opt/hugepool
> >
> > Note that the filesystem shows 4G of pages used, while actual usage is
> > 511 pages (just under 1G). Failed trying to allocate page 512.
> >
> > If a hugetlb page is associated with an explicitly mounted filesystem,
> > this information is contained in the page_private field. At migration
> > time, this information is not preserved. To fix, simply transfer
> > page_private from old to new page at migration time if necessary. Also,
> > migrate_page_states() unconditionally clears page_private and PagePrivate
> > of the old page. It is unlikely, but possible that these fields could
> > be non-NULL and are needed at hugetlb free page time. So, do not touch
> > these fields for hugetlb pages.
> >
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
> > Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > ---
> > fs/hugetlbfs/inode.c | 10 ++++++++++
> > mm/migrate.c | 10 ++++++++--
> > 2 files changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 32920a10100e..fb6de1db8806 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -859,6 +859,16 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
> > rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > if (rc != MIGRATEPAGE_SUCCESS)
> > return rc;
> > +
> > + /*
> > + * page_private is subpool pointer in hugetlb pages, transfer
> > + * if needed.
> > + */
> > + if (page_private(page) && !page_private(newpage)) {
> > + set_page_private(newpage, page_private(page));
> > + set_page_private(page, 0);

Don't you have to copy the PagePrivate flag, too?

> > + }
> > +
> > if (mode != MIGRATE_SYNC_NO_COPY)
> > migrate_page_copy(newpage, page);
> > else
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index f7e4bfdc13b7..0d9708803553 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -703,8 +703,14 @@ void migrate_page_states(struct page *newpage, struct page *page)
> > */
> > if (PageSwapCache(page))
> > ClearPageSwapCache(page);
> > - ClearPagePrivate(page);
> > - set_page_private(page, 0);
> > + /*
> > + * Unlikely, but PagePrivate and page_private could potentially
> > + * contain information needed at hugetlb free page time.
> > + */
> > + if (!PageHuge(page)) {
> > + ClearPagePrivate(page);
> > + set_page_private(page, 0);
> > + }

# This argument is mainly for existing code...

According to the comment on migrate_page():

/*
* Common logic to directly migrate a single LRU page suitable for
* pages that do not use PagePrivate/PagePrivate2.
*
* Pages are locked upon entry and exit.
*/
int migrate_page(struct address_space *mapping, ...

This common logic assumes that page_private is not used, so why do
we explicitly clear page_private in migrate_page_states()?
buffer_migrate_page(), which is commonly used when page_private is
in use, does that clearing outside migrate_page_states().
So I thought that hugetlbfs_migrate_page() could do it in a similar manner.
IOW, migrate_page_states() should not do anything with PagePrivate.
But there are a few other .migratepage callbacks, and I'm not sure all of
them are safe with that change, so this approach might not be suitable for
a small fix.
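
# To make the idea concrete, here is a rough userspace sketch, not kernel
# code. struct toy_page and all toy_* names are hypothetical stand-ins I made
# up for illustration; they only model the PagePrivate flag and page_private
# value. The sketch assumes the flag should travel together with the value,
# which is exactly the open question I asked above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the two fields under discussion. */
struct toy_page {
	bool private_flag;	/* models the PagePrivate flag */
	unsigned long private;	/* models page_private (subpool pointer) */
};

/*
 * Models a migrate_page_states() that does not touch
 * PagePrivate/page_private at all (the alternative discussed above).
 */
static void toy_migrate_page_states(struct toy_page *newpage,
				    struct toy_page *page)
{
	(void)newpage;
	(void)page;
}

/*
 * Models the callback moving the private state itself.
 * Assumption: the flag is moved along with the value.
 */
static void toy_transfer_private(struct toy_page *newpage,
				 struct toy_page *page)
{
	assert(newpage != NULL && page != NULL);

	if (page->private && !newpage->private) {
		newpage->private = page->private;
		page->private = 0;
	}
	if (page->private_flag) {
		newpage->private_flag = true;
		page->private_flag = false;
	}
}

/* Models hugetlbfs_migrate_page() owning the private-state handling. */
static void toy_hugetlbfs_migrate_page(struct toy_page *newpage,
				       struct toy_page *page)
{
	toy_transfer_private(newpage, page);
	toy_migrate_page_states(newpage, page);
}
```

# With this split, the common helper stays ignorant of page_private, and each
# callback that uses it is responsible for moving or clearing it.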

# BTW, there seems to be a typo in $SUBJECT.

Thanks,
Naoya Horiguchi