Re: [PATCH] huegtlbfs: fix races and page leaks during migration

From: Naoya Horiguchi
Date: Tue Feb 26 2019 - 02:46:27 EST

Next message: Mike Rapoport: "Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range"
Previous message: Peter Xu: "Re: [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update"
Next in thread: Mike Kravetz: "Re: [PATCH] huegtlbfs: fix races and page leaks during migration"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Mike,

On Thu, Feb 21, 2019 at 11:11:06AM -0800, Mike Kravetz wrote:
> On 2/20/19 10:09 PM, Andrew Morton wrote:
> > On Tue, 12 Feb 2019 14:14:00 -0800 Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index a80832487981..f859e319e3eb 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c

...

> >> @@ -3863,6 +3862,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >> }
> >>
> >> spin_unlock(ptl);
> >> +
> >> + /* May already be set if not newly allocated page */
> >> + set_page_huge_active(page);
> >> +
>
> This is wrong. We need to only set_page_huge_active() for newly allocated
> pages. Why? We could have got the page from the pagecache, and it could
> be that the page is !page_huge_active() because it has been isolated for
> migration. Therefore, we do not want to set it active here.
>
> I have also found another race with migration when removing a page from
> a file. When a huge page is removed from the pagecache, the page_mapping()
> field is cleared yet page_private continues to point to the subpool until
> the page is actually freed by free_huge_page(). free_huge_page is what
> adjusts the counts for the subpool. A page could be migrated while in this
> state. However, since page_mapping() is not set the hugetlbfs specific
> routine to transfer page_private is not called and we leak the page count
> in the filesystem. To fix, check for this condition before migrating a huge
> page. If the condition is detected, return EBUSY for the page.
>
> Both issues are addressed in the updated patch below.
>
> Sorry for the churn. As I find and fix one issue I seem to discover another.
> There is still at least one more issue with private pages when COW comes into
> play. I continue to work that. I wanted to send this patch earlier as it
> is pretty easy to hit the bugs if you try. If you would prefer another
> approach, let me know.
>
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Date: Thu, 21 Feb 2019 11:01:04 -0800
> Subject: [PATCH] huegtlbfs: fix races and page leaks during migration

Subject still contains a typo.

>
> hugetlb pages should only be migrated if they are 'active'. The routines
> set/clear_page_huge_active() modify the active state of hugetlb pages.
> When a new hugetlb page is allocated at fault time, set_page_huge_active
> is called before the page is locked. Therefore, another thread could
> race and migrate the page while it is being added to page table by the
> fault code. This race is somewhat hard to trigger, but can be seen by
> strategically adding udelay to simulate worst case scheduling behavior.
> Depending on 'how' the code races, various BUG()s could be triggered.
>
> To address this issue, simply delay the set_page_huge_active call until
> after the page is successfully added to the page table.
>
> Hugetlb pages can also be leaked at migration time if the pages are
> associated with a file in an explicitly mounted hugetlbfs filesystem.
> For example, consider a two node system with 4GB worth of huge pages
> available. A program mmaps a 2G file in a hugetlbfs filesystem. It
> then migrates the pages associated with the file from one node to
> another. When the program exits, huge page counts are as follows:
>
> node0
> 1024 free_hugepages
> 1024 nr_hugepages
>
> node1
> 0 free_hugepages
> 1024 nr_hugepages
>
> Filesystem Size Used Avail Use% Mounted on
> nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
>
> That is as expected. 2G of huge pages are taken from the free_hugepages
> counts, and 2G is the size of the file in the explicitly mounted filesystem.
> If the file is then removed, the counts become:
>
> node0
> 1024 free_hugepages
> 1024 nr_hugepages
>
> node1
> 1024 free_hugepages
> 1024 nr_hugepages
>
> Filesystem Size Used Avail Use% Mounted on
> nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
>
> Note that the filesystem still shows 2G of pages used, while there
> actually are no huge pages in use. The only way to 'fix' the
> filesystem accounting is to unmount the filesystem
>
> If a hugetlb page is associated with an explicitly mounted filesystem,
> this information in contained in the page_private field. At migration
> time, this information is not preserved. To fix, simply transfer
> page_private from old to new page at migration time if necessary.
>
> There is a related race with removing a huge page from a file migration.
> When a huge page is removed from the pagecache, the page_mapping() field
> is cleared yet page_private remains set until the page is actually freed
> by free_huge_page(). A page could be migrated while in this state.
> However, since page_mapping() is not set the hugetlbfs specific routine
> to transfer page_private is not called and we leak the page count in the
> filesystem. To fix, check for this condition before migrating a huge
> page. If the condition is detected, return EBUSY for the page.
>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
> fs/hugetlbfs/inode.c | 12 ++++++++++++
> mm/hugetlb.c | 12 +++++++++---
> mm/migrate.c | 11 +++++++++++
> 3 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 32920a10100e..a7fa037b876b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -859,6 +859,18 @@ static int hugetlbfs_migrate_page(struct address_space
> *mapping,
> rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> if (rc != MIGRATEPAGE_SUCCESS)
> return rc;
> +
> + /*
> + * page_private is subpool pointer in hugetlb pages. Transfer to
> + * new page. PagePrivate is not associated with page_private for
> + * hugetlb pages and can not be set here as only page_huge_active
> + * pages can be migrated.
> + */
> + if (page_private(page)) {
> + set_page_private(newpage, page_private(page));
> + set_page_private(page, 0);
> + }
> +
> if (mode != MIGRATE_SYNC_NO_COPY)
> migrate_page_copy(newpage, page);
> else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a80832487981..e9c92e925b7e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
...
> @@ -3863,6 +3864,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> }
>
> spin_unlock(ptl);
> +
> + /* Make newly allocated pages active */

You already have a perfect explanation about why we need this "if",

> ... We could have got the page from the pagecache, and it could
> be that the page is !page_huge_active() because it has been isolated for
> migration.

so you could improve this comment with it.

Anyway, I agree to what/how you try to fix.

Reviewed-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>

Thanks,
Naoya Horiguchi

Next message: Mike Rapoport: "Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range"
Previous message: Peter Xu: "Re: [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update"
Next in thread: Mike Kravetz: "Re: [PATCH] huegtlbfs: fix races and page leaks during migration"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]