Re: [PATCH] thp: use is_zero_pfn after pte_present check

From: Kirill A. Shutemov
Date: Mon Oct 12 2015 - 11:28:08 EST


On Mon, Oct 12, 2015 at 11:57:46PM +0900, Minchan Kim wrote:
> Hello,
>
> On Mon, Oct 12, 2015 at 01:13:20PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Oct 12, 2015 at 10:54:16AM +0900, Minchan Kim wrote:
> > > Use is_zero_pfn on pteval only after pte_present check on pteval
> > > (It might be better idea to introduce is_zero_pte where checks
> > > pte_present first). Otherwise, it could work with swap or
> > > migration entry and if pte_pfn's result is equal to zero_pfn
> > > by chance, we lose user's data in __collapse_huge_page_copy.
> > > So if you're luck, the application is segfaulted and finally you
> > > could see below message when the application is exit.
> > >
> > > BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3
> >
> > Did you acctually steped on the bug?
> > If yes it's subject for stable@, I think.
>
> Yes, I did with my testing program which made heavy swap-in/out/
> swapoff with MADV_DONTNEED in a memcg.
> Actually, I marked this patch as -stable but removed it right before
> sending because my test program is artificial and didn't see any
> report about rss bad counting with MM_SWAPENTS in linux-mm(Of course,
> I might miss it).
> In addition, sometime I saw someone insists on "It's not a stable
> material if it's not a bug with real workload". I don't want to
> involve such non-technical stuff so waited someone nudges me to
> mark it as -stable and finally, you did. ;-)
> If other reviewers are not against, I will Cc -stable in next spin.
>
> >
> > > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > > ---
> > >
> > > I found this bug with MADV_FREE hard test. Sometime, I saw
> > > "Bad rss-counter" message with MM_SWAPENTS but it's really
> > > rare, once a day if I was luck or once in five days if I was
> > > unlucky so I am doing test still and just pass a few days but
> > > I hope it will fix the issue.
> > >
> > > mm/huge_memory.c | 12 +++++++++++-
> > > 1 file changed, 11 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 4b06b8db9df2..349590aa4533 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -2665,15 +2665,25 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > > for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
> > > _pte++, _address += PAGE_SIZE) {
> > > pte_t pteval = *_pte;
> > > - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > + if (pte_none(pteval)) {
> >
> > In -mm tree we have is_swap_pte() check before this point in
> > khugepaged_scan_pmd()
>
> Actually, I tested this patch with v4.2 kernel so it doesn't have
> the check.
> Now, I look through optimistic check for swapin readahead patch
> in current mmotm.
> It seems the check couldn't prevent this problem because it releases
> pte lock and anon_vma lock before being isolated the page in
> __collapse_huge_page_isolate so the page could be swapped out again.
>
> >
> > Also, what about similar pattern in __collapse_huge_page_isolate() and
> > __collapse_huge_page_copy()? Shouldn't they be fixed as well?
>
> I see what's wrong here.
> /me slaps self.
> The line I was about to change was in __collapse_huge_page_isolate
> but I changed khugepaged_scan_pmd by mistake at last modification
> since that part is almost same. :(
> Fortunately my testing kernel is doing right version.
> Here it goes.
>
> From 2a2e4b247e132d823af30655dbc0b57738e9d6ee Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@xxxxxxxxxx>
> Date: Mon, 12 Oct 2015 09:52:46 +0900
> Subject: [PATCH] thp: use is_zero_pfn only after pte_present check
>
> Use is_zero_pfn on pteval only after pte_present check on pteval
> (It might be better idea to introduce is_zero_pte where checks
> pte_present first). Otherwise, it could work with swap or
> migration entry and if pte_pfn's result is equal to zero_pfn
> by chance, we lose user's data in __collapse_huge_page_copy.
> So if you're luck, the application is segfaulted and finally you
> could see below message when the application is exit.
>
> BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3
>
> Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>

Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>

> ---
> mm/huge_memory.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4b06b8db9df2..bbac913f96bc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2206,7 +2206,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
> _pte++, address += PAGE_SIZE) {
> pte_t pteval = *_pte;
> - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> + if (pte_none(pteval) || (pte_present(pteval) &&
> + is_zero_pfn(pte_pfn(pteval)))) {
> if (!userfaultfd_armed(vma) &&
> ++none_or_zero <= khugepaged_max_ptes_none)
> continue;
> --
> 1.9.1
>
>
> In khugepaged_scan_pmd, although there is no is_swap_pte check in
> v4.2, we don't need to check pte_present check right before is_zero_pfn
> because that part is just scanning operation so even if something wrong
> happens rarely, it should filter out in __collapse_huge_page_isolate
> with this patch.
>
> In __collapse_huge_page_copy, we don't need the check, either.
> Because every ptes in the vma's 2M area point out isolated LRU pages
> and zero page so any pages couldn't be swap-out.
>
> Thanks for the review.

--
Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/