Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
From: Michal Hocko
Date: Mon Aug 28 2017 - 14:09:23 EST

On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
> Adding Andrew, Michal on CC
>
> On 08/27/2017 01:08 PM, Nadav Amit wrote:
> > Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> >
> >> On 08/26/2017 12:11 PM, Nadav Amit wrote:
> >>> hugetlbfs_fallocate() currently performs put_page() before unlock_page().
> >>> This scenario opens a small time window, from the time the page is added
> >>> to the page cache, until it is unlocked, in which the page might be
> >>> removed from the page cache by another core. If the page is removed
> >>> during this time window, it might cause memory corruption, as the
> >>> wrong page will be unlocked.
> >>>
> >>> It is arguable whether this scenario can happen in a real system, and
> >>> there are several mitigating factors. The issue was found by code
> >>> inspection (actually grep), and not by actually triggering the flow.
> >>> Yet, since putting the page before unlocking is incorrect, it should be
> >>> fixed, if only to prevent future breakage or someone copy-pasting this
> >>> code.
> >>>
> >>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> >>>
> >>> cc: Eric Biggers <ebiggers3@xxxxxxxxx>
> >>> cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> >>>
> >>> Signed-off-by: Nadav Amit <namit@xxxxxxxxxx>
> >>
> >> Thank you Nadav.
> >
> > No problem.
> >
> >>
> >> Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> >>
> >> Since hugetlbfs is an in-memory filesystem, the only way one 'should' be
> >> able to remove a page (file content) is through an inode operation such as
> >> truncate, hole punch, or unlink. That was the basis for my response that
> >> the inode lock would be required for page freeing.
> >>
> >> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
> >> I was expecting to see a check for hugetlbfs pages and exit (without
> >> modification) if encountered. A quick review of the code did not find
> >> any such checks.
> >>
> >> I'll take a closer look to determine exactly how hugetlbfs files are
> >> handled. IMO, there should be something similar to the DAX check where
> >> the routine quickly exits.
> >
> > I did not cc stable when submitting the patch, based on your previous
> > response. Let me know if you want me to send v2 which does so.
>
> I still do not believe there is a need to change this in stable. Your patch
> should be sufficient to ensure we do the right thing going forward.
>
> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
> hugetlbfs does indeed show a more general problem. One can use
> sys_fadvise64() to remove a huge page from a hugetlbfs file. :( This does
> not go through the special hugetlbfs page handling code, but rather the
> normal mm paths. As a result hugetlbfs accounting (like reserve counts)
> gets out of sync and the hugetlbfs filesystem may become unusable. Sigh!!!
>
> I will address this issue in a separate patch.

I didn't check very carefully, but it seems that
http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
should help here, right?
--
Michal Hocko
SUSE Labs
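
For readers following the ordering argument in the quoted commit message,
here is a minimal user-space sketch of why dropping the reference before
unlocking is risky. It is a toy model, not kernel code: struct toy_page and
every toy_* helper are hypothetical names invented for illustration, the
refcount is a plain int, and the model deliberately ignores the locking
rules that normally restrict removal from the page cache (one of the
mitigating factors the commit message alludes to).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page; NOT the kernel's struct page. */
struct toy_page {
    int  refcount;  /* one ref from allocation, one from the page cache */
    bool locked;
    bool freed;
};

/* Drop one reference; the page is "freed" when the count hits zero. */
static void toy_put_page(struct toy_page *p)
{
    assert(!p->freed);
    if (--p->refcount == 0)
        p->freed = true;
}

/* Returns false if the caller touched a page that was already freed. */
static bool toy_unlock_page(struct toy_page *p)
{
    if (p->freed)
        return false;
    p->locked = false;
    return true;
}

/* Models another core removing the page: drops the page cache reference. */
static void toy_remove_from_cache(struct toy_page *p)
{
    toy_put_page(p);
}

/* Ordering described as incorrect: put first, then unlock. */
static bool put_then_unlock(void)
{
    struct toy_page p = { .refcount = 2, .locked = true, .freed = false };

    toy_put_page(&p);            /* only the page cache reference remains */
    toy_remove_from_cache(&p);   /* racing removal lands in the window */
    return toy_unlock_page(&p);  /* operates on a freed page */
}

/* Ordering the patch title asks for: unlock first, then put. */
static bool unlock_then_put(void)
{
    struct toy_page p = { .refcount = 2, .locked = true, .freed = false };
    bool ok = toy_unlock_page(&p); /* our reference still pins the page */

    toy_remove_from_cache(&p);     /* removal cannot free the page yet */
    toy_put_page(&p);              /* last reference dropped here */
    return ok;
}
```

In this model put_then_unlock() reports that the unlock hit a freed page,
while unlock_then_put() does not, because the caller's own reference keeps
the page alive until after the unlock.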