Re: [PATCH v2 0/4] hugetlbfs fallocate hole punch race with page faults

From: Hugh Dickins
Date: Tue Oct 27 2015 - 23:34:31 EST


On Tue, 20 Oct 2015, Mike Kravetz wrote:

> The hugetlbfs fallocate hole punch code can race with page faults. The
> result is that after a hole punch operation, pages may remain within the
> hole. No other side effects of this race were observed.
>
> In preparation for adding userfaultfd support to hugetlbfs, it is desirable
> to close the window of this race. This patch set starts by using the same
> mechanism employed in shmem (see commit f00cdc6df7). This greatly reduces
> the race window. However, it is still possible for the race to occur.
>
> The current hugetlbfs code to remove pages did not deal with pages that
> were mapped (because of such a race). This patch set also adds code to
> unmap pages in this rare case. This unmapping of a single page happens
> under the hugetlb_fault_mutex, so it can not be faulted again until the
> end of the operation.
>
> v2:
> Incorporated Andrew Morton's cleanups and added suggested comments
> Added patch 4/4 to unmap single pages in remove_inode_hugepages
>
> Mike Kravetz (4):
> mm/hugetlb: Define hugetlb_falloc structure for hole punch race
> mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch
> mm/hugetlb: page faults check for fallocate hole punch in progress and
> wait
> mm/hugetlb: Unmap pages to remove if page fault raced with hole punch
>
> fs/hugetlbfs/inode.c | 155 ++++++++++++++++++++++++++++--------------------
> include/linux/hugetlb.h | 10 ++++
> mm/hugetlb.c | 39 ++++++++++++
> 3 files changed, 141 insertions(+), 63 deletions(-)

With the addition of i_mmap_lock_write() around hugetlb_vmdelete_list()
in 4/4 (that you already sent a patch for), and two very minor and
inessential mods to the test in 3/4 that I'll suggest in reply to that,
this all looks correct to me.

And yet, and yet...

... I have to say that it looks like a bunch of unnecessary complexity:
for which the only justification you give is above, not in any of the
patches: "In preparation for adding userfaultfd support to hugetlbfs,
it is desirable to close the window of this race" (pages faulted into
the hole during holepunch may be left there afterwards).

Of course, the code you've sampled from shmem.c is superb ;) and it
all should work as you have it (and I wouldn't want to tinker with it,
to try and make it simpler here, it's just to easy to get it wrong);
but it went into shmem.c for very different reasons.

I don't know whether you studied the sequence of 1aac1400319d before,
then f00cdc6df7d7 which you cite, then 8e205f779d14 which corrected it,
then b1a366500bd5 which plugged the remaining holes: I had to refresh
my memory of how it came to be like this.

The driving reasons for it in shmem.c were my aversion to making the
shmem inode any bigger for such an offbeat case (but hugetlbfs wouldn't
expect a large number of tiny files, to justify such space saving); my
aversion to adding further locking into the fast path of shmem_fault();
the fact that we already had this fallocate range checking mechanism
from the earliest of those commits; and the CVE number that meant that
the starvation issue had to be taken more seriously than it deserved.

One of the causes of that starvation issue was the fallback unmapping
of single pages within the holepunch loop: your 4/4 actually introduces
that into hugetlbfs for the first time. Another cause of the starvation
issue was the "for ( ; ; )" pincer loop that I reverted in the last of
those commits above: whereas hugetlbfs just has a straightforward start
to end loop there.

It's inherent in the nature of holepunching, that immediately after
the holepunch someone may fault into the hole: I wouldn't have bothered
to make those changes in shmem.c, if there hadn't been the CVE starvation
issue - which I doubt hugetlbfs has before your changes.

But I've found some linux-mm mail from you to Andrea on Sep 30, in a
00/12 userfaultfd thread, where you say "I would like a mechanism to
catch all new huge page allocations as a result of page faults" and
"They would like to 'catch' any tasks that (incorrectly) fault in a
page after hole punch".

I confess I've not done my userfaultfd homework: perhaps it does give
you very good reason for this hugetlbfs holepunch-fault race series, but
I'd really like you to explain that in more detail, before I can Ack it.

But it sounds to me more as if the holes you want punched are not
quite like on other filesystems, and you want to be able to police
them afterwards with userfaultfd, to prevent them from being refilled.

Can't userfaultfd be used just slightly earlier, to prevent them from
being filled while doing the holepunch? Then no need for this patchset?

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/