[RFC PATCH v3 4/8] hugetlbfs: catch and handle truncate racing with page faults

From: Mike Kravetz
Date: Sun May 08 2022 - 15:06:15 EST


Most hugetlb fault handling code checks for faults beyond i_size.
While there are early checks in the code paths, the most difficult
to handle are those discovered after taking the page table lock.
At this point, we have possibly allocated a page and consumed
associated reservations and possibly added the page to the page cache.

When discovering a fault beyond i_size, be sure to:
- Remove the page from page cache, else it will sit there until the
file is removed.
- Do not restore any reservation for the page consumed. Otherwise
there will be an outstanding reservation for an offset beyond the
end of file.

The 'truncation' code in remove_inode_hugepages must deal with fault
code potentially removing a page from the cache after the page was
returned by pagevec_lookup and before locking the page. This can be
discovered by a change in page_mapping() after taking page lock. In
addition, this code must deal with fault code potentially consuming
and returning reservations. To synchronize this, remove_inode_hugepages
will now take the fault mutex for ALL indicies in the hole or truncated
range. In this way, it KNOWS fault code has finished with the page/index
OR fault code will see the updated file size.

Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
---
fs/hugetlbfs/inode.c | 104 +++++++++++++++++++++++++++++++------------
mm/hugetlb.c | 39 ++++++++++++----
2 files changed, 105 insertions(+), 38 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 96ff9ba2b4ba..5645353a9744 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -444,11 +444,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts. Page faults can race with truncation.
+ * During faults, hugetlb_no_page() checks i_size before page allocation,
+ * and again after obtaining page table lock. It will 'back out'
+ * allocations in the truncated range.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserve map
@@ -457,14 +456,26 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
* This is indicated if we find a mapped page.
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
+ *
+ * Since page faults can race with this routine, care must be taken as both
+ * modify huge page reservation data. To somewhat synchronize these operations
+ * the hugetlb fault mutex is taken for EVERY index in the range to be hole
+ * punched or truncated. In this way, we KNOW fault code will either have
+ * completed backout operations under the mutex, or fault code will see the
+ * updated file size and not allocate a page for offsets beyond truncated size.
+ * The parameter 'lm__end' indicates the offset of the end of hole or file
+ * before truncation. For hole punch lm_end == lend.
*/
static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
- loff_t lend)
+ loff_t lend, loff_t lm_end)
{
+ u32 hash;
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
const pgoff_t start = lstart >> huge_page_shift(h);
const pgoff_t end = lend >> huge_page_shift(h);
+ pgoff_t m_end = lm_end >> huge_page_shift(h);
+ pgoff_t m_start, m_index;
struct pagevec pvec;
pgoff_t next, index;
int i, freed = 0;
@@ -476,14 +487,33 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
/*
* When no more pages are found, we are done.
*/
- if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1))
+ m_start = next;
+ if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1)) {
+ /*
+ * To synchronize with faults, take fault mutex for
+ * each index in range.
+ */
+ for (m_index = m_start; m_index < m_end; m_index++) {
+ hash = hugetlb_fault_mutex_hash(mapping,
+ m_index);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ }
break;
+ }

for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash = 0;

index = page->index;
+ /* Take fault mutex for missing pages before index */
+ for (m_index = m_start; m_index < index; m_index++) {
+ hash = hugetlb_fault_mutex_hash(mapping,
+ m_index);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ }
+ m_start = index + 1;
hash = hugetlb_fault_mutex_hash(mapping, index);
mutex_lock(&hugetlb_fault_mutex_table[hash]);

@@ -492,13 +522,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
* unmapped in caller. Unmap (again) now after taking
* the fault mutex. The mutex will prevent faults
* until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
*/
if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
i_mmap_lock_write(mapping);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
@@ -509,27 +534,46 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,

lock_page(page);
/*
- * We must free the huge page and remove from page
- * cache BEFORE removing the region/reserve map
- * (hugetlb_unreserve_pages). In rare out of memory
- * conditions, removal of the region/reserve map could
- * fail. Correspondingly, the subpool and global
- * reserve usage count can need to be adjusted.
+ * After locking page, make sure mapping is the same.
+ * We could have raced with page fault populate and
+ * backout code.
*/
- VM_BUG_ON(HPageRestoreReserve(page));
- hugetlb_delete_from_page_cache(page);
- freed++;
- if (!truncate_op) {
- if (unlikely(hugetlb_unreserve_pages(inode,
+ if (page_mapping(page) == mapping) {
+ /*
+ * We must free the huge page and remove from
+ * page cache BEFORE removing the region/
+ * reserve map (hugetlb_unreserve_pages). In
+ * rare out of memory conditions, removal of
+ * the region/reserve map could fail.
+ * Correspondingly, the subpool and global
+ * reserve usage count can need to be adjusted.
+ */
+ VM_BUG_ON(HPageRestoreReserve(page));
+ hugetlb_delete_from_page_cache(page);
+ freed++;
+ if (!truncate_op) {
+ if (unlikely(
+ hugetlb_unreserve_pages(inode,
index, index + 1, 1)))
- hugetlb_fix_reserve_counts(inode);
+ hugetlb_fix_reserve_counts(
+ inode);
+ }
}
-
unlock_page(page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
+
+ if (!(next < end)) {
+ /* Will exit loop, take mutex for indicies up to end */
+ for (m_index = m_start; m_index < m_end; m_index++) {
+ hash = hugetlb_fault_mutex_hash(mapping,
+ m_index);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ }
+ }
}

if (truncate_op)
@@ -539,8 +583,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
static void hugetlbfs_evict_inode(struct inode *inode)
{
struct resv_map *resv_map;
+ loff_t prev_size = i_size_read(inode);

- remove_inode_hugepages(inode, 0, LLONG_MAX);
+ remove_inode_hugepages(inode, 0, LLONG_MAX, prev_size);

/*
* Get the resv_map from the address space embedded in the inode.
@@ -560,6 +605,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
pgoff_t pgoff;
struct address_space *mapping = inode->i_mapping;
struct hstate *h = hstate_inode(inode);
+ loff_t prev_size = i_size_read(inode);

BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
@@ -570,7 +616,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
ZAP_FLAG_DROP_MARKER);
i_mmap_unlock_write(mapping);
- remove_inode_hugepages(inode, offset, LLONG_MAX);
+ remove_inode_hugepages(inode, offset, LLONG_MAX, prev_size);
}

static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
@@ -604,7 +650,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT, 0);
i_mmap_unlock_write(mapping);
- remove_inode_hugepages(inode, hole_start, hole_end);
+ remove_inode_hugepages(inode, hole_start, hole_end, hole_end);
inode_unlock(inode);
}

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 003df6cc13eb..007b39450f71 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5453,6 +5453,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ bool beyond_i_size = false;
+ bool reserve_alloc = false;

/*
* Currently, we are forced to kill the process in the event the
@@ -5510,6 +5512,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
clear_huge_page(page, address, pages_per_huge_page(h));
__SetPageUptodate(page);
new_page = true;
+ if (HPageRestoreReserve(page))
+ reserve_alloc = true;

if (vma->vm_flags & VM_MAYSHARE) {
int err = hugetlb_add_to_page_cache(page, mapping, idx);
@@ -5568,8 +5572,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

ptl = huge_pte_lock(h, mm, ptep);
size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
+ if (idx >= size) {
+ beyond_i_size = true;
goto backout;
+ }

ret = 0;
/* If pte changed from under us, retry */
@@ -5614,10 +5620,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
backout:
spin_unlock(ptl);
backout_unlocked:
+ if (new_page) {
+ if (new_pagecache_page)
+ hugetlb_delete_from_page_cache(page);
+
+ /*
+ * If reserve was consumed, make sure flag is set so that it
+ * will be restored in free_huge_page().
+ */
+ if (reserve_alloc)
+ SetHPageRestoreReserve(page);
+
+ /*
+ * Do not restore reserve map entries beyond i_size.
+ * Otherwise, there will be leaks when the file is removed.
+ */
+ if (!beyond_i_size)
+ restore_reserve_on_error(h, vma, haddr, page);
+ }
unlock_page(page);
- /* restore reserve for newly allocated pages not in page cache */
- if (new_page && !new_pagecache_page)
- restore_reserve_on_error(h, vma, haddr, page);
put_page(page);
goto out;
}
@@ -5938,15 +5959,15 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
* Recheck the i_size after holding PT lock to make sure not
* to leave any page mapped (as page_mapped()) beyond the end
* of the i_size (remove_inode_hugepages() is strict about
- * enforcing that). If we bail out here, we'll also leave a
- * page in the radix tree in the vm_shared case beyond the end
- * of the i_size, but remove_inode_hugepages() will take care
- * of it as soon as we drop the hugetlb_fault_mutex_table.
+ * enforcing that). If we bail out here, remove the page
+ * added to the radix tree.
*/
size = i_size_read(mapping->host) >> huge_page_shift(h);
ret = -EFAULT;
- if (idx >= size)
+ if (idx >= size) {
+ hugetlb_delete_from_page_cache(page);
goto out_release_unlock;
+ }

ret = -EEXIST;
/*
--
2.35.3