Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

From: Mike Kravetz
Date: Thu Aug 25 2022 - 13:01:21 EST


On 08/24/22 10:57, Mike Kravetz wrote:
> When page fault code needs to allocate and instantiate a new hugetlb
> page (hugetlb_no_page), it checks early to determine if the fault is
> beyond i_size. When discovered early, it is easy to abort the fault and
> return an error. However, it becomes much more difficult to handle when
> discovered later, after allocating the page, consuming reservations and
> adding it to the page cache. Backing out changes in such instances is
> difficult and error prone.
>
> Instead of trying to catch and back out all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults. The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch). Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
> fault. After fault code takes mutex, it will notice fault beyond
> i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
> checks in fault code. fault code will add page beyond i_size.
> When truncate code takes mutex for page/index, it will remove the
> page.
> - truncate updates i_size, but fault code obtains mutex first. If
> fault code sees updated i_size it will abort early. If fault code
> does not see updated i_size, it will add page beyond i_size and
> truncate code will remove page when it obtains fault mutex.
>
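To make the above concrete, the fault side ordering relied upon is
roughly the following (a simplified sketch only, not the actual
hugetlb_no_page code; reservation handling and the second i_size check
under the page table lock are omitted, and 'idx'/'mapping' stand for the
faulting index and the file's mapping):

	hash = hugetlb_fault_mutex_hash(mapping, idx);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);

	size = i_size_read(mapping->host) >> huge_page_shift(h);
	if (idx >= size) {
		/* truncate already updated i_size: abort the fault early */
		ret = VM_FAULT_SIGBUS;
		goto out_mutex;
	}

	/*
	 * Otherwise allocate the page and add it to the page cache.  If
	 * truncate updates i_size after the check above, it will find and
	 * remove this page when it takes the same mutex for this index.
	 */
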
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups. For indices not returned in
> the bulk lookup, it will need to look up individual folios to check for
> races with page faults.
>
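For an index not returned by the bulk lookup, the per-index handling
amounts to roughly the following (a sketch of the idea only, not the
exact helper added by this patch; the removal call assumes the
hugetlb_delete_from_page_cache name from earlier in this series, and
unmapping of still-mapped folios plus reserve map accounting are left
out):

	struct folio *folio;
	long freed = 0;
	pgoff_t index;
	u32 hash;

	for (index = start; index < end; index++) {
		hash = hugetlb_fault_mutex_hash(mapping, index);
		mutex_lock(&hugetlb_fault_mutex_table[hash]);

		folio = filemap_get_folio(mapping, index);
		if (folio) {
			/*
			 * A racing fault added this page after i_size was
			 * updated; remove it just like a folio returned by
			 * the bulk filemap_get_folios() lookup.
			 */
			folio_lock(folio);
			hugetlb_delete_from_page_cache(&folio->page);
			folio_unlock(folio);
			folio_put(folio);
			freed++;
		}

		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	}
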
<snip>
> /*
> * remove_inode_hugepages handles two distinct cases: truncation and hole
> * punch. There are subtle differences in operation for each case.
> @@ -418,11 +507,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> * truncation is indicated by end of range being LLONG_MAX
> * In this case, we first scan the range and release found pages.
> * After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
> - * maps and global counts. Page faults can not race with truncation
> - * in this routine. hugetlb_no_page() prevents page faults in the
> - * truncated range. It checks i_size before allocation, and again after
> - * with the page table lock for the page held. The same lock must be
> - * acquired to unmap a page.
> + * maps and global counts. Page faults can race with truncation.
> + * During faults, hugetlb_no_page() checks i_size before page allocation,
> + * and again after obtaining page table lock. It will 'back out'
> + * allocations in the truncated range.
> * hole punch is indicated if end is not LLONG_MAX
> * In the hole punch case we scan the range and release found pages.
> * Only when releasing a page is the associated region/reserve map
> @@ -431,75 +519,69 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> * This is indicated if we find a mapped page.
> * Note: If the passed end of range value is beyond the end of file, but
> * not LLONG_MAX this routine still performs a hole punch operation.
> + *
> + * Since page faults can race with this routine, care must be taken as both
> + * modify huge page reservation data. To somewhat synchronize these operations
> + * the hugetlb fault mutex is taken for EVERY index in the range to be hole
> + * punched or truncated. In this way, we KNOW either:
> + * - fault code has added a page beyond i_size, and we will remove here
> + * - fault code will see updated i_size and not add a page beyond
> + * The parameter 'lm_end' indicates the offset of the end of hole or file
> + * before truncation. For hole punch lm_end == lend.
> */
> static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> - loff_t lend)
> + loff_t lend, loff_t lm_end)
> {
> struct hstate *h = hstate_inode(inode);
> struct address_space *mapping = &inode->i_data;
> const pgoff_t start = lstart >> huge_page_shift(h);
> const pgoff_t end = lend >> huge_page_shift(h);
> + pgoff_t m_end = lm_end >> huge_page_shift(h);
> + pgoff_t m_start, m_index;
> struct folio_batch fbatch;
> + struct folio *folio;
> pgoff_t next, index;
> - int i, freed = 0;
> + unsigned int i;
> + long freed = 0;
> + u32 hash;
> bool truncate_op = (lend == LLONG_MAX);
>
> folio_batch_init(&fbatch);
> - next = start;
> + next = m_start = start;
> while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
> for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> - struct folio *folio = fbatch.folios[i];
> - u32 hash = 0;
> + folio = fbatch.folios[i];
>
> index = folio->index;
> - hash = hugetlb_fault_mutex_hash(mapping, index);
> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
> -
> /*
> - * If folio is mapped, it was faulted in after being
> - * unmapped in caller. Unmap (again) now after taking
> - * the fault mutex. The mutex will prevent faults
> - * until we finish removing the folio.
> - *
> - * This race can only happen in the hole punch case.
> - * Getting here in a truncate operation is a bug.
> + * Take fault mutex for missing folios before index,
> + * while checking folios that might have been added
> + * due to a race with fault code.
> */
> - if (unlikely(folio_mapped(folio))) {
> - BUG_ON(truncate_op);
> -
> - i_mmap_lock_write(mapping);
> - hugetlb_vmdelete_list(&mapping->i_mmap,
> - index * pages_per_huge_page(h),
> - (index + 1) * pages_per_huge_page(h),
> - ZAP_FLAG_DROP_MARKER);
> - i_mmap_unlock_write(mapping);
> - }
> + freed += fault_lock_inode_indicies(h, inode, mapping,
> + m_start, m_index, truncate_op);

The 'm_index' argument in the call above should be 'index', as discovered here:
https://lore.kernel.org/linux-mm/CA+G9fYsHVdu0toduQqk6vsR8Z8mOVzZ9-_p3O5fjQ5mOpSxsDA@xxxxxxxxxxxxxx/
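
That is, the call should read:

	freed += fault_lock_inode_indicies(h, inode, mapping,
					m_start, index, truncate_op);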

--
Mike Kravetz