Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag

From: Peter Xu
Date: Tue Apr 04 2023 - 17:22:33 EST


On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@xxxxxxxxxxxx>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that the
> collapse would break. Similar to hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> an fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>
> ---
> mm/khugepaged.c | 79 ++++++++++++++++++-------------------------------
> 1 file changed, 29 insertions(+), 50 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7679551e9540..a19aa140fd52 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> *
> * Basic scheme is simple, details are more complex:
> * - allocate and lock a new huge page;
> - * - scan page cache replacing old pages with the new one
> + * - scan page cache, locking old pages
> * + swap/gup in pages if necessary;
> - * + keep old pages around in case rollback is required;
> + * - copy data to new page
> + * - handle shmem holes
> + * + re-validate that holes weren't filled by someone else
> + * + check for userfaultfd

PS: some of these changes may belong in the previous patch, but it's not
necessary to repost only for this; just in case there'll be a new version.

> * - finalize updates to the page cache;
> * - if replacing succeeds:
> - * + copy data over;
> - * + free old pages;
> * + unlock huge page;
> + * + free old pages;
> * - if replacing failed;
> - * + put all pages back and unfreeze them;
> - * + restore gaps in the page cache;
> + * + unlock old pages
> * + unlock and free huge page;
> */
> static int collapse_file(struct mm_struct *mm, unsigned long addr,
> @@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> }
> } while (1);
>
> - /*
> - * At this point the hpage is locked and not up-to-date.
> - * It's safe to insert it into the page cache, because nobody would
> - * be able to map it or use it in another way until we unlock it.
> - */
> -
> xas_set(&xas, start);
> for (index = start; index < end; index++) {
> page = xas_next(&xas);
> @@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> VM_BUG_ON_PAGE(page != xas_load(&xas), page);
>
> /*
> - * The page is expected to have page_count() == 3:
> + * We control three references to the page:
> * - we hold a pin on it;
> * - one reference from page cache;
> * - one from isolate_lru_page;
> + * If those are the only references, then any new usage of the
> + * page will have to fetch it from the page cache. That requires
> + * locking the page to handle truncate, so any new usage will be
> + * blocked until we unlock page after collapse/during rollback.
> */
> - if (!page_ref_freeze(page, 3)) {
> + if (page_count(page) != 3) {
> result = SCAN_PAGE_COUNT;
> xas_unlock_irq(&xas);
> putback_lru_page(page);

Personally I don't see anything wrong with this change to resolve the
deadlock. E.g. a fast-gup race right before unmapping the pgtables seems
fine, since we'll just bail out on seeing a refcount >3 (or fast-gup bails
out when it rechecks the pte and sees it changed). Either way looks fine here.
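
To spell out the difference between the old freeze and the new plain
check, here is a small userspace model. It is only a sketch, not kernel
code: struct model_page and the model_*() helpers are made-up names that
mimic how page_ref_freeze() and a speculative reference (fast-gup or a
page cache lookup) act on the refcount, and collapse_would_proceed() is a
hypothetical stand-in for the SCAN_PAGE_COUNT check in collapse_file().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for struct page; only the refcount is modeled. */
struct model_page {
	atomic_int refcount;
};

/* Models page_ref_freeze(): atomically swap `expected` for 0, or fail. */
static bool model_ref_freeze(struct model_page *p, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}

/* Models a speculative lookup: take a reference unless the count is 0. */
static bool model_try_get(struct model_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, `old` is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Models the patched check: a plain read that does not freeze anything. */
static bool model_count_is(struct model_page *p, int expected)
{
	return atomic_load(&p->refcount) == expected;
}

/*
 * Hypothetical stand-in for the check in collapse_file(): if a racer
 * took a reference first, the count is 4 and the collapse bails out.
 */
static bool collapse_would_proceed(bool racer_first)
{
	struct model_page p;

	atomic_init(&p.refcount, 3);
	if (racer_first) {
		bool got = model_try_get(&p);

		assert(got);	/* count was 3, so the speculative get succeeds */
	}
	return model_count_is(&p, 3);
}
```

With the freeze, model_try_get() fails for as long as the count is held
at 0. With the plain check, a racer that got in first makes the count 4
and we bail out; a racer arriving after the check is not stopped by the
refcount at all, which is where the page lock (or the pte recheck in the
fast-gup case) has to do the work.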

So far it looks good to me, but that may not mean much given my history of
overlooking things. It'll always be good to hear from Hugh and others.

--
Peter Xu