Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
From: David Howells
Date: Thu Mar 07 2024 - 05:36:43 EST
Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> On Wed, Mar 06, 2024 at 10:39:37PM +0000, David Howells wrote:
> > Here's a patch to have a go at getting rid of ->launder_folio(). Since it's
> > failable and cannot guarantee that pages in the range are removed, I've tried
> > to replace laundering with just flush-and-wait, dropping the folio lock around
> > the I/O.
>
> My sense is that ->launder_folio doesn't actually need to be replaced.
>
> commit e3db7691e9f3dff3289f64e3d98583e28afe03db
> Author: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
> Date: Wed Jan 10 23:15:39 2007 -0800
>
> [PATCH] NFS: Fix race in nfs_release_page()
>
> NFS: Fix race in nfs_release_page()
>
> invalidate_inode_pages2() may find the dirty bit has been set on a page
> owing to the fact that the page may still be mapped after it was locked.
> Only after the call to unmap_mapping_range() are we sure that the page
> can no longer be dirtied.
> In order to fix this, NFS has hooked the releasepage() method and tries
> to write the page out between the call to unmap_mapping_range() and the
> call to remove_mapping(). This, however leads to deadlocks in the page
> reclaim code, where the page may be locked without holding a reference
> to the inode or dentry.
>
> Fix is to add a new address_space_operation, launder_page(), which will
> attempt to write out a dirty page without releasing the page lock.
>
> Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
>
> I don't understand why this couldn't've been solved by page_mkwrite.
> NFS did later add nfs_vm_page_mkwrite in July 2007, and maybe it's just
> not needed any more? I haven't looked into it enough to make sure,
> but my belief is that we should be able to get rid of it.
Okay, it's slightly more complex than I thought - and I'm not sure all callers
are actually using it correctly. There are some additional interesting cases
I've found, beyond the pre-/post-DIO case:
(1) NFS relies on it to retry the write before stripping the pages in the
case where a writeback error occurs. I think this can probably be dealt
with by sticking a filemap_fdatawrite() call before the invalidation.
I'm not sure if this would incur the deadlock with the page reclaim code
of which Trond speaks.
(2) invalidate_inode_pages2() is used in some places to effect invalidation
of the pagecache in the case where the server tells us that a third party
modified the server copy of a file. What the right behaviour should be
here, I'm not sure, but at the moment, any dirty data will get laundered
back to the server. Possibly it should be simply invalidated locally or
the user asked how they want to handle the divergence.
Some filesystems use invalidate_remote_inode() instead which seems to
leave the dirty pages in place locally.
If it is desirous to save the dirty data, then filemap_fdatawrite()
could be deployed before invalidating the pages.
(3) Fuse uses invalidate_inode_pages2() in fuse_do_setattr() to strip all the
pages from an inode that has had its size changed, laundering any page
that's still dirty. This would seem to be excessive, but maybe Miklos
had a reason for doing it that way.
There are some places that should perhaps be using kiocb_invalidate_pages()
and kiocb_invalidate_post_direct_write() instead - assuming Christoph has no
objection to the latter function being exported.
David