Re: [PATCH 00/12] Change readahead API

From: Matthew Wilcox
Date: Thu Feb 13 2020 - 08:43:23 EST


On Wed, Feb 12, 2020 at 08:38:52PM -0800, Andrew Morton wrote:
> On Fri, 24 Jan 2020 17:35:41 -0800 Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> > From: "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx>
> >
> > This series adds a readahead address_space operation to eventually
> > replace the readpages operation. The key difference is that
> > pages are added to the page cache as they are allocated (and
> > then looked up by the filesystem) instead of passing them on a
> > list to the readpages operation and having the filesystem add
> > them to the page cache. It's a net reduction in code for each
> > implementation, more efficient than walking a list, and solves
> > the direct-write vs buffered-read problem reported by yu kuai at
> > https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@xxxxxxxxxx/
>
> Unclear which patch fixes this and how it did it?

I suppose the problem isn't fixed until patch 13/13 is applied.
What yu kuai is seeing is a race where readahead allocates a page,
then passes it to iomap_readpages, which calls xfs_read_iomap_begin()
which looks up the extent. Then thread 2 does DIO which modifies the
extent, because there's nothing to say that thread 1 is still using it.
With this patch series, the readpages code puts the locked pages in the
cache before calling iomap_readpages, so any racing write will block on
the locked page until readahead is completed.

If you're tempted to put this into -mm, I have a couple of new changes;
one to fix a kernel-doc warning for mpage_readahead() and one to add
kernel-doc for iomap_readahead():

+++ b/fs/mpage.c
@@ -339,9 +339,7 @@

/**
* mpage_readahead - start reads against pages
- * @mapping: the address_space
- * @start: The number of the first page to read.
- * @nr_pages: The number of consecutive pages to read.
+ * @rac: Describes which pages to read.
* @get_block: The filesystem's block mapper function.
*
* This function walks the pages and the blocks within each page, building and

+++ b/fs/iomap/buffered-io.c
@@ -395,6 +395,21 @@
return done;
}

+/**
+ * iomap_readahead - Attempt to read pages from a file.
+ * @rac: Describes the pages to be read.
+ * @ops: The operations vector for the filesystem.
+ *
+ * This function is for filesystems to call to implement their readahead
+ * address_space operation.
+ *
+ * Context: The file is pinned by the caller, and the pages to be read are
+ * all locked and have an elevated refcount. This function will unlock
+ * the pages (once I/O has completed on them, or I/O has been determined to
+ * not be necessary). It will also decrease the refcount once the pages
+ * have been submitted for I/O. After this point, the page may be removed
+ * from the page cache, and should not be referenced.
+ */
void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
{
struct inode *inode = rac->mapping->host;

I'll do a v6 with those changes soon, but I would really like a bit more
review from filesystem people, particularly ocfs2 and gfs2.