Re: [PATCH v2 2/4] mm: add file_fdatawait_range and file_write_and_wait

From: Steven Whitehouse
Date: Mon Jul 31 2017 - 08:05:43 EST


Hi,


On 31/07/17 12:44, Jeff Layton wrote:
On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
Hi,


On 31/07/17 12:27, Jeff Layton wrote:
On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
On Wed 26-07-17 13:55:36, Jeff Layton wrote:
+int file_write_and_wait(struct file *file)
+{
+ int err = 0, err2;
+ struct address_space *mapping = file->f_mapping;
+
+ if ((!dax_mapping(mapping) && mapping->nrpages) ||
+ (dax_mapping(mapping) && mapping->nrexceptional)) {
+ err = filemap_fdatawrite(mapping);
+ /* See comment of filemap_write_and_wait() */
+ if (err != -EIO) {
+ loff_t i_size = i_size_read(mapping->host);
+
+ if (i_size != 0)
+ __filemap_fdatawait_range(mapping, 0,
+ i_size - 1);
+ }
+ }
Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
range and ignore i_size. It is much easier than trying to wrap your head
around possible races with file operations modifying i_size.

Honza
I'm basically emulating _exactly_ what filemap_write_and_wait does here,
as I'm leery of making subtle behavior changes in the actual writeback
behavior. For example:

-----------------8<----------------
static inline int __filemap_fdatawrite(struct address_space *mapping,
int sync_mode)
{
return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
}

int filemap_fdatawrite(struct address_space *mapping)
{
return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
}
EXPORT_SYMBOL(filemap_fdatawrite);
-----------------8<----------------

...which then sets up the wbc with the right ranges and sync mode and
kicks off writepages. But then, it does the i_size_read to figure out
what range it should wait on (with the shortcut for the size == 0 case).

My assumption was that it was intentionally designed that way, but I'm
guessing from your comments that it wasn't? If so, then we can turn
file_write_and_wait a static inline wrapper around
file_write_and_wait_range.
FWIW, I did a bit of archaeology in the linux-history tree and found
this patch from Marcelo in 2004. Is this optimization still helpful? If
not, then that does simplify the code a bit.

-------------------8<--------------------

[PATCH] small wait_on_page_writeback_range() optimization

filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
parameter. This is not needed since we know the EOF from the inode. Use
that instead.

Signed-off-by: Marcelo Tosatti <marcelo.tosatti@xxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
---
mm/filemap.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 78e18b7639b6..55fb7b4141e4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
*/
int filemap_fdatawait(struct address_space *mapping)
{
- return wait_on_page_writeback_range(mapping, 0, -1);
+ loff_t i_size = i_size_read(mapping->host);
+
+ if (i_size == 0)
+ return 0;
+
+ return wait_on_page_writeback_range(mapping, 0,
+ (i_size - 1) >> PAGE_CACHE_SHIFT);
}
EXPORT_SYMBOL(filemap_fdatawait);

Does this ever get called in cases where we would not hold fs locks? In
that case we definitely don't want to be relying on i_size,

Steve.

Yes. We can initiate and wait on writeback from any context where you
can sleep, really.

We're just waiting on whole file writeback here, so I don't think
there's anything wrong. As long as the i_size was valid at some point in
time prior to waiting then you're ok.

The question I have is more whether this optimization is still useful.

What we do now is just walk the radix tree and wait_on_page_writeback
for each page. Do we gain anything by avoiding ranges beyond the current
EOF with the pagecache infrastructure of 2017?


If this can be called from anywhere without fs locks, then i_size is not known. That has been a problem in the past since i_size may have changed on another node. We avoid that in this case due to only changing i_size under an exclusive lock, and also only having dirty pages when we have an exclusive lock. There is another case though, if the inode is a block device, i_size will be zero. That is the case for the address space that looks after rgrps for GFS2. We do (luckily!) call filemap_fdatawait_range() directly in that case. For "normal" inodes though, the address space for metadata is backed by the block device inode, so that looks like it might be an issue, since fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the metamapping. It might potentially be an issue in other cases too,

Steve.