Re: [PATCH v2 07/17] fs: new infrastructure for writeback error handling and reporting

From: Jeff Layton
Date: Wed Apr 12 2017 - 14:42:10 EST


My apologies, this patch in particular should have gotten an updated
changelog. Here's a revised patch. The only real difference in this is
the updated changelog.

----------------------------8<--------------------------------

[PATCH] fs: new infrastructure for writeback error handling and reporting

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

It's those non-fsync callers that are problematic. We should be
reporting writeback errors during fsync, but many places in the code
clear out errors before they can be properly reported, or report errors
at nonsensical times. If I get -EIO on a stat() call, there is no reason
for me to assume that it is because some previous writeback failed. The
fact that it also clears out the error such that a subsequent fsync
returns 0 is a bug, IMO, and a nasty one since that's potentially silent
data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set defines a new 32-bit value (wb_err_t) that encompasses an
error code (up to MAX_ERROR), a sequence counter and a "seen" flag.

One of these is added to struct address_space, and a corresponding one
is added to struct file. When errors are reported during writeback, we
set the error field, and clear the seen flag and increment the sequence
counter if the seen flag is set.

On fsync we can check the file's value against what's in the mapping and
quickly return 0 if it hasn't changed. If it has changed, we'll set the
seen flag if it's not already set, update the value in the struct file
to the latest and return an error.

This changes the semantics of fsync such that applications can now use
it to determine whether there were any writeback errors since fsync(fd)
was last called (or since the file was opened in the case of fsync
having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The basic idea here is for filesystems to use filemap_set_wb_error to
set the error in the mapping when there are writeback errors, and then
have the fsync and flush operations call filemap_report_wb_error just
before returning to ensure that those errors get reported properly.

Eventually, it may make sense to move the reporting into the generic
vfs_fsync_range helper, but doing it this way for now makes it simpler
to convert filesystems to the new API individually.

Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
---
Documentation/filesystems/vfs.txt | 10 +-
fs/open.c | 3 +
include/linux/fs.h | 23 +++++
mm/filemap.c | 209 ++++++++++++++++++++++++++++++++++++++
4 files changed, 244 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 94dd27ef4a76..ed06fb39822b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,6 +576,11 @@ should clear PG_Dirty and set PG_Writeback. It can be actually
written at any point after PG_Dirty is clear. Once it is known to be
safe, PG_Writeback is cleared.

+If there is an error during writeback, then the address_space should be
+marked with an error (typically using filemap_set_wb_error), in order to
+ensure that the error can later be reported to the application when an
+fsync is issued.
+
Writeback makes use of a writeback_control structure...

struct address_space_operations
@@ -888,7 +893,10 @@ otherwise noted.

release: called when the last reference to an open file is closed

- fsync: called by the fsync(2) system call
+ fsync: called by the fsync(2) system call. Filesystems that use the
+ pagecache should call filemap_report_wb_error before returning
+ to ensure that any errors that occurred during writeback are
+ reported and the file's error sequence advanced.

fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
diff --git a/fs/open.c b/fs/open.c
index 949cef29c3bb..88bfed8d3c88 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f,
f->f_inode = inode;
f->f_mapping = inode->i_mapping;

+ /* Ensure that we skip any errors that predate opening of the file */
+ f->f_wb_err = filemap_sample_wb_error(f->f_mapping);
+
if (unlikely(f->f_flags & O_PATH)) {
f->f_mode = FMODE_PATH;
f->f_op = &empty_fops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..c5ab4c982839 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -376,6 +376,16 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);

+/*
+ * A wb_err_t is a combination of error value, sequence counter and flag that
+ * is used to track errors that occur during writeback. When a new writeback
+ * error occurs, we set the error field in it and increment the sequence
+ * counter if the current value has been fetched since it was last set.
+ *
+ * See the filemap_*_wb_error functions for details.
+ */
+typedef u32 wb_err_t;
+
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
@@ -394,6 +404,7 @@ struct address_space {
gfp_t gfp_mask; /* implicit gfp mask for allocations */
struct list_head private_list; /* ditto */
void *private_data; /* ditto */
+ wb_err_t wb_err;
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
@@ -846,6 +857,7 @@ struct file {
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
+ wb_err_t f_wb_err;
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
@@ -2521,6 +2533,17 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
extern int filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end);
extern int filemap_check_errors(struct address_space *mapping);
+extern void __filemap_set_wb_error(struct address_space *mapping, int err);
+extern wb_err_t filemap_sample_wb_error(struct address_space *mapping);
+extern int __must_check filemap_report_wb_error(struct file *file);
+extern int filemap_check_wb_error(struct address_space *mapping, wb_err_t since);
+
+static inline void filemap_set_wb_error(struct address_space *mapping, int err)
+{
+ /* Optimize for the common case of no error */
+ if (unlikely(err))
+ __filemap_set_wb_error(mapping, err);
+}

extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
int datasync);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623a6289..4966e9dea945 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -545,6 +545,215 @@ int filemap_write_and_wait_range(struct address_space *mapping,
}
EXPORT_SYMBOL(filemap_write_and_wait_range);

+/*
+ * The wb_err field in the address_space provides a place to store writeback
+ * errors. We endeavor to deliver writeback errors to fsync on all open file
+ * descriptors that were open at the time that the error was caught. We do
+ * this using a 32-bit value to store the error, with the upper bits as a
+ * sequence counter. We can store any error up to MAX_ERRNO.
+ *
+ * Additionally, we reserve one bit to indicate whether any fd has grabbed the
+ * value to record in its struct file. If nothing has, then we don't really
+ * need to increment the counter.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define WB_ERR_SHIFT ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define WB_ERR_SEEN (1 << WB_ERR_SHIFT)
+
+/* Increment the counter by this much to ensure that we don't touch earlier
+ * values */
+#define WB_ERR_CTR_INC (1 << (WB_ERR_SHIFT + 1))
+
+/**
+ * __filemap_set_wb_error - set the wb error in the mapping for later reporting
+ * @mapping: mapping in which the error should be set
+ * @err: error to set. must be negative value but not less than -MAX_ERRNO
+ *
+ * When an error occurs during writeback of inode data, we must report that
+ * error during fsync. This function sets the writeback error field in the
+ * mapping, and increments the sequence counter. When fsync or close is later
+ * performed, the caller can then check the sequence in the mapping against
+ * the one in the file to determine whether the error should be reported.
+ *
+ * Because there are so few bits for the counter, we try to avoid incrementing
+ * it unless someone is going to record the value for later comparison. This
+ * is tracked by a bit in the 32 bit word that we use as a "seen" flag.
+ *
+ * Note that we always use the latest writeback error, since POSIX states
+ * that when there are multiple errors (e.g. -EIO followed by -ENOSPC),
+ * that any possible error may be returned.
+ *
+ * Most callers will want to use the filemap_set_wb_error wrapper to
+ * efficiently handle the common case where err is 0.
+ */
+void __filemap_set_wb_error(struct address_space *mapping, int err)
+{
+ wb_err_t old;
+
+ /* MAX_ERRNO must be able to serve as a mask */
+ BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO + 1);
+
+ /*
+ * Ensure the error code actually fits where we want it to go. If it
+ * doesn't then just throw a warning and don't record anything. We
+ * also don't accept zero here as that would effectively clear a
+ * previous error.
+ */
+ if (WARN(unlikely(err == 0 || (unsigned int)-err > MAX_ERRNO),
+ "err = %d\n", err))
+ return;
+
+ old = READ_ONCE(mapping->wb_err);
+ for (;;) {
+ wb_err_t new, cur;
+
+ /* Clear out error bits and set new error */
+ new = (old & ~(MAX_ERRNO|WB_ERR_SEEN)) | -err;
+
+ /* Only increment if someone has looked at it */
+ if (old & WB_ERR_SEEN)
+ new += WB_ERR_CTR_INC;
+
+ /* If there would be no change, then call it done */
+ if (new == old)
+ break;
+
+ /* Try to swap the new value into place */
+ cur = cmpxchg(&mapping->wb_err, old, new);
+
+ /*
+ * Call it success if we did the swap or someone else beat us
+ * to it for the same value.
+ */
+ if (likely(cur == old || cur == new))
+ break;
+
+ /* Raced with an update, try again */
+ old = cur;
+ }
+}
+EXPORT_SYMBOL(__filemap_set_wb_error);
+
+/**
+ * filemap_sample_wb_error - grab current wb_err_t value for mapping
+ * @mapping: the mapping from which to sample the error
+ *
+ * This function allows callers to "sample" the wb_err_t value from the
+ * mapping, marking it as "seen" if required.
+ *
+ * Note that we handle the common case where we've had no writeback errors
+ * as a special case. We don't need to mark the SEEN bit in that case since
+ * an all-zeroed out wb_err_t will only ever exist at first initialization.
+ */
+wb_err_t filemap_sample_wb_error(struct address_space *mapping)
+{
+ wb_err_t old = READ_ONCE(mapping->wb_err);
+ wb_err_t new = old;
+
+ /*
+ * For the common case of no errors ever having been set, we can skip
+ * marking the SEEN bit. Once an error has been set, the value will
+ * never go back to zero.
+ */
+ if (old != 0) {
+ new |= WB_ERR_SEEN;
+ if (old != new)
+ cmpxchg(&mapping->wb_err, old, new);
+ }
+ return new;
+}
+EXPORT_SYMBOL(filemap_sample_wb_error);
+
+/**
+ * filemap_report_wb_error - report wb error (if any) that was previously set
+ * @file: struct file on which the error is being reported
+ *
+ * When userland calls fsync (or something like nfsd does the equivalent), we
+ * want to report any writeback errors that occurred since the last fsync (or
+ * since the file was opened if there haven't been any).
+ *
+ * Grab the wb_err from the mapping. If it matches what we have in the file,
+ * then just quickly return 0. The file is all caught up.
+ *
+ * If it doesn't match, then take the mapping value, set the "seen" flag in
+ * it and try to swap it into place. If it works, or another task beat us
+ * to it with the new value, then update the f_wb_err and return the error
+ * portion. The error at this point must be reported via proper channels
+ * (a'la fsync, or NFS COMMIT operation, etc.).
+ *
+ * While we handle mapping->wb_err with atomic operations, the f_wb_err
+ * value is protected by the f_lock since we must ensure that it reflects
+ * the latest value swapped in for this file descriptor.
+ */
+int filemap_report_wb_error(struct file *file)
+{
+ int err = 0;
+ struct address_space *mapping = file->f_mapping;
+ wb_err_t old, new;
+
+ /*
+ * Catch the common case where nothing has changed without locking
+ *
+ * We always store values with the "seen" bit set (except in the case
+ * where the entire thing is 0), so if this matches what we already
+ * have, then we can call it done. There is nothing to update in that
+ * case.
+ */
+ if (likely(READ_ONCE(mapping->wb_err) == READ_ONCE(file->f_wb_err)))
+ goto out;
+
+ /* Something changed, must use slow path */
+ spin_lock(&file->f_lock);
+ /* Fetch again to make sure we get the latest */
+ old = READ_ONCE(mapping->wb_err);
+
+ if (likely(old != file->f_wb_err)) {
+ /*
+ * Set the flag and try to swap it into place if it has
+ * changed.
+ *
+ * We don't care about the outcome of the swap here. If the
+ * swap doesn't occur, then it has either been updated by a
+ * writer who is bumping the seq count anyway, or another
+ * reader who is just setting the "seen" flag. Either outcome
+ * is OK here, and we can advance f_wb_err and return an
+ * error based on what we have.
+ */
+ new = old | WB_ERR_SEEN;
+ if (new != old)
+ cmpxchg(&mapping->wb_err, old, new);
+ file->f_wb_err = new;
+ err = -(new & MAX_ERRNO);
+ }
+ spin_unlock(&file->f_lock);
+out:
+ return err;
+}
+EXPORT_SYMBOL(filemap_report_wb_error);
+
+/**
+ * filemap_check_wb_error - has an error occurred since the mark was sampled?
+ * @mapping: mapping to check for writeback errors
+ * @since: previously-sampled wb_err_t
+ *
+ * Grab the wb_err_t value from the mapping, and see if it has changed "since"
+ * the given value was sampled.
+ *
+ * If it has then report the latest error set, otherwise return 0.
+ */
+int filemap_check_wb_error(struct address_space *mapping, wb_err_t since)
+{
+ wb_err_t cur = READ_ONCE(mapping->wb_err);
+
+ if (likely(cur == since))
+ return 0;
+ return -(cur & MAX_ERRNO);
+}
+EXPORT_SYMBOL(filemap_check_wb_error);
+
/**
* replace_page_cache_page - replace a pagecache page with a new one
* @old: page to be replaced
--
2.9.3

--
Jeff Layton <jlayton@xxxxxxxxxx>