Re: ext3: call blkdev_issue_flush on fsync

From: Jan Kara
Date: Wed Jan 28 2009 - 04:55:34 EST


On Wed 28-01-09 18:45:13, Fernando Luis Vázquez Cao wrote:
> On Mon, 2009-01-19 at 13:03 +0100, Jan Kara wrote:
> > On Sat 17-01-09 19:00:49, Fernando Luis Vázquez Cao wrote:
> > > On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> > > > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > > > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > > > and the journal didn't emit a write barrier (either because metadata is
> > > > > > not being synched or barriers are disabled).
> > > > > >
> > > > > > Signed-off-by: Fernando Luis Vazquez Cao <fernando@xxxxxxxxxxxxx>
> > > > > > ---
> > > > > Only two minor nits:
> > > > >
> > > > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> > > > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
> > > > > > @@ -27,6 +27,7 @@
> > > > > > #include <linux/sched.h>
> > > > > > #include <linux/writeback.h>
> > > > > > #include <linux/jbd.h>
> > > > > > +#include <linux/blkdev.h>
> > > > > > #include <linux/ext3_fs.h>
> > > > > > #include <linux/ext3_jbd.h>
> > > > > >
> > > > > > @@ -45,6 +46,8 @@
> > > > > > int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > > > > {
> > > > > > struct inode *inode = dentry->d_inode;
> > > > > > + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > > > + unsigned long i_state = inode->i_state;
> > > > > > int ret = 0;
> > > > > >
> > > > > > J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > > > > */
> > > > > > if (ext3_should_journal_data(inode)) {
> > > > > > ret = ext3_force_commit(inode->i_sb);
> > > > > > + if (!(journal->j_flags & JFS_BARRIER))
> > > > > > + goto no_journal_barrier;
> > > > > > goto out;
> > > > > > }
> > > > > >
> > > > > > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > > > - goto out;
> > > > > > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > > > + goto flush_blkdev;
> > > > > >
> > > > > > /*
> > > > > > * The VFS has written the file data. If the inode is unaltered
> > > > > > * then we need not start a commit.
> > > > > > */
> > > > > > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > > + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > > struct writeback_control wbc = {
> > > > > > .sync_mode = WB_SYNC_ALL,
> > > > > > .nr_to_write = 0, /* sys_fsync did this */
> > > > > > };
> > > > > > ret = sync_inode(inode, &wbc);
> > > > > > + if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > > > + goto no_journal_barrier;
> > > > > I cannot imagine "journal" will be NULL here.
> > > >
> > > > I'll try to check whether that is always so just in case.
> > > >
> > > > > And we can also optimize here a bit and do "goto out" because here
> > > > > we know the barrier has been issued.
> > > >
> > > > Yep, I was considering the same optimization. By the way, I was
> > > > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > > > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
> > >
> > > The last phrase should read " do not force a flush when "barrier=0" ".
> > I was hesitating about this a bit. But I don't think so. The reason is
> > that POSIX (or any other reasonable specification) mandates that fsync()
> > should return only after the data is safely on storage. So if we don't
> > flush blockdevice caches, we effectively violate POSIX and we should never
> > do that. With barriers the matter is a bit different - that is just a
> > filesystem specific thing, no standard guarantees anything.
>
> Hi Jan,
>
> Sorry it's taken me so long to get back to you.
>
> Thinking a lit bit more about this issue, it occurred to me that adding
> a new mount option à la existing "barrier" is likely to be preferable.
> As an example where such an option could make sense, let's consider a
> system with battery-backup cache devices. Since the battery-backup
> guarantees the data still not committed to the platter will not vanish
> in the event of a power down, it should be possible to obtain a
> performance gain by optimizing out the device flush on every
> fsync()/fdatasync() call.
Yes, I came to this conclusion as well when we were discussing how to fix
other filesystems which fail to issue flush in fsync
(https://kerneltrap.org/mailarchive/linux-ext4/2009/1/22/4788004/thread).
I think it makes sence there's a general mount option recognized at VFS
level which would set superblock flag whether flush on fsync is needed or
not. Then you could use this flag in your patch.
So if you have time, please write a (separate) patch introducing this
generic option. Thanks.

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/