Re: [RFC 00/11] DAX fsynx/msync support

From: Ross Zwisler
Date: Fri Oct 30 2015 - 14:39:47 EST

On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote:
> > This patch series adds support for fsync/msync to DAX.
> >
> > Patches 1 through 8 add various utilities that the DAX code will eventually
> > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are
> > filesystem changes that are needed after the DAX code is added, but these
> > patches may change slightly as the filesystem fault handling for DAX is
> > being modified ([1] and [2]).
> >
> > I've marked this series as RFC because I'm still testing, but I wanted to
> > get this out there so people would see the direction I was going and
> > hopefully comment on any big red flags sooner rather than later.
> >
> > I realize that we are getting pretty dang close to the v4.4 merge window,
> > but I think that if we can get this reviewed and working it's a much better
> > solution than the "big hammer" approach that blindly flushes entire PMEM
> > namespaces [3].
> We need the "big hammer" regardless of fsync. If REQ_FLUSH and
> REQ_FUA don't do the right thing when it comes to ordering journal
> writes against other IO operations, then the filesystems are not
> crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding
> changes back to stable storage, just like they do for existing
> storage....

I think that what I've got here (when it's fully working) will protect all the
cases that we need.

AFAIK there are three ways that data can be written to a PMEM namespace:

1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or
pmem_rw_bytes(). All of these paths sync the newly written data durably to
media before the I/O completes so they shouldn't have any reliance on

2) Through the DAX I/O path, dax_io(). As with PMEM we flush the newly
written data durably to media before the I/O operation completes, so this path
shouldn't have any reliance on REQ_FUA/REQ_FLUSH.

3) Through mmaps set up by DAX. This is the path we are trying to protect
with the dirty page tracking and flushing in this patch set, and I think that
this is the only path that has reliance on REQ_FLUSH.

The goal of this set is to have the cache writeback all happen as part of the
fsync/msync handling, and then have the REQ_FLUSH just provide the trailing

My guess is that XFS metadata writes happen via path 1), down through the PMEM
driver. Am I missing anything, or should we be good to go?

> > [1]
> > [2]
> > [3]
> >
> > Ross Zwisler (11):
> > pmem: add wb_cache_pmem() to the PMEM API
> > mm: add pmd_mkclean()
> > pmem: enable REQ_FLUSH handling
> > dax: support dirty DAX entries in radix tree
> > mm: add follow_pte_pmd()
> > mm: add pgoff_mkclean()
> > mm: add find_get_entries_tag()
> > fs: add get_block() to struct inode_operations
> I don't think this is the right thing to do - it propagates the use
> of bufferheads as a mapping structure into places where we do not
> want bufferheads. We've recently added a similar block mapping
> interface to the export operations structure for PNFS and that uses
> a "struct iomap" which is far more suited to being an inode
> operation this.
> We have plans to move this to the inode operations for various
> reasons. e.g: multipage write, adding interfaces that support proper
> mapping of holes, etc:
> So after many years of saying no to moving getblocks to the inode
> operations it seems like the wrong thing to do now considering I
> want to convert all the DAX code to use iomaps while only 2/3
> filesystems are supported...

Okay, I'll take a look at this interface. I also think that we may need to
flow through the filesystem before going into the DAX code so that we can
serialize our flushing with respect to extent manipulation, as we had to do
with our DAX fault paths.

> > dax: add support for fsync/sync
> Why put the dax_flush_mapping() in do_writepages()? Why not call it
> directly from the filesystem ->fsync() implementations where a
> getblocks callback could also be provided?

Because that's where you put it in your example. :)

Moving it into the filesystem where we know about get_block() is probably the
right thing to do - I'll check it out. Thanks!
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at