Re: [PATCH 5/5] block: enable dax for raw block devices

From: Dave Chinner
Date: Mon Oct 26 2015 - 02:23:42 EST


On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
> On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
> >> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
> >> filesystems, since there's nothing which write-protects pages that are
> >> writeably mapped. In the normal path, page writeback does this, but that
> >> doesn't happen for DAX. I remember we once talked about this but it got
> >> lost. We need something like walking all filesystem inodes during fs
> >> freeze and write-protecting all pages that are mapped. But that's going
> >> to be slow...
> >
> > fsync() has the same problem - we have no record of the pages that
> > need to be committed and then write protected when fsync() is called
> > after write()...
>
> I know Ross is still working on that implementation. However, I had a
> thought on the flight to ksummit that maybe we shouldn't worry about
> tracking dirty state on a per-page basis. For small / frequent
> synchronizations an application really should be using the nvml
> library [1] to issue cache flushes and pcommit from userspace on a
> per-cacheline basis. That leaves unmodified apps that want to be
> correct in the presence of dax mappings. Two things we can do to
> mitigate that case:
>
> 1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
> flag. Applications shouldn't silently become incorrect simply because
> the fs is mounted with -o dax. If an app doesn't understand DAX
> mappings it should get page-cache semantics. This also protects apps
> that are not expecting DAX semantics on raw block device mappings.

Which is the complete opposite of what we are trying to achieve with
DAX, i.e. that existing applications "just work" with DAX without
modification. So this is a non-starter.

Also, DAX access isn't a property of mmap - it's a property
of the inode. We cannot do DAX access via mmap while mixing page
cache based access through file descriptor based interfaces on the
same inode. This is why I'm adding an inode attribute (on disk) to
enable per-file DAX capabilities - either everything is via the DAX
paths, or nothing is.
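
To illustrate what I mean by all-or-nothing - this is only a sketch,
and the helper name is a placeholder, not the on-disk format - the
on-disk attribute just sets S_DAX on the in-memory inode once, and
every path then keys off IS_DAX():

	/*
	 * Sketch only: propagate a per-inode on-disk DAX attribute to
	 * the VFS inode so that IS_DAX(inode) gives one answer for
	 * read(), write() and mmap() alike.
	 */
	static void example_setup_dax_iflag(struct inode *inode,
					    bool disk_flag_dax)
	{
		if (disk_flag_dax)
			inode->i_flags |= S_DAX;	/* all I/O via DAX paths */
		else
			inode->i_flags &= ~S_DAX;	/* all I/O via page cache */
	}

Every read/write/fault path then branches on IS_DAX(inode) for the
whole file, so page cache and DAX access can never be mixed on the
same inode.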

> 2/ Even if we get a new flag that lets the kernel know the app
> understands DAX mappings, we shouldn't leave fsync broken. Can we
> instead get by with a simple / big hammer solution? I.e.

Because we don't physically have to write back data, the problem is
both simpler and more complex. The simplest solution is for the
underlying block device to implement blkdev_issue_flush() correctly.

i.e. if blkdev_issue_flush() behaves according to its required
semantics - that all volatile cached data is flushed to stable
storage - then fsync-on-DAX will work appropriately. As it is, this
is already needed for journalling filesystems to work correctly, as
they assume that their journal writes are treated correctly as
REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
ordering is maintained....

So, to begin with, this problem needs to be solved at the block
device level. That's the simple, brute-force, big hammer solution to
the problem, and it requires no changes at the filesystem level at
all.
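
Roughly this shape - a sketch only, not a tested patch, using the
current three-argument blkdev_issue_flush() - is all a filesystem's
->fsync would need for the brute-force case:

	/*
	 * Sketch: brute-force fsync for a DAX file. Relies entirely on
	 * the block driver making blkdev_issue_flush() push all volatile
	 * data (CPU caches, write buffers) to stable storage.
	 */
	static int dax_fsync_big_hammer(struct file *file, loff_t start,
					loff_t end, int datasync)
	{
		struct inode *inode = file->f_mapping->host;
		int err;

		/* nothing dirty in the page cache for DAX, but harmless */
		err = filemap_write_and_wait_range(file->f_mapping, start, end);
		if (err)
			return err;

		return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
	}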

However, to avoid having to flush the entire block device range on
fsync we need a much more complex solution that tracks the dirty
ranges of the file and hence what needs committing when fsync is
run....
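
The bookkeeping that implies looks roughly like this - a standalone
userspace model of the idea only, not the kernel implementation; in
the kernel this state would live with the inode/mapping and the
"flush" would be CPU cache flushes over the mapped range plus a
barrier, not block I/O:

	/*
	 * Userspace model of the bookkeeping only - not kernel code.
	 * Write faults record which byte ranges became writeable; fsync
	 * walks and flushes just those ranges instead of the whole device.
	 */
	#include <stdio.h>
	#include <stdlib.h>

	struct dirty_range {
		unsigned long long start, end;	/* [start, end) in bytes */
		struct dirty_range *next;
	};

	static struct dirty_range *dirty_list;	/* per-inode in reality */

	/* called from the write-fault path */
	static void record_dirty(unsigned long long start, unsigned long long end)
	{
		struct dirty_range *r = malloc(sizeof(*r));

		if (!r)
			abort();
		r->start = start;
		r->end = end;
		r->next = dirty_list;
		dirty_list = r;
	}

	/* called from fsync: flush only what was dirtied, then forget it */
	static void fsync_dirty_ranges(void)
	{
		while (dirty_list) {
			struct dirty_range *r = dirty_list;

			printf("flush bytes %llu-%llu\n", r->start, r->end);
			dirty_list = r->next;
			free(r);
		}
	}

	int main(void)
	{
		record_dirty(0, 4096);			/* first page dirtied */
		record_dirty(1 << 20, (1 << 20) + 4096);
		fsync_dirty_ranges();
		return 0;
	}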

> Disruptive, yes, but if an app cares about efficient persistent memory
> synchronization fsync is already the wrong api.

I don't really care about efficiency right now - correctness comes
first. Fundamentally, the app should not care whether it is writing to
persistent memory or spinning rust - the filesystem needs to
provide the application with exactly the same integrity guarantees
regardless of the underlying storage.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx