Re: [PATCH 5/5] block: enable dax for raw block devices

From: Ross Zwisler
Date: Tue Oct 27 2015 - 18:55:17 EST


On Tue, Oct 27, 2015 at 09:19:30AM +1100, Dave Chinner wrote:
> On Mon, Oct 26, 2015 at 05:56:30PM +0900, Dan Williams wrote:
> > On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > Also, DAX access isn't a property of mmap - it's a property
> > > of the inode. We cannot do DAX access via mmap while mixing page
> > > cache based access through file descriptor based interfaces. This
> > > is why I'm adding an inode attribute (on disk) to enable per-file DAX
> > > capabilities - either everything is via the DAX paths, or nothing
> > > is.
> > >
> >
> > Per-inode control sounds very useful, I'll look at a similar mechanism
> > for the raw block case.
> >
> > However, still not quite convinced page-cache control is an inode-only
> > property, especially when direct-i/o is not an inode-property. That
> > said, I agree the complexity of handling mixed mappings of the same
> > file is prohibitive.
>
> We didn't get that choice with direct IO - support via O_DIRECT was
> kinda inherited from other OS's(*). We still have all sorts of
> coherency problems between buffered/mmap/direct IO on the same file,
> and I'd really, really like to avoid making that same mistake again
> with DAX.
>
> i.e. We have a choice with DAX right now that will allow us to avoid
> coherency problems that we know exist and can't solve right now.
> Making DAX an inode property rather than an application context
> property avoids those coherence problems as all access will play by
> the same rules....
>
> (*)That said, some other OS's did O_DIRECT as an inode property (e.g.
> solaris) where O_DIRECT was only done if no other cached operations
> were required (e.g. mmap), and so the fd would transparently shift
> between buffered and O_DIRECT depending on external accesses to the
> inode. This was not liked because of its unpredictable effect on
> CPU usage and IO latency....
>
> > Sounds good, get blkdev_issue_flush() functional first and then worry
> > about building a more efficient solution on top.
>
> *nod*

Okay, I'll get this sent out this week. I've been working furiously on the
fsync/msync solution which tracks dirty pages via the radix tree - I guess
I'll send out an RFC version of those patches tomorrow so that we can begin
the review process and any glaring issues can be addressed soon.

That set has grown rather large, though, and I do worry that making it into
v4.4 would be a stretch, although I guess I'm still holding out hope.
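
For anyone following the O_DIRECT point quoted above, here is a minimal
userspace sketch of why direct I/O is a property of the open file
description rather than of the inode: the same file can be open buffered
through one descriptor and with O_DIRECT through another at the same time.
The helper names (fd_is_direct, probe_open_modes) are made up for
illustration and are not part of any patch in this series. The sketch
assumes Linux, and it assumes the filesystem may reject O_DIRECT, in which
case the second open simply fails.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 1 if O_DIRECT is set on fd, 0 if not, -1 on error. */
static int fd_is_direct(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return (flags & O_DIRECT) ? 1 : 0;
}

/*
 * Hypothetical helper: open the same file twice, once buffered and once
 * with O_DIRECT, and report the O_DIRECT state of each descriptor.
 * Returns 0 on success, or -1 if the buffered open fails.
 * If the filesystem rejects O_DIRECT, *direct is set to -1.
 */
static int probe_open_modes(const char *path, int *buffered, int *direct)
{
    int fd_buf, fd_dio;

    assert(path && buffered && direct);

    fd_buf = open(path, O_RDWR | O_CREAT, 0600);
    if (fd_buf < 0)
        return -1;
    *buffered = fd_is_direct(fd_buf);

    /* Same inode, second open file description, different caching mode. */
    fd_dio = open(path, O_RDWR | O_DIRECT);
    if (fd_dio < 0) {
        *direct = -1;
    } else {
        *direct = fd_is_direct(fd_dio);
        close(fd_dio);
    }

    close(fd_buf);
    return 0;
}
```

Where both opens succeed, the two descriptors refer to one inode but
disagree on whether the page cache is involved, which is the mixed-access
situation an inode-level DAX attribute is meant to rule out.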