Re: [PATCH v5 1/2] dax: Don't touch i_dio_count in dax_do_io()

From: Jan Kara
Date: Thu May 05 2016 - 11:48:21 EST


On Thu 05-05-16 07:27:48, Christoph Hellwig wrote:
> On Thu, May 05, 2016 at 04:16:37PM +0200, Jan Kara wrote:
> > We cannot easily do this currently - the reason is that in several places we
> > wait for i_dio_count to drop to 0 (look for inode_dio_wait()) while
> > holding i_mutex to wait for all outstanding DIO / DAX IO. You'd break this
> > logic with this patch.
> >
> > If we indeed put all writes under i_mutex, this problem would go away but
> > as Dave explains in his email, we consciously do as much IO as we can
> > without i_mutex to allow reasonable scalability of multiple writers into
> > the same file.
>
> So the above should be fine for xfs, but you're telling me that ext4
> is doing DAX I/O without any inode lock at all? In that case it's
> indeed not going to work.

By default ext4 uses i_mutex to serialize both direct (and thus DAX) reads
and writes. However, with the dioread_nolock mount option, we use only
i_data_sem (an ext4-local rwsem) for direct reads and overwrites. That is
enough to guarantee ext4 metadata consistency and gives you better
scalability, but you lose write vs read and write vs write atomicity
(essentially you get the same behavior as for XFS direct IO).
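
To make that concrete, below is a schematic sketch of how a direct/DAX write
could be serialized in the two modes -- this is not the actual ext4 or
dax_do_io() code, and ext4_overwrite_is_safe() / sketch_do_io() are
hypothetical placeholders -- together with the i_dio_count reference that
inode_dio_wait() callers (e.g. truncate under i_mutex) depend on:

static ssize_t sketch_dax_write(struct kiocb *iocb, struct iov_iter *from)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        bool overwrite_nolock;
        ssize_t ret;

        /* Hypothetical helper standing in for ext4's overwrite detection. */
        overwrite_nolock = test_opt(inode->i_sb, DIOREAD_NOLOCK) &&
                           ext4_overwrite_is_safe(inode, iocb, from);

        if (overwrite_nolock)
                /* Only the ext4-local rwsem protects the block mapping;
                 * concurrent writers / readers are not excluded. */
                down_read(&EXT4_I(inode)->i_data_sem);
        else
                /* Default mode: i_mutex serializes all direct/DAX writes. */
                mutex_lock(&inode->i_mutex);

        /* The elevated i_dio_count is what inode_dio_wait() callers block
         * on while holding i_mutex (truncate being the classic example) --
         * which is why the DAX IO path cannot simply stop touching it. */
        inode_dio_begin(inode);
        ret = sketch_do_io(iocb, from);         /* placeholder for the IO */
        inode_dio_end(inode);

        if (overwrite_nolock)
                up_read(&EXT4_I(inode)->i_data_sem);
        else
                mutex_unlock(&inode->i_mutex);

        return ret;
}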

> > The downside of that is that overwrites and writes vs reads are not atomic
> > wrt each other as POSIX requires. It has been that way for direct IO in XFS
> > case for a long time, with DAX this non-conforming behavior is proliferating
> > more. I agree that's not ideal but serializing all writes on a file is
> > rather harsh for persistent memory as well...
>
> For non-O_DIRECT I/O it's simply required...

Well, we already break write vs read atomicity for buffered IO for all
filesystems except XFS, which has its own special locking, so that is not a
new thing. I agree that also breaking write vs write atomicity for 'normal'
IO is new, and in a way more serious, since the corrupted result (e.g. two
racing writes to the same range ending up interleaved) gets stored on disk
and some applications may be broken by that. So we should fix that.

I was hoping that Davidlohr would come up with a more scalable range-locking
implementation than my original RB-tree based one, which we could then use,
but that seems to be taking longer than I originally expected...
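
For readers who have not seen the idea, below is a minimal, deliberately
naive userspace sketch of byte-range locking semantics (a list plus a
condition variable -- nothing like the RB-tree based kernel implementation
mentioned above): overlapping ranges exclude each other while non-overlapping
IO proceeds in parallel.

#include <pthread.h>
#include <stdlib.h>

struct range {
        unsigned long start, end;               /* [start, end) */
        struct range *next;
};

struct range_lock_tree {
        pthread_mutex_t lock;
        pthread_cond_t wait;
        struct range *held;                     /* currently held ranges */
};

#define RANGE_LOCK_TREE_INIT \
        { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

static int overlaps(struct range *r, unsigned long start, unsigned long end)
{
        return r->start < end && start < r->end;
}

static int any_overlap(struct range_lock_tree *t, unsigned long start,
                       unsigned long end)
{
        struct range *r;

        for (r = t->held; r; r = r->next)
                if (overlaps(r, start, end))
                        return 1;
        return 0;
}

void range_lock(struct range_lock_tree *t, unsigned long start, unsigned long end)
{
        struct range *r = malloc(sizeof(*r));

        r->start = start;
        r->end = end;
        pthread_mutex_lock(&t->lock);
        /* Sleep until no currently held range overlaps ours. */
        while (any_overlap(t, start, end))
                pthread_cond_wait(&t->wait, &t->lock);
        r->next = t->held;
        t->held = r;
        pthread_mutex_unlock(&t->lock);
}

void range_unlock(struct range_lock_tree *t, unsigned long start, unsigned long end)
{
        struct range **p, *r;

        pthread_mutex_lock(&t->lock);
        for (p = &t->held; (r = *p) != NULL; p = &r->next) {
                if (r->start == start && r->end == end) {
                        *p = r->next;
                        free(r);
                        break;
                }
        }
        /* Wake anyone who may have been waiting on the released range. */
        pthread_cond_broadcast(&t->wait);
        pthread_mutex_unlock(&t->lock);
}

A write path would then take range_lock(&tree, pos, pos + len) around the
copy instead of a whole-file lock, so writers to disjoint ranges no longer
serialize against each other, while overlapping writes (and reads, if they
take the lock too) remain atomic with respect to one another.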

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR