Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

From: Dave Chinner
Date: Thu Jun 22 2017 - 21:00:31 EST


On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > You seem to be calling the "fdatasync on every page fault" the
>
> It's the opposite of fdatasync(). It needs to sync whatever metadata
> is needed to find the data. The data doesn't need to be synced.

So much wrong with that statement.

Andy, what does fdatasync() do when you have a data-clean,
metadata-dirty file (e.g. you just punched a hole or preallocated
more space via fallocate())? Hint: it doesn't sync any data
because the mapping tree is clean, but it still syncs the dirty
metadata needed to access the data.

Now, what does a file where we do direct IO writes look like? Yup,
the mapping tree always remains clean and so it's only ever going to
appear to the kernel as a *data-clean, metadata-dirty* file. So,
after a direct IO write is done, what operation do we need to run to
ensure that we can always access the data?

Yup, it's fdatasync().

So, what does a DAX file that does userspace data flushes look like
to the kernel? Yup, again the mapping tree always remains clean and
so it's only ever going to be a *data-clean, metadata-dirty* file.

It should be clear now why I said "fdatasync on every page fault"
because that's exactly the mechanism we'd use to implement this
functionality....

It should also be clear that DAX is not introducing any new data
integrity problems to the filesystems that direct IO hasn't already
introduced. Both DAX with userspace data sync and Direct IO writes
are completely untracked by the kernel. IOWs, direct IO is a form
of "kernel bypass", just like DAX+userspace data sync is. All that
is different is the method by which data is written to the storage
media from userspace, which in the case of DAX is via mmap rather
than read/write.

> > "lightweight" option. That's the brute-force-with-big-hammer
> > solution - it's most definitely not lightweight as every page fault
> > has extra overhead to call ->fsync(). Sure, the API is simple, but
> > the runtime overhead is significant.
>
> It's lightweight in terms of its impact on the filesystem. It doesn't
> need any persistent setup -- you can just use it.

Well, no, that's wrong, because we have to co-ordinate multiple
concurrent accesses to the data in the kernel. What happens when
some other process writes to the file *at the same time* but does
not use userspace sync? We aren't tracking dirty regions on the
inode mapping because we've been told not to do that, so fsync()
from that other process *won't sync the data it wrote*. IOws, the
kernel has failed to provide the guarantee that userspace wants it
to provide.

The single mapping tree is central to the problem here - we can't
mix modes of dirty tracking across different processes. Either
everything uses userspace sync, or everything uses kernel controlled
dirty tracking so fsync() works correctly in all cases. Put simply -
dirty tracking is a per-inode function, not a per-file or per-vma
function.

As the direct IO kernel-bypass model demonstrates, as soon as you
start considering multi-process data coherency and durability with
mixed kernel+kernel bypass methods in play, lots of potential
problems and issues crop up that can't easily be solved by the
kernel or filesystems. We try to minimise the problems, but we don't
guarantee mixed mode coherency (and hence integrity) as we've
delegated data coherency and integrity responsibility to the app
bypassing the kernel data coherency and integrity mechanisms.

What I'd like to avoid is creating another kernel bypass mechanism
where we allow coherency and/or integrity to be fucked up in a way that
we can't fix without giving up all the performance that the kernel
bypass provides userspace apps. Constrain the cases where kernel
bypass is allowed, and we avoid all the crappy corner cases where
our only answer to users with corrupt data is "the man page advises
application developers not to do that".

If in future we work out how to implement everything without
needing immutable extents in the inode, we can relax the
restrictions we've placed on userspace DAX data sync....

> > Even if you are considering the complexity of the APIs, it's hardly
> > a "heavyweight" when it only requires a single call to fallocate()
> > before mmap() to set up the immutable extents on the file...
>
> So what would the exact semantics be? In particular, how can it fail?
> If I do the fallocate(), is it absolutely promised that the extent
> map won't get out of sync between what mmap sees and what's on disk?

That's precisely the guarantee I documented would be given by
immutable extents in my very first proposal.

> Do user programs need to worry about colliding with each other when
> one does fallocate() to DAXify a file and the other does fallocate()
> to unDAXify a file?

Yes, it can. This was one of the reasons for putting it under
privilege - so only the app has full control of the extent map
changes and nobody else can fuck with it.

> Does this particular fallocate() call still keep
> its effect after a reboot?

Yes, it does, because it has to be transparent and behave
consistently with all of userspace, not just the app that owns the
file, and not just while that app is running. (e.g. defrag could be
running on the file before the app starts, and then you're screwed
when defrag modifies the extent map after app startup...)

> Is there an actual concrete proposal that's reviewable?

Yes, the first posting where I proposed this functionality many
months ago spelled this all out in detail.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx