Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

From: Andy Lutomirski
Date: Mon Jun 19 2017 - 11:24:07 EST


On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
>> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> >> My other objection is that the syscall intentionally leaks a reference
>> >> to the file. This means it needs overflow protection and it probably
>> >> shouldn't ever be allowed to use it without privilege.
>> >
>> > We only hold the one reference while S_DAXFILE is set, so I think the
>> > protection is there, and per Dave's original proposal this requires
>> > CAP_LINUX_IMMUTABLE.
>> >
>> >> Why can't the underlying issue be easily fixed, though? Could
>> >> .page_mkwrite just make sure that metadata is synced when the FS uses
>> >> DAX?
>> >
>> > Yes, it most definitely could and that idea has been floated.
>> >
>> >> On a DAX fs, syncing metadata should be extremely fast.
>
> <sigh>
>
> This again....
>
> Persistent memory means the *I/O* is fast. It does not mean that
> *complex filesystem operations* are fast.
>
> Don't forget that there's an shitload of CPU that gets burnt to make
> sure that the metadata is synced correctly. Do that /synchronously/
> on *every* write page fault (which, BTW, modify mtime, so will
> always have dirty metadata to sync) and now you have a serious
> performance problem with your "fast" DAX access method.

I think the mtime issue can and should be solved separately. But it'
s a fair point that there would be workloads for which this could be
excessively expensive. In particular, simply creating a file,
mmapping a large range, and touching the pages one by one -- delalloc
would be completely defeated.

But here's a strawman for solving both issues. First, mtime. I
consider it to be either a bug or a misfeature that .page_mkwrite
*ever* dirties an inode just to update mtime. I have old patches to
fix this, and those patches could be updated and merged. With them
applied, there's just a set_bit() in .page_mkwrite() to handle mtime.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4

Second: syncing extents. Here's a straw man. Forget the mmap() flag.
Instead add a new msync() operation:

msync(start, length, MSYNC_PMEM_PREPARE_WRITE);

If this operation succeeds, it guarantees that all future writes
through this mapping on this range will hit actual storage and that
all the metadata operations needed to make this write persistent will
hit storage such that they are ordered before the user's writes.

As an implementation detail, this will flush out the extents if
needed. In addition, if the FS has any mechanism that would cause
problems asyncronously later on (dedupe? deallocated extents full of
zeros? defrag?), it may also need to set a flag on the VMA that
changes the behavior of future .page_mkwrite operations.

(On x86, for example, this would permit the FS to do WC/streaming
writes without SFENCE if the FS were structured in a way that this
worked.)

Now we have an API that should work going forward without introducing
baggage. And XFS is free to implement this API by making the entire
file act like a swap file if XFS wants to do so, but this doesn't
force other filesystems (ext4? NOVA?) to do the same thing.

>
> And that's before we even consider all the problems with running
> sync operations in page fault context....
>
>> >> This
>> >> could be conditioned on an madvise or mmap flag if performance might
>> >> be an issue. As far as I know, this change alone should be
>> >> sufficient.
>> >
>> > The hang up is that it requires per-fs enabling as it needs to be
>> > careful to manage mmap_sem vs fs journal locks for example. I know the
>> > in-development NOVA [1] filesystem is planning to support this out of
>> > the gate. ext4 would be open to implementing it, but I think xfs is
>> > cold on the idea. Christoph originally proposed it here [2], before
>> > Dave went on to propose immutable semantics.
>>
>> Hmm. Given a choice between a very clean API that works without
>> privilege but is awkward to implement on XFS and an awkward-to-use
>> API, I'd personally choose the former.
>
> Yup, you have the choice of a clean kernel API that will be
> substantially slower than the existing "dirty page" tracking and
> having the app run fsync() when necessary, or having to do a little
> more work in a library routine that preallocates a file and sets a
> flag on it?
>
> The apps will use the library API, not the kernel API, so who really
> cares if there's a few steps to setting up the file state
> appropriately?
>
>> Dave, even with the lock ordering issue, couldn't XFS implement
>> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:
>>
>> if (metadata is dirty) {
>> up_write(&mmap_sem);
>> sync the metadata;
>> down_write(&mmap_sem);
>> return 0; /* retry the fault */
>> } else {
>> return whatever success code;
>> }
>
> How do you know that there is dependent filesystem metadata that
> needs syncing at a level that you can safely manipulate the
> mmap_sem? And how, exactly, do you do this without races?

I have no idea, but I expect that all the locking issues are solvable.

> It'd be
> trivial to DOS such retryable DAX faults simply by touching the file
> in a tight loop in a separate process...

If the code were smart enough to only cause a retry when the extent
being touched is dirty, this problem wouldn't exist.