Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()

From: Dan Williams
Date: Thu Apr 27 2017 - 17:36:27 EST


On Thu, Apr 27, 2017 at 1:02 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Thu, Apr 27, 2017 at 12:40 PM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
>> Dan Williams <dan.j.williams@xxxxxxxxx> writes:
>>
>>> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
>>>> Dan Williams <dan.j.williams@xxxxxxxxx> writes:
>>>>
>>>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>>>> to do stuff related to an open file descriptor or mapping. I don't take
>>>>>> issue with the name. I do worry that something like 'wpq_drain' may be
>>>>>> too platform specific, though. The NVM Programming Model specification
>>>>>> is going to call this "deep flush", so maybe that will give you
>>>>>> some inspiration if you do want to change the name.
>>>>>
>>>>> I'll change to "deep_flush", and I quibble that this is related to a
>>>>> single open file descriptor or mapping. It really is a "region flush"
>>>>> for giving extra protection for global metadata, but the persistence
>>>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>>>> might give the false impression that every time you flush a cacheline
>>>>> to persistence you need to call the ioctl.
>>>>
>>>> fsync, for example, may affect more than one fd--all data in the drive
>>>> write cache will be flushed. I don't see how this is so different. I
>>>> think a sysfs file is awkward because it requires an application to
>>>> chase down the correct file in the sysfs hierarchy. If the application
>>>> already has an open fd or a mapping, it should be able to operate on
>>>> that.
>>>
>>> I'm teetering, but still leaning towards sysfs. The use case that
>>> needs this is device-dax because we otherwise silently do this behind
>>> the application's back on filesystem-dax for fsync / msync.
>>
>> We may yet get file system support for flush from userspace (NOVA, for
>> example). So I don't think we should restrict ourselves to only
>> thinking about the device dax use case.
>>
>>> A device-dax ioctl would be straightforward, but 'deep flush' assumes
>>> that the device-dax instance is fronting persistent memory. There's
>>> nothing persistent memory specific about device-dax except that today
>>> only the nvdimm sub-system knows how to create them, but there's
>>> nothing that prevents other memory regions from being mapped this way.
>>
>> You're concerned that applications operating on device dax instances
>> that are not backed by pmem will try to issue a deep flush? Why would
>> they do that, and why can't you just return failure from the ioctl?
>>
>>> So I'd rather this persistent memory specific mechanism stay with the
>>> persistent memory specific portion of the interface rather than plumb
>>> persistent memory details out through the generic device-dax interface
>>> since we have no other intercept point like we do in the
>>> filesystem-dax case to hide this flush.
>>
>> Look at the block layer. You can issue an ioctl on a block device, and
>> if the generic block layer can handle it, it does. If not, it gets
>> passed down to lower layers until either it gets handled, or it bubbles
>> back up because nobody knew what to do with it. I think you can do the
>> same thing here, and that solves your layering violation.
>
> So this is where I started. I was going to follow the block layer.
> Except recently the block layer has been leaning away from ioctls and
> implementing support for syscalls directly. The same approach for
> device-dax fallocate() support got NAKd, so I opted for sysfs out of
> the gate.
>
> However, since there really is no analog for "deep flush" in the
> syscall namespace lets (*gag*) implement an ioctl for this.

So I think about it for 2 seconds and now I'm veering back to sysfs.
We don't want applications calling this ioctl on filesystem-dax fds. I
simply can't bring myself to do the work to pick a unique ioctl number
that is known to be unique for the full filsystem and block-layer
paths when we can put this interface right where it belongs in nvdimm
specific sysfs.