Re: [PATCH v3] libnvdimm, region: sysfs trigger for nvdimm_flush()

From: Dan Williams
Date: Thu Apr 27 2017 - 16:02:52 EST


On Thu, Apr 27, 2017 at 12:40 PM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> Dan Williams <dan.j.williams@xxxxxxxxx> writes:
>
>> On Thu, Apr 27, 2017 at 11:41 AM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
>>> Dan Williams <dan.j.williams@xxxxxxxxx> writes:
>>>
>>>>> The sentiment is that programs shouldn't have to grovel around in sysfs
>>>>> to do stuff related to an open file descriptor or mapping. I don't take
>>>>> issue with the name. I do worry that something like 'wpq_drain' may be
>>>>> too platform specific, though. The NVM Programming Model specification
>>>>> is going to call this "deep flush", so maybe that will give you
>>>>> some inspiration if you do want to change the name.
>>>>
>>>> I'll change to "deep_flush", and I quibble that this is related to a
>>>> single open file descriptor or mapping. It really is a "region flush"
>>>> for giving extra protection for global metadata, but the persistence
>>>> of individual fds or mappings is handled by ADR. I think an ioctl
>>>> might give the false impression that every time you flush a cacheline
>>>> to persistence you need to call the ioctl.
>>>
>>> fsync, for example, may affect more than one fd--all data in the drive
>>> write cache will be flushed. I don't see how this is so different. I
>>> think a sysfs file is awkward because it requires an application to
>>> chase down the correct file in the sysfs hierarchy. If the application
>>> already has an open fd or a mapping, it should be able to operate on
>>> that.
>>
>> I'm teetering, but still leaning towards sysfs. The use case that
>> needs this is device-dax because we otherwise silently do this behind
>> the application's back on filesystem-dax for fsync / msync.
>
> We may yet get file system support for flush from userspace (NOVA, for
> example). So I don't think we should restrict ourselves to only
> thinking about the device dax use case.
>
>> A device-dax ioctl would be straightforward, but 'deep flush' assumes
>> that the device-dax instance is fronting persistent memory. There's
>> nothing persistent memory specific about device-dax except that today
>> only the nvdimm sub-system knows how to create them, but there's
>> nothing that prevents other memory regions from being mapped this way.
>
> You're concerned that applications operating on device dax instances
> that are not backed by pmem will try to issue a deep flush? Why would
> they do that, and why can't you just return failure from the ioctl?
>
>> So I'd rather this persistent memory specific mechanism stay with the
>> persistent memory specific portion of the interface rather than plumb
>> persistent memory details out through the generic device-dax interface
>> since we have no other intercept point like we do in the
>> filesystem-dax case to hide this flush.
>
> Look at the block layer. You can issue an ioctl on a block device, and
> if the generic block layer can handle it, it does. If not, it gets
> passed down to lower layers until either it gets handled, or it bubbles
> back up because nobody knew what to do with it. I think you can do the
> same thing here, and that solves your layering violation.

So this is where I started. I was going to follow the block layer.
Except recently the block layer has been leaning away from ioctls and
implementing support for syscalls directly. The same approach for
device-dax fallocate() support got NAKd, so I opted for sysfs out of
the gate.

However, since there really is no analog for "deep flush" in the
syscall namespace lets (*gag*) implement an ioctl for this.