Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

From: Xiao Guangrong
Date: Fri Sep 09 2016 - 05:00:59 EST




On 09/09/2016 07:04 AM, Dan Williams wrote:
On Thu, Sep 8, 2016 at 3:56 PM, Ross Zwisler
<ross.zwisler@xxxxxxxxxxxxxxx> wrote:
On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
[ adding linux-fsdevel and linux-nvdimm ]

On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
<guangrong.xiao@xxxxxxxxxxxxxxx> wrote:
[..]
However, it is not easy to handle the case where the new VMA overlaps the old
VMA that userspace has already read. I think we have some choices:
1: Skip the new VMA region completely, as the current kernel code does. I do
not think this is good, as the later VMAs will be dropped.

2: Show only the non-overlapping portion of the new VMA. In your case, we would
just show the region (0x2000 -> 0x3000). However, this does not work well if
the VMA is a newly created region with different attributes.

3: Show the new VMA completely, as this patch does.

Which one do you prefer?
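
For illustration only, option 2 above could be sketched in plain C as follows. This is not the kernel's code: `struct range` and `unshown_portion()` are hypothetical names, and the scenario in the usage note (userspace has already read up to 0x2000, and the new VMA spans 0x1000 -> 0x3000) is an assumption chosen so the result matches the (0x2000 -> 0x3000) region mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a VMA: half-open range [start, end). */
struct range {
    unsigned long start;
    unsigned long end;
};

/*
 * Option 2: given the end address of what userspace has already been shown
 * and the VMA found on the next read, compute the portion not yet shown.
 * Returns false when the VMA lies entirely below last_shown_end, i.e. there
 * is nothing new to report.
 */
static bool unshown_portion(unsigned long last_shown_end,
                            struct range vma, struct range *out)
{
    assert(vma.start < vma.end);

    if (vma.end <= last_shown_end)
        return false;

    /* Clip the start so already-reported addresses are not repeated. */
    out->start = vma.start > last_shown_end ? vma.start : last_shown_end;
    out->end = vma.end;
    return true;
}
```

With `last_shown_end = 0x2000` and a VMA of 0x1000 -> 0x3000, the function reports 0x2000 -> 0x3000. Note that it only clips addresses; it does nothing about the "different attributes" problem raised in option 2.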


I don't have a preference, but perhaps this breakage and uncertainty
is a good opportunity to propose a more reliable interface for NVML to
get the information it needs?

My understanding is that it is looking for the VM_MIXEDMAP flag which
is already ambiguous for determining if DAX is enabled even if this
dynamic listing issue is fixed. XFS has arranged for DAX to be a
per-inode capability and has an XFS-specific inode flag. We can make
that a common inode flag, but it seems we should have a way to
interrogate the mapping itself in the case where the inode is unknown
or unavailable. I'm thinking extensions to mincore to have flags for
DAX and possibly whether the page is part of a pte, pmd, or pud
mapping. Just floating that idea before starting to look into the
implementation, comments or other ideas welcome...
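
As background for the mincore idea, here is a minimal sketch of how mincore(2) is used today on Linux. Only the least significant bit of each byte in the result vector is defined (page resident or not); the remaining bits are reserved, which is where flags of the kind proposed above could live. No such DAX or mapping-size flags exist in this sketch, and the helper name `first_page_resident()` is made up for the example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for mincore() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Maps one anonymous page, touches it, and asks mincore() about it.
 * Returns 1 if the page is reported resident, 0 if not, -1 on error.
 */
static int first_page_resident(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char vec[1];   /* one byte per page in the queried range */
    void *addr;
    int ret;

    assert(page > 0);

    addr = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    memset(addr, 0xab, (size_t)page);   /* fault the page in */

    if (mincore(addr, (size_t)page, vec) != 0)
        ret = -1;
    else
        ret = vec[0] & 1;   /* only bit 0 is defined today */

    munmap(addr, (size_t)page);
    return ret;
}
```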

I think this goes back to our previous discussion about support for the PMEM
programming model. Really I think what NVML needs isn't a way to tell if it
is getting a DAX mapping, but whether it is getting a DAX mapping on a
filesystem that fully supports the PMEM programming model. This of course is
defined to be a filesystem where it can do all of its flushes from userspace
safely and never call fsync/msync, and that allocations that happen in page
faults will be synchronized to media before the page fault completes.

IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
everything or can I rely fully on flushes from userspace?"

For all existing implementations, I think the answer is "you need to use
fsync/msync" because we don't yet have proper support for the PMEM programming
model.
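
To make the "fsync/msync or userspace flushes" decision concrete, a rough sketch of what a library might do is below. Everything here is an assumption for illustration: `persist_range()` and its `userspace_flush_safe` argument are hypothetical, no existing interface supplies that answer (which is the point of this thread), the 64-byte cache line size is assumed, and CLFLUSH followed by SFENCE is just one x86 way to flush from userspace. On other architectures the sketch always falls back to msync().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_clflush(); also pulls in _mm_sfence() */
#endif

#define CACHELINE 64UL   /* assumed cache line size */

/*
 * Make [addr, addr + len) durable.
 * userspace_flush_safe is a hypothetical answer to "can I rely fully on
 * flushes from userspace?"; when it is 0 we use msync().
 * Returns 0 on success, -1 on error (errno set by msync()).
 */
static int persist_range(void *addr, size_t len, int userspace_flush_safe)
{
    uintptr_t start = (uintptr_t)addr;

    assert(len > 0);

#if defined(__x86_64__)
    if (userspace_flush_safe) {
        uintptr_t p;

        /* Flush every cache line that overlaps the range. */
        for (p = start & ~(CACHELINE - 1); p < start + len; p += CACHELINE)
            _mm_clflush((const void *)p);
        _mm_sfence();   /* order the flushes before later stores */
        return 0;
    }
#else
    (void)userspace_flush_safe;
#endif

    /* msync() requires a page-aligned start address. */
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t aligned = start & ~(page - 1);

    return msync((void *)aligned, len + (start - aligned), MS_SYNC);
}
```

Per the paragraph above, a caller on any existing implementation would pass 0 for `userspace_flush_safe`.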

My best idea of how to support this was a per-inode flag similar to the one
supported by XFS that says "you have a PMEM capable DAX mapping", which NVML
would then interpret to mean "you can do flushes from userspace and be fully
safe". I think we really want this interface to be common over XFS and ext4.

If we can figure out a better way of doing this interface, say via mincore,
that's fine, but I don't think we can detangle this from the PMEM API
discussion.

Whether a persistent memory mapping requires an msync/fsync is a
filesystem specific question. This mincore proposal is separate from
that. Consider device-DAX for volatile memory or mincore() called on
an anonymous memory range. In those cases persistence and filesystem
metadata are not in the picture, but it would still be useful for
userspace to know "is there page cache backing this mapping?" or "what
is the TLB geometry of this mapping?".

I have a question about msync/fsync that is beyond the topic of this thread :)

Whether msync/fsync can make data persistent depends on the ADR feature of the
memory controller. If it exists, everything works well; otherwise, we need
another interface, which is where the 'Flush hint table' in ACPI comes in. The
'Flush hint table' is particularly useful for nvdimm virtualization if we use
normal memory to emulate an nvdimm with persistent data (the data will be
flushed to persistent storage, e.g., a disk).

Does the current PMEM programming model fully support the 'Flush hint table'? Is
userspace allowed to use these addresses?

Thanks!