Re: [GIT PULL] PMEM driver for v4.1

From: Ingo Molnar
Date: Wed Apr 15 2015 - 04:45:15 EST

* Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> > None of this gives me warm fuzzy feelings...
> >
> > ... has anyone explored the possibility of putting 'struct page'
> > into the pmem device itself, essentially using it as metadata?
> Yes, the impetus for proposing the pfn conversion of the block layer
> was the consideration that persistent memory may have less write
> endurance than DRAM. The kernel preserving write endurance
> exclusively for user data and the elimination of struct page
> overhead motivated the patchset [1].
> [1]:

(Is there a Git URL where I could take a look at these patches?)

But, I think the usage of pfn's in the block layer is relatively
independent of the question whether a pmem region should be
permanently struct page backed or not.

I think the main confusion comes from the fact that 'pfn' can have two
roles with sufficiently advanced MMIO interfaces: describing a main RAM
page (struct page), but also describing essentially sectors on a
large, MMIO-accessible storage device, directly visible to the CPU but
otherwise not RAM.

So for that reason I think pmem devices should be both struct page
backed and not struct page backed, depending on their physical
properties:

1)

If a pmem device is in any way expected to be write-unreliable (i.e.
it's not DRAM but flash) then it's going to be potentially large and
we simply cannot use struct page backing for it, full stop.
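As a rough illustration of why device size matters here, the sketch below
estimates how much metadata struct page backing would need. The 64-byte
struct page and 4 KiB page size are assumptions (typical for x86-64), not
figures from this thread, and the function name is made up for the example.

```c
#include <stdint.h>

/* Assumptions, not taken from this thread: 64 bytes per struct page
 * and 4 KiB pages, which is typical for x86-64 kernels. */
#define ASSUMED_STRUCT_PAGE_SIZE 64ULL
#define ASSUMED_PAGE_SIZE        4096ULL

/* Bytes of struct page metadata needed to back 'device_bytes' of pmem. */
static uint64_t struct_page_overhead(uint64_t device_bytes)
{
	uint64_t pages = device_bytes / ASSUMED_PAGE_SIZE;

	return pages * ASSUMED_STRUCT_PAGE_SIZE;
}
```

Under those assumptions a 1 TiB device would need 16 GiB of struct page
array, about 1.6% of its capacity, and that array has to live somewhere.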

Users very likely want a filesystem on it, with double buffering that
both reduces wear and makes better use of main RAM and CPU caches.

In this case the pmem device is a simple storage device that has a
refreshingly clean hardware ABI that exposes all of its contents in a
large, directly mapped MMIO region in essence.

We don't back mass storage with struct page, we never did with any of
the other storage devices either.

I'd expect this to be the 90% dominant 'pmem usecase' in the future.

In this case 'direct mapping' system calls, DIO, non-double-buffering
mmap()s and DAX will, on the other hand, stay 'weird' secondary
usecases for user-space operating systems like databases that want to
take caching out of the hands of the kernel.

The majority of users will use it as storage, with a filesystem on it
and regular RAM caching it for everyone's gain. All the struct page
based APIs and system calls will work just fine, and the rare usecases
will be served by DAX.



2)

But if a pmem device is RAM, with no write unreliability, then we
obviously want it to have struct page backing, and we probably want to
think about it more in terms of hot-pluggable memory, than a storage
device.

This scenario will be less common than the mass-storage scenario.

Note that this is similar to how GPU memory is categorized: it's
essentially RAM-alike, which naturally results in struct page backing.


Note that scenarios 1) and 2) are not under our control, they are
essentially a physical property, with some user policy influencing it
as well. So we have to support both and we have no 'opinion' about
which one is right, as it's simply physical reality as-is.

In that sense I think this driver does the right thing as a first
step: it exposes pmem regions in the more conservative fashion, as a
block storage device, assuming write unreliability.

Patches that would make the pmem driver unconditionally struct
page backed would be misguided for this usecase. Allocating and
freeing struct page arrays on the fly would be similarly misguided.

But patches that allow pmem regions that declare themselves true RAM
to be inserted as hotplug memory would be the right approach IMHO -
while still preserving the pmem block device and the non-struct-page
backed approach for other pmem devices.

Note how in this picture the question of how IO scatter-gather lists
are constructed is an implementation detail that does not impact the
main design: they are essentially DMA abstractions for storage
devices, implemented efficiently via memcpy() in the pmem case, and
both pfn lists and struct page lists are pretty equivalent approaches
for most usages.
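A minimal user-space sketch of that idea: a transfer described by a list
of pfns and served by memcpy() from a directly mapped region. The
descriptor layout, the names and the 4 KiB page size are all hypothetical,
chosen for illustration; this is not the pmem driver's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical descriptor: one segment of a transfer, named by pfn. */
struct pfn_seg {
	uint64_t pfn;		/* page index within the mapped region */
	uint32_t offset;	/* byte offset inside that page */
	uint32_t len;		/* bytes to copy, must stay within the page */
};

/*
 * Copy a pfn-described scatter list out of a directly mapped region
 * into 'buf'. Returns the number of bytes copied, or 0 if any segment
 * falls outside the region, crosses a page, or overflows 'buf'.
 */
static size_t pmem_sketch_read(const uint8_t *region, uint64_t region_pages,
			       const struct pfn_seg *segs, size_t nsegs,
			       uint8_t *buf, size_t buf_len)
{
	size_t done = 0;

	for (size_t i = 0; i < nsegs; i++) {
		const struct pfn_seg *s = &segs[i];

		if (s->pfn >= region_pages ||
		    s->offset > SKETCH_PAGE_SIZE ||
		    s->len > SKETCH_PAGE_SIZE - s->offset ||
		    s->len > buf_len - done)
			return 0;

		/* The "DMA" is just a CPU copy from the mapped region. */
		memcpy(buf + done,
		       region + s->pfn * SKETCH_PAGE_SIZE + s->offset,
		       s->len);
		done += s->len;
	}
	return done;
}
```

A struct page based list would look the same apart from how each segment
names its source page, which is the sense in which the two are equivalent
here.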

The only exceptions are the 'weird' usecases like DAX, DIO and RDMA:
these have to be pfn driven, due to the lack of struct page
descriptors for storage devices in general. In that case the 'pfn'
isn't really memory, but a sector_t equivalent, for this new type of
storage DMA that is implemented via a memcpy().
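To make the "pfn as a sector_t equivalent" reading concrete, here is a
small sketch that converts between a pfn inside a device's MMIO window and
a 512-byte sector number. The helper names, the base_pfn parameter and the
4 KiB page size are assumptions for illustration, not existing kernel
helpers.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12	/* assumption: 4 KiB pages */
#define SKETCH_SECTOR_SHIFT 9	/* 512-byte sectors, the block layer unit */

/*
 * First 512-byte sector covered by the page at 'pfn', where 'base_pfn'
 * is the first pfn of the device's MMIO window (hypothetical parameter).
 */
static uint64_t sketch_pfn_to_sector(uint64_t pfn, uint64_t base_pfn)
{
	return (pfn - base_pfn) << (SKETCH_PAGE_SHIFT - SKETCH_SECTOR_SHIFT);
}

/* The pfn whose page contains 'sector'. */
static uint64_t sketch_sector_to_pfn(uint64_t sector, uint64_t base_pfn)
{
	return base_pfn + (sector >> (SKETCH_PAGE_SHIFT - SKETCH_SECTOR_SHIFT));
}
```

With these assumptions each pfn spans eight sectors, so the mapping is a
pure shift plus an offset: an address on the device, not a description of
a RAM page.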

In that sense the special DAX page fault handler looks like a natural
approach as well: the pfn's in the page table aren't really describing
memory pages, but 'sectors' on an IO device - with special rules,
limited APIs and ongoing complications to be expected.

At least that's how I see it.

