Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer

From: Boaz Harrosh
Date: Thu Mar 19 2015 - 11:54:29 EST

On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
> Dan missed "Support O_DIRECT to a mapped DAX file". More generally, if we
> want to be able to do any kind of I/O directly to persistent memory,
> and I think we do, we need to do one of:
> 1. Construct struct pages for persistent memory
> 1a. Permanently
> 1b. While the pages are under I/O
> 2. Teach the I/O layers to deal in PFNs instead of struct pages
> 3. Replace struct page with some other structure that can represent both
> I'm personally a fan of #3, and I was looking at the scatterlist as
> my preferred data structure. I now believe the scatterlist as it is
> currently defined isn't sufficient, so we probably end up needing a new
> data structure. I think Dan's preferred method of replacing struct
> pages with PFNs is actually less instrusive, but doesn't give us as
> much advantage (an entirely new data structure would let us move to an
> extent based system at the same time, instead of sticking with an array
> of pages). Clearly Boaz prefers 1a, which works well enough for the
> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
> What's your preference? I guess option 0 is "force all I/O to go
> through the page cache and then get copied", but that feels like a nasty
> performance hit.

Thanks Matthew, you have summarized it perfectly.

I think #1b might have merit, as well. I have a very surgical small
"hack" that we can do with allocating on demand pages before IO.
It involves adding a new MEMORY_MODEL policy that is derived from
SPARSEMEM but lets you allocate individual pages on demand. And a new
type of page say call it GP_emulated_page.
(Tell me if you find this interesting. It is 1/117 in size of both
#2 or #3)

In anyway please reconsider a configurable #1a for people that do
not mind sacrificing 1.2% of their pmem for real pages.

Even at 6G page-structs with 400G pmem, people would love some of the stuff
this gives them today. just few examples: direct_access from within a VM to
an host defined pmem, is trivial with no extra code with my two simple #1a
patches. RDMA memory brick targets, network shared memory FS and so on, the
list will always be bigger then any of #1b #2 or #3. Yes for people that
want to sacrifice the extra cost.

In the Kernel it was always about choice and diversity. And what does it
costs us. Nothing. Two small simple patches and a Kconfig option.
Note that I made it in such a way that if pmem is configured without
use of pages, then the mm code is *not* configured-in automatically.
We can even add a runtime option that even if #1a is enabled, for certain
pmem device may not want pages allocated. And so choose at runtime rather
than compile time.

I think this will only farther our cause and let people advance with
their research and development with great new ideas about use of pmem.
Then once there is a great demand for #1a and those large 512G devices
come out, we can go the #1b or #3 route and save them the extra 1.2%
memory, but once they have the appetite for it. (And Andrews question
becomes clear)

Our two ways need not be "either-or". They can be "have both". I think
choice is a good thing for us here. Even with #3 available #1a still has
merit in some configurations and they can co exist perfectly.

Please think about it?


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at