Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer

From: Boaz Harrosh
Date: Wed Mar 18 2015 - 10:39:09 EST

On 03/18/2015 03:06 PM, Matthew Wilcox wrote:
> On Wed, Mar 18, 2015 at 12:47:21PM +0200, Boaz Harrosh wrote:
>> God! Look at this endless list of files and it is only the very beginning.
>> It does not even work and touches only 10% of what will need to be touched
>> for this to work, and very very marginally at that. There will always be
>> "another subsystem" that will not work. For example NUMA how will you do
>> NUMA aware pmem? and this is just a simple example. (I'm saying NUMA
>> because our tests show a huge drop in performance if you do not do
>> NUMA aware allocation)
> You're very entertaining, but please, tone down your emails and stick
> to facts. The BIOS presents the persistent memory as one table entry
> per NUMA node, so you get one block device per NUMA node. There's no
> mixing of memory from different NUMA nodes within a single filesystem,
> unless you have a filesystem that uses multiple block devices.

Not with current BIOSes: if the ranges are physically contiguous then they
are presented as one range (DDR3 BIOS). But I agree it is a bug, and in our
configuration we separate them into different pmem devices.

Yes, I meant a "filesystem that uses multiple block devices".

>> I'm not the one afraid of hard work, if it was for a good cause, but for what?
>> really for what? The block layer, and RDMA, and networking, and spline, and what
>> ever the heck any one wants to imagine to do with pmem, already works perfectly
>> stable. right now!
> The overhead. Allocating a struct page for every 4k page in a 400GB DIMM
> (the current capacity available from one NV-DIMM vendor) occupies 6.4GB.
> That's an unacceptable amount of overhead.

So let's fix the stacks to work nicely with 2M pages. That said, we could
also allocate the struct pages from pmem itself if we need to. The fact
remains that we need state down the different stacks, and this is the
current design overall.

I hate it that you introduce a dual design, pfn-or-page, and all the
combinations of the two. It is too much ugliness for my guts. I would
like a unified design that runs across the whole stack. We already have
too much duplication for my taste, and I would love to see more
unification, not more splitting.

But the most important question for me is whether we have to sacrifice
the short term for the long term. A massive change such as you are
proposing will take years, for a theoretical 400GB DIMM. What about the
4G DIMMs in people's hands right now, must they wait?
(Though I still do not agree with your design.)

I love the SPARSE model of the "section", with the page being its
own identity relative to the virtual address & PFN of the section. We
could think of a much smaller page struct that holds only a refcount
and flags, and a bigger page type for regular use: separate out the
low, common part of the page, lay down clear rules about its use,
and have a high part that is per user. But let us think of a unified
design throughout. (Most members of struct page are accessed through
wrappers, so it would be relatively easy to split.)
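The low/high split suggested above could be sketched roughly like this.
These names and layouts are purely hypothetical, not an existing kernel
API; the point is only that a minimal per-pfn head can stay tiny while
the regular-use state hangs off a larger type:

```c
#include <stdint.h>

/* Minimal common part: one per pfn, even for pmem. */
struct page_head {
	uint64_t flags;
	int32_t  refcount;
};

/* Regular-use part: allocated only where the extra state is actually
 * needed (page cache, anon rmap, etc.), with the common head first so
 * code that only needs refcount/flags never sees the difference. */
struct page_full {
	struct page_head head;
	void		*mapping;
	uint64_t	 index;
	void		*private_data;
};
```

The existing accessor wrappers would then dispatch on which of the two
types backs a given pfn.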

And let us not sacrifice the now for the "far tomorrow"; we should
be able to do this incrementally, wasting more space now and saving
it later.

[We can even invent a sizeless page. You know how we encode
the section ID directly into the 64-bit address of the page,
so we can have a flag at the section that says "this is a
zero-size-page section" and the needed info is stored in
the section object. But I still think you will need state
per page, and that we do need a minimal size.]

[BTW: The only 400GB DIMM I know of is real flash, not directly
mapped to the CPU. OK, maybe read-only, but the erase/write cycle
makes it logical-to-physical managed and not directly accessed.]

And a personal note: I mean only to entertain. If anyone feels
I "toned up", please forgive me; I meant no such thing. As a rule,
if I come across strongly, then please just laugh and don't take me
seriously. I only mean scientific soundness.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx