Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

From: Linus Torvalds
Date: Thu May 07 2015 - 11:00:14 EST


On Wed, May 6, 2015 at 7:36 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> My pet concrete example is covered by __pfn_t. Referencing persistent
> memory in an md/dm hierarchical storage configuration. Setting aside
> the thrash to get existing block users to do "bvec_set_page(page)"
> instead of "bvec->page = page" the onus is on that md/dm
> implementation and backing storage device driver to operate on
> __pfn_t. That use case is simple because there is no use of page
> locking or refcounting in that path, just dma_map_page() and
> kmap_atomic().

So clarify for me: are you trying to make the IO stack in general be
able to use the persistent memory as a source (or destination) for IO
to _other_ devices, or are you talking about just internally shuffling
things around for something like RAID on top of persistent memory?

Because I think those are two very different things.

For example, one of the things I worry about is for people doing IO
from persistent memory directly to some "slow stable storage" (aka
disk). That was what I thought you were aiming for: infrastructure so
that you can make a bio for a *disk* device contain a page list that
is the persistent memory.

And I think that is a very dangerous operation to do, because the
persistent memory itself is going to have some filesystem on it, so
anything that looks up the persistent memory pages is *not* going to
have a stable pfn: the pfn will point to a fixed part of the
persistent memory, but the file that was there may be deleted and the
memory reassigned to something else.
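
A minimal userspace sketch of that hazard, as a toy model only: the
frame array, the lowest-free-frame allocator, and every toy_* name are
hypothetical and stand in for nothing in the kernel. It captures a raw
frame number for a deferred IO, truncates the file, and lets the
allocator hand the same frame to a second file.

```c
#include <assert.h>
#include <string.h>

#define NR_PFNS  4
#define PFN_SIZE 16
#define NO_PFN   (-1)

/* Hypothetical toy "persistent memory": a few fixed-size frames. */
static char pmem[NR_PFNS][PFN_SIZE];
static int pfn_in_use[NR_PFNS];

struct toy_file {
    int pfn;    /* frame currently backing the file, or NO_PFN */
};

/* Hand out the lowest free frame, so a freed frame is reused at once. */
int toy_alloc_pfn(void)
{
    for (int i = 0; i < NR_PFNS; i++) {
        if (!pfn_in_use[i]) {
            pfn_in_use[i] = 1;
            return i;
        }
    }
    return NO_PFN;
}

void toy_write(struct toy_file *f, const char *data)
{
    if (f->pfn == NO_PFN)
        f->pfn = toy_alloc_pfn();
    assert(f->pfn != NO_PFN);
    strncpy(pmem[f->pfn], data, PFN_SIZE - 1);
    pmem[f->pfn][PFN_SIZE - 1] = '\0';
}

/* Truncate frees the frame immediately; nothing tracks outstanding users. */
void toy_truncate(struct toy_file *f)
{
    if (f->pfn != NO_PFN) {
        pfn_in_use[f->pfn] = 0;
        f->pfn = NO_PFN;
    }
}

/* An "IO" that knows only a raw frame number, not which file owns it. */
const char *toy_io_read_pfn(int pfn)
{
    return pmem[pfn];
}

/* Returns 1 if the deferred IO ends up reading the other file's data. */
int toy_stale_pfn_demo(void)
{
    struct toy_file a = { NO_PFN }, b = { NO_PFN };
    int captured;

    memset(pfn_in_use, 0, sizeof(pfn_in_use));
    toy_write(&a, "file A data");
    captured = a.pfn;                /* pfn looked up for a later IO */
    toy_truncate(&a);                /* file A goes away */
    toy_write(&b, "file B secret");  /* same frame is reassigned */
    return strcmp(toy_io_read_pfn(captured), "file B secret") == 0;
}
```

In this model the captured frame number stays valid as an address, but
what it names has changed, so the deferred read returns file B's
contents.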

That's the kind of thing that "struct page" helps with for normal IO
devices. It's both a source of serialization and indirection, so that
when somebody does a "truncate()" on a file, we don't end up doing IO
to random stale locations on the disk that got reassigned to another
file.
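
The indirection half of that can be sketched in the same toy style.
This is an assumption-laden illustration, not the kernel's struct page:
struct toy_page, its refcount field, and the toy_page_* helpers are all
hypothetical. The point it tries to show is only that a reference held
by an in-flight IO keeps the frame from being reassigned after the file
drops it.

```c
#include <assert.h>
#include <string.h>

#define NR_FRAMES  4
#define FRAME_SIZE 16

/* Hypothetical per-frame descriptor; refcount 0 means "free for reuse". */
struct toy_page {
    int refcount;
    char data[FRAME_SIZE];
};

static struct toy_page frames[NR_FRAMES];

/* Allocate the lowest free frame; the owner starts with one reference. */
struct toy_page *toy_page_alloc(void)
{
    for (int i = 0; i < NR_FRAMES; i++) {
        if (frames[i].refcount == 0) {
            frames[i].refcount = 1;
            return &frames[i];
        }
    }
    return NULL;
}

void toy_page_get(struct toy_page *p)
{
    assert(p->refcount > 0);
    p->refcount++;
}

void toy_page_put(struct toy_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

void toy_page_fill(struct toy_page *p, const char *s)
{
    strncpy(p->data, s, FRAME_SIZE - 1);
    p->data[FRAME_SIZE - 1] = '\0';
}

/* Returns 1 if the pinned frame survives truncate and is not reused. */
int toy_refcount_demo(void)
{
    struct toy_page *a, *b;
    int ok;

    memset(frames, 0, sizeof(frames));
    a = toy_page_alloc();
    toy_page_fill(a, "file A data");

    toy_page_get(a);         /* in-flight IO pins the frame */
    toy_page_put(a);         /* truncate drops the file's reference */

    b = toy_page_alloc();    /* new file must get a different frame */
    toy_page_fill(b, "file B secret");

    ok = (b != a) && strcmp(a->data, "file A data") == 0;
    toy_page_put(a);         /* IO completes; frame is reusable now */
    return ok && a->refcount == 0;
}
```

The sketch covers only the lifetime side; the serialization side
(making truncate and IO wait for each other) is not modeled here.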

So "struct page" is very fundamental. It's *not* just a "this is the
physical source/drain of the data you are doing IO on".

So if you are looking at some kind of "zero-copy IO", where you can do
IO from a filesystem on persistent storage to *another* filesystem
(say, on a big rotational disk used for long-term storage) by just
doing a bio that targets the disk, but has the persistent memory as
the source memory, I really want to understand how you are going to
serialize this.

So *that* is what I meant by "What is the primary thing that is
driving this need? Do we have a very concrete example?"

I absolutely do *not* want to teach the bio subsystem to just
randomly be able to take the source/destination of the IO as being
some random pfn without knowing what the actual uses are and how these
IO's are generated in the first place.

I was assuming that you wanted to do something where you mmap() the
persistent memory, and then write it out to another device (possibly
using aio_write()). But that really does require some kind of
serialization at a higher level, because you can't just look up the
pfn's in the page table and assume they are stable: they are *not*
stable.

Linus