Re: [PATCH v4 2/2] mm/gup/writeback: add callbacks for inaccessible pages
From: Peter Zijlstra
Date: Wed Apr 15 2020 - 18:18:47 EST
On Wed, Apr 15, 2020 at 02:52:31PM -0700, Dave Hansen wrote:
> On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
> > + /*
> > + * We need to make the page accessible if and only if we are going
> > + * to access its content (the FOLL_PIN case). Please see
> > + * Documentation/core-api/pin_user_pages.rst for details.
> > + */
> > + if (flags & FOLL_PIN) {
> > + ret = arch_make_page_accessible(page);
> > + if (ret) {
> > + unpin_user_page(page);
> > + page = ERR_PTR(ret);
> > + goto out;
> > + }
> > + }
>
> Thanks, Claudio, for a really thorough refresher on this in private mail.
>
> But, I think this mechanism probably hooks into the wrong place. I
> don't doubt that it *functions* on s390, but I think these calls are
> misplaced. I think the end result is that no other architecture will
> have a chance to use the same hooks. They're far too s390-specific even
> for a concept that's not limited to s390.
>
> get_user_pages(FOLL_PIN) does *not* mean "the kernel will access this
> page's contents". The kmap() family is really what we use for that.
> kmap()s are often *preceded* by get_user_pages(), which is probably why
> this works for you, though.
>
> Yes, the docs do say that FOLL_PIN is for accessing the pages. But,
> there's a crucial thing that it leaves out: *WHO* will be accessing the
> pages. For Direct IO, for instance, the CPU isn't touching the page at
> all. It's always a device. Also, crucially, the page contents are
> *not* accessible from the CPU's perspective after a gup. They're not
> accessible until a kmap(). They're also not even accessible for
> *devices* after a gup. There's a _separate_ mapping process that's
> requires to make them accessible to the CPU.
I think the crucial detail is that we can fail gup(), while we cannot
ever fail kmap() or whatever else a device needs to do.
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page)
> > int __test_set_page_writeback(struct page *page, bool keep_write)
> > {
> > struct address_space *mapping = page_mapping(page);
> > - int ret;
> > + int ret, access_ret;
> >
> > lock_page_memcg(page);
> > if (mapping && mapping_use_writeback_tags(mapping)) {
> > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> > inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> > }
> > unlock_page_memcg(page);
> > + access_ret = arch_make_page_accessible(page);
> > + /*
> > + * If writeback has been triggered on a page that cannot be made
> > + * accessible, it is too late to recover here.
> > + */
> > + VM_BUG_ON_PAGE(access_ret != 0, page);
> > +
> > return ret;
> >
> > }
>
> I think this one really shows the cracks in the approach. Pages being
> swapped *don't* have get_user_pages() done on them since we've already
> got the physical page at the time writeback and aren't looking at PTEs.
I suspect this happens because FOLL_TOUCH or something later does
set_page_dirty() on the page, which then eventually gets it in
writeback.
Failing gup() ealier, should ensure the above VM_BUG never happens,
unless someone is doing dodgy things.
> Why do I care?
>
> I was looking at AMD's SEV (Secure Encrypted Virtualization) code which
> is in the kernel which shares some implementation details with the
> not-in-the-tree Intel MKTME. SEV currently has a concept of guest pages
> being encrypted and being gibberish to the host, plus a handshake to
> share guest-selected pages. Some of the side-effects of exposing the
> gibberish to the host aren't great (I think it can break cache coherency
> if a stray write occurs) and it would be nice to get better behavior.
>
> But, to get better behavior, the host kernel might need to remove pages
> from its direct map, making them inaccessible.
But for SEV we would actually need to fail this
arch_make_page_acesssible() thing, right? The encrypted guest pages
cannot be sanely accessed by the host IIRC, ever. Isn't their encryption
key linked to the phys addr of the page?
> I was hoping to reuse
> arch_make_page_accessible() for obvious reasons. But, get_user_pages()
> is not the right spot to map pages because they might not *ever* be
> accessed by the CPU, only devices.
I'm confused, why does it matter who accesses it? The point is that they
want to access it through this vaddr/mapping.