Re: [PATCH v4 2/2] mm/gup/writeback: add callbacks for inaccessible pages
From: Claudio Imbrenda
Date: Tue Apr 14 2020 - 12:04:30 EST
On Mon, 13 Apr 2020 13:22:24 -0700
Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
> > On s390x the function is not supposed to fail, so it is ok to use a
> > WARN_ON on failure. If we ever need some more finegrained handling
> > we can tackle this when we know the details.
>
> Could you explain a bit why the function can't fail?
the concept of "making accessible" is only to make sure that accessing
the page will not trigger faults or I/O or DMA errors. in general it
does not mean freely accessing the content of the page in cleartext.
on s390x, protected guest pages can be shared. the guest has to
actively share its pages, and in that case those pages are both part of
the protected VM and freely accessible by the host.
pages that are not shared cannot be accessed by the host.
in our case "making the page accessible" means:
- if the page was shared, make sure it stays shared
- if the page was not shared, first encrypt it and then make it
accessible to the host (both operations performed securely and
atomically by the hardware)
then the page can be swapped out, or used for direct I/O (obviously if
you do I/O on a page that was not shared, you cannot expect good
things to happen, since you basically corrupt the memory of the guest).
on s390x performing I/O directly on protected pages results in (in
practice) unrecoverable I/O errors, so we want to avoid it at all costs.
accessing protected pages from the CPU triggers an exception that can
be handled (and we do handle it, in fact)
now imagine a buggy or malicious qemu process crashing the whole machine
just because it did I/O to/from a protected page. we clearly don't want
that.
> If the guest has secret data in the page, then it *can* and does fail.
no, that's the whole point of this mechanism. in fact, most of the
guest pages will be "secret data", only the few pages used for guest I/O
bounce buffers will be shared with the host
> It won't fail, though, if the host and guest agree on whether the page
> is protected.
>
> Right?
>
> > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page
> > *page, bool keep_write) inc_zone_page_state(page,
> > NR_ZONE_WRITE_PENDING); }
> > unlock_page_memcg(page);
> > + access_ret = arch_make_page_accessible(page);
> > + /*
> > + * If writeback has been triggered on a page that cannot
> > be made
> > + * accessible, it is too late to recover here.
> > + */
> > + VM_BUG_ON_PAGE(access_ret != 0, page);
> > +
> > return ret;
> >
> > }
>
> This seems like a really odd place to do this. Writeback is specific
> to block I/O. I would have thought there were other kinds of devices
> that matter, not just block devices.
well, yes and no. for writeback (block I/O and swap) this is the right
place. at this point we know that the page is present and nobody else
has started doing I/O yet, and I/O will happen soon-ish. so we make the
page accessible. there is no turning back here, unlike pinning. we
are not allowed to fail, we can't
regarding the other kinds of devices: yes, they will use pinning, which
is covered by the rest of the patch. the semantics of get page and pin
page (if the documentation has not changed meanwhile) is that the
traditional get_page is used for when the page is needed but not its
content, and pin_page is used when the content of the page is accessed.
since the only issue here is accessing the content of the page, we
don't need to make it accessible for get_page, but only for pin_page.
get_page and pin_page are allowed to fail, so in this case we return an
error code, so other architectures can potentially abort the pinning if
needed. on s390x we will never fail, for the same reasons written
above.
> Also, this patch seems odd that it only does the
> arch_make_page_accessible() half. Where's the other half where the
> page is made inaccessible?
that is very arch-specific. for s390x, you can look at this patch and
the ones immediately before/after: 214d9bbcd3a67230b932f6ce
> I assume it's OK to "leak" things like this, it's just not clear to me
> _why_ it's OK.
nothing is being leaked :)
I hope I clarified a little how this works on s390x :)
feel free to poke me again if some things are still unclear
best regards,
Claudio Imbrenda