Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
From: Edgecombe, Rick P
Date: Thu Jan 26 2023 - 15:16:55 EST
On Thu, 2023-01-26 at 09:57 +0100, David Hildenbrand wrote:
> On 25.01.23 19:43, Edgecombe, Rick P wrote:
> > On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
> > > > > Roughly speaking: if we abstract it that way and get all of
> > > > > the
> > > > > "how
> > > > > to
> > > > > set it writable now?" out of core-MM, it not only is cleaner
> > > > > and
> > > > > less
> > > > > error prone, it might even allow other architectures that
> > > > > implement
> > > > > something comparable (e.g., using a dedicated HW bit) to
> > > > > actually
> > > > > reuse
> > > > > some of that work. Otherwise most of that "shstk" is really
> > > > > just
> > > > > x86
> > > > > specific ...
> > > > >
> > > > > I guess the only cases we have to special case would be page
> > > > > pinning
> > > > > code where pte_write() would indicate that the PTE is
> > > > > writable
> > > > > (well,
> > > > > it
> > > > > is, just not by "ordinary CPU instruction" context directly):
> > > > > but
> > > > > you
> > > > > do
> > > > > that already, so ... :)
> > > > >
> > > > > Sorry for stumbling over that this late, I only started
> > > > > looking
> > > > > into
> > > > > this when you CCed me on that one patch.
> > > >
> > > > Sorry for not calling more attention to it earlier. Appreciate
> > > > your
> > > > comments.
> > > >
> > > > Previously versions of this series had changed some of these
> > > > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
> > > > vma.
> > > > This way an x86 implementation could use the VM_SHADOW_STACK
> > > > vma
> > > > flag
> > > > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
> > > > feedback
> > > > was that in some of these code paths "maybe" isn't really an
> > > > option, it
> > > > *needs* to make it writable. Even though the logic was the
> > > > same,
> > > > the
> > > > name of the function made it look wrong.
> > > >
> > > > But another option could be to change pte_mkwrite() to take a
> > > > vma.
> > > > This
> > > > would save using another software bit on x86, but instead
> > > > requires
> > > > a
> > > > small change to each arch's pte_mkwrite().
> > >
> > > I played with that idea shortly as well, but discarded it. I was
> > > not
> > > able to convince myself that it wouldn't be required to pass in
> > > the
> > > VMA
> > > as well for things like pte_dirty(), pte_mkdirty(), pte_write(),
> > > ...
> > > which would end up fairly ugly (or even impossible in thing slike
> > > GUP-fast).
> > >
> > > For example, I wonder how we'd be handling stuff like
> > > do_numa_page()
> > > cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
> > > either
> > > call might set the PTE writable and maintain dirty bit ...
> >
> > pte_modify() is handled like this currently:
> >
> >
https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@xxxxxxxxx/
> >
> > There has been a couple iterations on that. The current solution is
> > to
> > do the Dirty->SavedDirty fixup if needed after the new prots are
> > added.
> >
> > Of course pte_modify() can't know whether you are are attempting to
> > create a shadow stack PTE with the prot you are passing in. But the
> > callers today explicitly call pte_mkwrite() after filling in the
> > other
> > bits with pte_modify().
>
> See below on my MAP_PRIVATE vs. MAP_SHARED comment.
Yep, MAP_SHARED support was dropped with the reboot of the series. It
did have some problems IIRC.
Now shadow stack memory creation is tightly controlled. Either created
via special syscall or automatically with a new thread.
>
> > Today this patch causes the pte_mkwrite() to be
> > skipped and another fault may be required in the mprotect() and
> > numa
> > cases, but if we change pte_mkwrite() to take a VMA we can just
> > make it
> > shadow stack to start.
> >
> > It might be worth mentioning, there was a suggestion in the past to
> > try
> > to have the shadow stack bits come out of vm_get_page_prot(), but
> > MM
> > code would then try to map the zero page as (shadow stack) writable
> > when there was a normal (non-shadow stack) read access. So I had to
> > abandon that approach and rely on explicit calls to
> > pte_mkwrite/shstk()
> > to make it shadow stack.
>
> Thanks, do you have a pointer?
I never posted it because it didn't work out. This was the comment that
prompted the exploration in that direction:
https://lore.kernel.org/lkml/8065c333-0911-04a2-f91e-7c2e0cc7ec51@xxxxxxxxx/
Shadow stack memory also used to not be VM_WRITE (VM_SHADOW_STACK
only), but this was changed for other reasons. In v2 there were some
updates to how shadow stack memory was handled, and the cover letter
had a writeup of the reasons and general design:
https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@xxxxxxxxx/
>
> >
> > >
> > > Having that said, maybe it could work with only a single saved-
> > > dirty
> > > bit
> > > and passing in the VMA for pte_mkwrite() only.
> > >
> > > pte_wrprotect() would detect "writable=0,dirty=1" and move the
> > > dirty
> > > bit
> > > to the soft-dirty bit instead, resulting in
> > > "writable=0,dirty=0,saved-dirty=1",
> > >
> > > pte_dirty() would return dirty==1||saved-dirty==1.
> > >
> > > pte_mkdirty() would set either set dirty=1 or saved-dirty=1,
> > > depending
> > > on the writable bit.
> > >
> > > pte_mkclean() would clean both bits.
> > >
> > > pte_write() would detect "writable == 1 || (writable==0 &&
> > > dirty==1)"
> > >
> > > pte_mkwrite() would act according to the VMA, and in addition,
> > > merge
> > > the
> > > saved-dirty bit into the dirty bit.
> > >
> > > pte_modify() and mk_pte() .... would require more thought ...
> >
> > Not sure I'm following what the mk_pte() problem would be. You mean
> > if
> > Write=0,Dirty=1 is manually added to the prot?
> >
> > Shouldn't people generally use the pte_mkwrite() helpers unless
> > they
> > are drawing from a prot that was already created with the helpers
> > or
> > vm_get_page_prot()?
>
> pte_mkwrite() is mostly only used (except for writenotify ...) for
> MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory,
> vma->vm_page_prot in a VM_WRITE mapping already contains the write
> permissions. pte_mkwrite() is not necessary (again, unless
> writenotify
> is active).
Oh, interesting.
>
> I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you
> didn't stumble over that issue yet? Because I don't see how it could
> work with MAP_SHARED VMAs.
Yep, it doesn't support MAP_SHARED.
>
>
> The other thing I had in mind was that we have to make sure that
> we're
> not accidentally setting "Write=0,Dirty=1" in mk_pte() /
> pte_modify().
>
> Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect
> using pte_modify(), we have to make sure to move the dirty bit to
> the
> saved_dirty bit.
For the mk_pte() case, I don't think a Write=0,Dirty=1 prot could come
from anywhere. I guess the MAP_SHARED case is a little less bounded. We
could maybe add a warning for this case.
For the pte_modify() case, this does happen. There are two scenarios
considered:
1. A Write=0,Dirty=0 PTE is made dirty. This can't happen today as
Dirty is filtered via _PAGE_CHG_MASK. Basically pte_modify() doesn't
support it.
2. A Write=1,Dirty=1 PTE gets write protected. This does happen because
the Write=0 prot comes from protection_map, and pte_modify() would
leave the Dirty=1 bit alone. The main case I know of is mprotect(). It
is handled by changes to pte_modify() by doing the Dirty->SoftDirty
fixup if needed.
So pte_modify()s job should not be too tricky. What you can't do with
it though, is create shadow stack PTEs. But it is ok for our uses
because of the explicit mkwrite().
>
> > I think they can't manually create prot's from bits
> > in core mm code, right? And x86 arch code already has to be aware
> > of
> > shadow stack. It's a bit of an assumption I guess, but I think
> > maybe
> > not too crazy of one?
>
> I think that's true. Arch code is supposed to deal with that IIRC.
>
> >
> > >
> > >
> > > Further, ptep_modify_prot_commit() might have to be adjusted to
> > > properly
> > > flush in all relevant cases IIRC.
> >
> > Sorry, I'm not following. Can you elaborate? There is an adjustment
> > made in pte_flags_need_flush().
>
> Note that I did not fully review all bits of this patch set, just
> throwing out what was on my mind. If already handled, great.
>
> >
> > >
> > > >
> > > > x86's pte_mkwrite() would then be pretty close to
> > > > maybe_mkwrite(),
> > > > but
> > > > maybe it could additionally warn if the vma is not writable. It
> > > > also
> > > > seems more aligned with your changes to stop taking hints from
> > > > PTE
> > > > bits
> > > > and just look at the VMA? (I'm thinking about the dropping of
> > > > the
> > > > dirty
> > > > check in GUP and dropping pte_saved_write())
> > >
> > > The soft-shstk bit wouldn't be a hint, it would be logically
> > > changing
> > > the "type" of the PTE such that any other PTE functions can do
> > > the
> > > right
> > > thing without having to consume the VMA.
> >
> > Yea, true.
> >
> > Thanks for your comments and ideas here, I'll give the:
> > pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> > ...solution a try.
>
> Good!
>