Re: [PATCH 6/9] x86/mm: Introduce ptep_set_wrprotect_flush and related functions

From: Andy Lutomirski
Date: Thu Jun 07 2018 - 21:00:03 EST


On Thu, Jun 7, 2018 at 1:30 PM Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> wrote:
>
> On 06/07/2018 09:24 AM, Andy Lutomirski wrote:
>
> >> +static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
> >> + unsigned long addr, pte_t *ptep)
> >> +{
> >> + bool rw;
> >> +
> >> + rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> >> + if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
> >> + struct mm_struct *mm = vma->vm_mm;
> >> + pte_t pte;
> >> +
> >> + if (rw && (atomic_read(&mm->mm_users) > 1))
> >> + pte = ptep_clear_flush(vma, addr, ptep);
> > Why are you clearing the pte?
>
> I found my notes on the subject. :)
>
> Here's the sequence that causes the problem. This could happen any time
> we try to take a PTE from read-write to read-only. P==Present, W=Write,
> D=Dirty:
>
> CPU0 does a write, sees PTE with P=1,W=1,D=0
> CPU0 decides to set D=1
> CPU1 comes in and sets W=0
> CPU0 does locked operation to set D=1
> CPU0 sees P=1,W=0,D=0
> CPU0 sets back P=1,W=0,D=1
> CPU0 loads P=1,W=0,D=1 into the TLB
> CPU0 attempts to continue the write, but sees W=0 in the TLB and a #PF
> is generated because of the write fault.
>
> The problem with this is that we end up with a shadowstack-PTE
> (Write=0,Dirty=1) where we didn't want one. This, unfortunately,
> imposes extra TLB flushing overhead on the R/W->R/O transitions that
> does not exist before shadowstack enabling.
>

So what exactly do the architects want the OS to do? AFAICS the only
valid ways to clear the dirty bit are:

--- Choice 1 ---
a) Set P=0.
b) Flush using an IPI
c) Read D (so we know if the page was actually dirty)
d) Set P=1,W=0,D=0

and we need to handle spurious faults that happen between steps (a)
and (c). This isn't so easy because the straightforward "is the fault
spurious" check is going to think it's *not* spurious.

--- Choice 2 ---
a) Set W=0
b) flush
c) Test and clear D

and we need to handle the spurious fault between b and c. At least
this particular spurious fault is easier to handle since we can check
the error code.

But surely the right solution is to get the architecture team to see
if they can fix the dirty-bit-setting algorithm or, even better, to
look and see if the dirty-bit-setting algorithm is *already* better
and just document it. If the cpu does a locked set-bit on D in your
example, the CPU is just being silly. The CPU should make the whole
operation fully atomic: when trying to write to a page that's D=0 in
the TLB, it should re-walk the page tables and, atomically, load the
PTE and, if it's W=1,D=0, set D=1. I'd honestly be a bit surprised if
modern CPUs don't already do this.

(Hmm. If the CPUs were that smart, then we wouldn't need a flush at
all in some cases. If we lock cmpxchg to change W=1,D=0 to W=0,D=0,
then we know that no other CPU can subsequently write the page without
re-walking, and we don't need to flush.)

Can you ask the architecture folks to clarify the situation? And, if
your notes are indeed correct, don't we need code to handle spurious
faults?

--Andy