Re: [PATCH v1 mmotm] mm/mprotect: try avoiding write faults for exclusive anonynmous pages when changing protection

From: David Hildenbrand
Date: Mon Apr 04 2022 - 09:04:52 EST


On 01.04.22 21:15, Nadav Amit wrote:
> [ +Rick ]
>
>> On Apr 1, 2022, at 3:13 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
>> can try mapping anonymous pages writable if they are exclusive,
>> the PTE is already dirty, and no special handling applies. Mapping the
>> PTE writable is essentially the same thing the write fault handler would do
>> in this case.
>
> In general I am all supportive for such a change.
>
> I do have some mostly-minor concerns.

Hi Nadav,

thanks a lot for your review!

>
>>
>> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t pte,
>> + unsigned long cp_flags)
>> +{
>> + struct page *page;
>> +
>> + if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
>> + /*
>> + * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
>> + * without MM_CP_DIRTY_ACCT, there is nothing to do.
>> + */
>> + return false;
>> +
>> + if (!(vma->vm_flags & VM_WRITE))
>> + return false;
>> +
>> + if (pte_write(pte) || pte_protnone(pte) || !pte_dirty(pte))
>> + return false;
>
> If pte_write() is already try then return false? I understand you want
> to do so because the page is already writable, but it is confusing.


I thought about just doing outside of the function

if ((vma->vm_flags & VM_WRITE) && !pte_write(pte) &&
can_change_pte_writable()...


I refrained from doing so because the sequence of checks might be
sub-optimal. But most probably we don't really care about that and it
might make the code easier to grasp.

Would that make it clearer?

>
> In addition, I am not sure about the pte_dirty() check is really robust.
> I mean I think it is ok, but is there any issue with shadow-stack?

Judging that it's already used that way for VMAs with dirty tracking, I
assume it's ok. Without checking that the PTE is dirty, we'd have to do a:

pte_mkwrite(pte_mkwrite(ptent));

Which would set the pte and consequently the page dirty, although there
might not even be a write access. That's what we want to avoid here.

>
> And this also assumes the kernel does not clear the dirty bit without
> clearing the present, as otherwise the note in Intel SDM section 4.8
> ("Accessed and Dirty Flags”) will be relevant and dirty bit might be
> set unnecessarily. I think it is ok.

Yeah, I think so as well.

>
>> +
>> + /* Do we need write faults for softdirty tracking? */
>> + if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !pte_soft_dirty(pte) &&
>> + (vma->vm_flags & VM_SOFTDIRTY))
>
> If !IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) then VM_SOFTDIRTY == 0. So I do not
> think the IS_ENABLED() is necessary (unless you think it is clearer this
> way).

Right, we can just do

if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))

and it should get fully optimized out. Thanks!

>
>> + return false;
>> +
>> + /* Do we need write faults for uffd-wp tracking? */
>> + if (userfaultfd_pte_wp(vma, pte))
>> + return false;
>> +
>> + if (!(vma->vm_flags & VM_SHARED)) {
>> + /*
>> + * We can only special-case on exclusive anonymous pages,
>> + * because we know that our write-fault handler similarly would
>> + * map them writable without any additional checks while holding
>> + * the PT lock.
>> + */
>> + page = vm_normal_page(vma, addr, pte);
>
> I guess we cannot call vm_normal_page() twice, once for prot_numa and once
> here, in practice...

I guess we could, but it doesn't necessarily make the code easier to
read :) And we want to skip protnone either way.

>
>> + if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>> + return false;
>> + }
>> +
>> + return true;
>> +}
>
> Note that there is a small downside to all of that. Assume you mprotect()
> a single page from RO to RW and you have many threads.
>
> With my pending patch you would avoid the TLB shootdown (and get a PF).
> With this patch you would get a TLB shootdown and save the PF. IOW, I
> think it is worthy to skip the shootdown as well in such a case and
> instead flush the TLB on spurious page-faults. But I guess that’s for
> another patch.

Just so I understand correctly: your optimization avoids the flush when
effectively, nothing changed (R/O -> R/O).

And the optimization for this case here would be, to avoid the TLB flush
when similarly not required (R/O -> R/W).

Correct?

--
Thanks,

David / dhildenb