Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
From: Nadav Amit
Date: Tue Jan 05 2021 - 16:23:53 EST
> On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote:
>>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>>>
>>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
>>>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")
>>>
>>> Targeting a backport down to 2013 when nothing could wrong in practice
>>> with page_mapcount sounds backwards and unnecessarily risky.
>>>
>>> In theory it was already broken and in theory
>>> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the
>>> previous code of 2013 is completely wrong, but in practice the code
>>> from 2013 worked perfectly until Aug 21 2020.
>>
>> Well… If you consider the bug that Will recently fixed [1], then soft-dirty
>> was broken (for a different, yet related reason) since 0758cd830494
>> ("asm-generic/tlb: avoid potential double flush”).
>>
>> This is not to say that I argue that the patch should be backported to 2013,
>> just to say that memory corruption bugs can be unnoticed.
>>
>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@xxxxxxxxxx/
>
> Is this a fix or a cleanup?
>
> The above is precisely what I said earlier that tlb_gather had no
> reason to stay in clear_refs and it had to use inc_tlb_flush_pending
> as mprotect, but it's not a fix? Is it? I suggested it as a pure
> cleanup. So again no backport required. The commit says fix this but
> it means "clean this up".
It is actually a fix. I think the commit log is not entirely correct and
should include:
Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush”).
Since 0758cd830494, calling tlb_finish_mmu() without any previous call to
pte_free_tlb() and friends does not flush the TLB. The soft-dirty bug
producer that I sent fails without this patch of Will.
> So setting a Fixes back to 2013 that would go mess with all stable
> tree by actively backporting a performance regressions to clear_refs
> that can break runtime performance to fix a philosophical issue that
> isn't even a theoretical issue, doesn't sound ideal to me.
Point taken.
>
>> To summarize my action items based your (and others) feedback on both
>> patches:
>>
>> 1. I will break the first patch into two different patches, one with the
>> “optimization” for write-unprotect, based on your feedback. It will not
>> be backported.
>>
>> 2. I will try to add a patch to avoid TLB flushes on
>> userfaultfd-writeunprotect. It will also not be backported.
>
> I think 1 and 2 above could be in the same patch. Mixing an uffd-wp optimization with the
> actual fix the memory corruption wasn't ideal, but doing the same
> optimization to both wrprotect and un-wrprotect in the same patch
> sounds ideal. The commit explanation would be identical and it can be
> de-duplicated this way.
>
> I'd suggest to coordinate with Peter on that, since I wasn't planning
> to work on this if somebody else offered to do it.
>
>> 3. Let me know if you want me to use your version of testing
>> mm_tlb_flush_pending() and conditionally flushing, wait for new version fro
>> you or Peter or to go with taking mmap_lock for write.
>
> Yes, as you suggested, I'm trying to clean it up and send a new
> version.
>
> Ultimately my view is there are an huge number of cases where
> mmap_write_lock or some other heavy lock that will require
> occasionally to block on I/O is beyond impossible not to take. Even
> speculative page faults only attack the low hanging anon memory and
> there's still MADV_DONTNEED/FREE and other stuff that may have to run
> in parallel with UFFDIO_WRITEPROTECT and clear_refs, not just page
> faults.
>
> As a reminder: the only case when modifying the vmas is allowed under
> mmap_read_lock (I already tried once to make it safer by adding
> READ_ONCE/WRITE_ONCE but wasn't merged see
> https://www.spinics.net/lists/linux-mm/msg173420.html), is when
> updating vm_end/vm_start in growsdown/up, where the vma is extended
> down or up in the page fault under only mmap_read_lock.
>
> I'm doing all I can to document and make it more explicit the
> complexity we deal with in the code (as well as reducing the gcc
> dependency in emitting atomic writes to update vm_end/vm_start, as we
> should do in ptes as well in theory). As you may notice in the
> feedback from the above submission not all even realized that we're
> modifying vmas already under mmap_read_lock. So it'd be great to get
> help to merge that READ_ONCE/WRITE_ONCE cleanup that is still valid
> and pending for merge but it needs forward porting.
>
> This one, for both soft dirty and uffd_wrprotect, is a walk in the
> park to optimize in comparison to the vma modifications.
I am sure you are right.
>
> From my point of view in fact, doing the tlb flush or the wait on the
> atomic to be released, does not increase kernel complexity compared to
> what we had until now.
It is also about performance due to unwarranted TLB flushes.
I think avoiding them requires some finer granularity detection of pending
page-faults. But anyhow, I still owe some TLB optimization patches (and v2
for userfaultfd+iouring) before I can even look at that.
In addition, as I stated before, having some clean interfaces that tell
whether a TLB flush is needed or not would be helpful and simpler to follow.
For instance, we can have is_pte_prot_demotion(oldprot, newprot) to figure
out whether a TLB flush is needed in change_pte_range() and avoid
unnecessary flushes when unprotecting pages with either mprotect() or
userfaultfd.