Re: [PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush

From: Yang Shi
Date: Thu May 09 2019 - 14:24:01 EST

On 5/9/19 3:38 AM, Peter Zijlstra wrote:
On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote:
Hi all, [+Peter]
Right, mm/mmu_gather.c has a MAINTAINERS entry; use it.

Sorry for that, I didn't realize we have mmu_gather maintainers. I should run

Also added Nadav and Minchan who've poked at this issue before. And Mel,
because he loves these things :-)

Apologies for the delay; I'm attending a conference this week so it's tricky
to keep up with email.

On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote:
A few new fields were added to mmu_gather to make TLB flush smarter for
huge page by telling what level of page table is changed.

__tlb_reset_range() is used to reset all these page table state to
unchanged, which is called by TLB flush for parallel mapping changes for
the same range under non-exclusive lock (i.e. read mmap_sem). Before
commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
munmap"), MADV_DONTNEED is the only one who may do page zapping in
parallel and it doesn't remove page tables. But, the forementioned commit
may do munmap() under read mmap_sem and free page tables. This causes a
bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the
Please don't _EVER_ refer to external sources to describe the actual bug
a patch is fixing. That is the primary purpose of the Changelog.

Worse, the email you reference does _NOT_ describe the actual problem.
Nor do you.

Sure, will articulate the real bug in the commit log.

wrong page table state to architecture specific TLB flush operations.
Yikes. Is it actually safe to run free_pgtables() concurrently for a given
Yeah.. sorta.. it's been a source of 'interesting' things. This really
isn't the first issue here.

Also, change_protection_range() is 'fun' too.

So, removing __tlb_reset_range() sounds sane. This may cause more TLB
flush for MADV_DONTNEED, but it should be not called very often, hence
the impact should be negligible.

The original proposed fix came from Jan Stancek who mainly debugged this
issue, I just wrapped up everything together.
I'm still paging the nested flush logic back in, but I have some comments on
the patch below.


Reported-by: Jan Stancek <jstancek@xxxxxxxxxx>
Tested-by: Jan Stancek <jstancek@xxxxxxxxxx>
Cc: Will Deacon <will.deacon@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
Signed-off-by: Jan Stancek <jstancek@xxxxxxxxxx>
mm/mmu_gather.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
* flush by batching, a thread has stable TLB entry can fail to flush
Urgh, we should rewrite this comment while we're here so that it makes sense...
Yeah, that's atrocious. We should put the actual race in there.

* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
+ *
+ * munmap() may change mapping under non-excluse lock and also free
+ * page tables. Do not call __tlb_reset_range() for it.
- if (mm_tlb_flush_nested(tlb->mm)) {
- __tlb_reset_range(tlb);
+ if (mm_tlb_flush_nested(tlb->mm))
__tlb_adjust_range(tlb, start, end - start);
- }
I don't think we can elide the call __tlb_reset_range() entirely, since I
think we do want to clear the freed_pXX bits to ensure that we walk the
range with the smallest mapping granule that we have. Otherwise couldn't we
have a problem if we hit a PMD that had been cleared, but the TLB
invalidation for the PTEs that used to be linked below it was still pending?
That's tlb->cleared_p*, and yes agreed. That is, right until some
architecture has level dependent TLBI instructions, at which point we'll
need to have them all set instead of cleared.

Perhaps we should just set fullmm if we see that here's a concurrent
unmapper rather than do a worst-case range invalidation. Do you have a feeling
for often the mm_tlb_flush_nested() triggers in practice?
Quite a bit for certain workloads I imagine, that was the whole point of
doing it.

Anyway; am I correct in understanding that the actual problem is that
we've cleared freed_tables and the ARM64 tlb_flush() will then not
invalidate the cache and badness happens?


Because so far nobody has actually provided a coherent description of
the actual problem we're trying to solve. But I'm thinking something
like the below ought to do.

Thanks for the suggestion, will do in v2.

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..fe768f8d612e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end)
- * If there are parallel threads are doing PTE changes on same range
- * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
- * flush by batching, a thread has stable TLB entry can fail to flush
- * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
- * forcefully if we detect parallel PTE batching threads.
+ * Sensible comment goes here..
- if (mm_tlb_flush_nested(tlb->mm)) {
- __tlb_reset_range(tlb);
- __tlb_adjust_range(tlb, start, end - start);
+ if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
+ /*
+ * Since we're can't tell what we actually should have
+ * flushed flush everything in the given range.
+ */
+ tlb->start = start;
+ tlb->end = end;
+ tlb->freed_tables = 1;
+ tlb->cleared_ptes = 1;
+ tlb->cleared_pmds = 1;
+ tlb->cleared_puds = 1;
+ tlb->cleared_p4ds = 1;