Re: [PATCH 06/17] arm: mmu_gather rework

From: Peter Zijlstra
Date: Mon Feb 28 2011 - 07:21:12 EST

On Mon, 2011-02-28 at 11:59 +0000, Russell King wrote:
> On Mon, Feb 28, 2011 at 12:44:47PM +0100, Peter Zijlstra wrote:
> > Right, so the normal case is:
> >
> > unmap_region()
> > tlb_gather_mmu()
> The fullmm argument is important here as it specifies the mode.

well, unmap_region always has that 0, I've mentioned the fullmm mode
separately below, its in many way the easiest case to deal with.

> tlb_gather_mmu(, 0)
> > unmap_vmas()
> > for (; vma; vma = vma->vm_next)
> > unmao_page_range()
> > tlb_start_vma() -> flush cache range
> > zap_*_range()
> > ptep_get_and_clear_full() -> batch/track external tlbs
> > tlb_remove_tlb_entry() -> batch/track external tlbs
> > tlb_remove_page() -> track range/batch page
> > tlb_end_vma() -> flush tlb range
> tlb_finish_mmu() -> nothing
> >
> > [ for architectures that have hardware page table walkers
> > concurrent faults can still load the page tables ]
> >
> > free_pgtables()
> tlb_gather_mmu(, 1)
> > while (vma)
> > unlink_*_vma()
> > free_*_range()
> > *_free_tlb()
> > tlb_finish_mmu()
> tlb_finish_mmu() -> flush tlb mm
> >
> > free vmas
> So this is all fine. Note that we *don't* use the range stuff here.
> > Now, if we want to track ranges _and_ have hardware page table walkers
> > (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> > flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> > than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> > because the hardware walker could have re-populated the cache after
> > clearing the PTEs but before freeing the page tables.
> No. The hardware walker won't re-populate the TLB after the page table

Never said it would repopulate the TLB, just said it could repopulate
your cache thing and that it might still walk the page tables.

> entries have been cleared - where would it get this information from if
> not from the page tables?
> > What ARM does is it retains the last vma pointer and tracks
> > pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> > hacky.
> It may be hacky but then the TLB shootdown interface is hacky too. We
> don't keep the vma around to re-use after tlb_end_vma() - if you think
> that then you misunderstand what's going on. The vma pointer is kept
> around as a cheap way of allowing tlb_finish_mmu() to distinguish
> between the unmap_region() mode and the shift_arg_pages() mode.

Well, you most certainly use it in the unmap_region() case above.
tlb_end_vma() will do a flush_tlb_range(), but then your
__pte_free_tlb() will also track range and the tlb_finish_mmu() will
then again issue a flush_tlb_range() using the last vma pointer.

You argued you need that second tlb flush fo flush your cached level1
entries for your LPAE mode (btw arm sucks for having all those docs

> > Mostly because of shift_arg_pages(), where we have:
> >
> > shift_arg_pages()
> > tlb_gather_mmu()
> tlb_gather_mmu(, 0)
> > free_*_range()
> > tlb_finish_mmu()
> tlb_finish_mmu() does nothing without the ARM change.
> tlb_finish_mmu() -> flush_tlb_mm() with the ARM change.
> And this is where the bug was - these page table entries could find
> their way into the TLB and persist after they've been torn down.

Sure, I got that, you punt and do a full mm tlb invalidate (IA64 and SH
seem similarly challenged).

> > For which ARM now punts and does a full tlb invalidate (no vma pointer).
> > But also because it drags along that vma pointer, which might not at all
> > match the range its actually going to invalidate (and hence its vm_flags
> > might not accurately reflect things -- at worst more expensive than
> > needed).
> Where do you get that from? Where exactly in the above code would the
> VMA pointer get set? In this case, it will be NULL, so we do a
> flush_tlb_mm() for this case. We have to - we don't have any VMA to
> deal with at this point.

unmap_region()'s last tlb_start_vma(), with __pte_free_tlb() tracking
range will then get tlb_finish_mmu() to issue a second

> > The reason I wanted flush_tlb_range() to take an mm_struct and not the
> > current vm_area_struct is because we can avoid doing the
> > flush_tlb_range() from tlb_end_vma() and delay the thing until
> > tlb_finish_mmu() without having to resort to such games as above. We
> > could simply track the full range over all VMAs and free'd page-tables
> > and do one range invalidate.
> No. That's stupid. Consider the case where you have to loop one page
> at a time over the range (we do on ARM.) If we ended up with your
> suggestion above, that means we could potentially have to loop 4K at a
> time over 3GB of address space. That's idiotic when we have an
> instruction which can flush the entire TLB for a particular thread.

*blink* so you've implemented flush_tlb_range() as an iteration of
single page invalidates?

x86 could have done the same I think, instead we chose to implement it
as a full mm invalidate simply because that's way cheaper in general.

You could also put a threshold in, if (end-start) >> PAGE_SHIFT > n,
flush everything if you want.

Anyway, I don't see how that's related to the I-TLB thing?
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at