Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping

From: Minchan Kim
Date: Thu Apr 16 2015 - 04:30:22 EST


On Thu, Apr 16, 2015 at 09:07:22AM +0100, Mel Gorman wrote:
> On Thu, Apr 16, 2015 at 03:38:26PM +0900, Minchan Kim wrote:
> > Hello Mel,
> >
> > On Wed, Apr 15, 2015 at 10:28:55PM +0100, Mel Gorman wrote:
> > > On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote:
> > > > On Wed, 15 Apr 2015, Rik van Riel wrote:
> > > > > On 04/15/2015 06:42 AM, Mel Gorman wrote:
> > > > > > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > > > > > recently accessed by other CPUs. There are many circumstances where this
> > > > > > happens but the obvious one is kswapd reclaiming pages belonging to a
> > > > > > running process as kswapd and the task are likely running on separate CPUs.
> > > > > >
> > > > > > > On small machines, this is not a significant problem but as machines
> > > > > > > get larger with more cores and more memory, the cost of these IPIs can
> > > > > > be high. This patch uses a structure similar in principle to a pagevec
> > > > > > to collect a list of PFNs and CPUs that require flushing. It then sends
> > > > > > one IPI to flush the list of PFNs. A new TLB flush helper is required for
> > > > > > this and one is added for x86. Other architectures will need to decide if
> > > > > > batching like this is both safe and worth the memory overhead. Specifically
> > > > > > the requirement is;
> > > > > >
> > > > > > If a clean page is unmapped and not immediately flushed, the
> > > > > > architecture must guarantee that a write to that page from a CPU
> > > > > > with a cached TLB entry will trap a page fault.
> > > > > >
> > > > > > This is essentially what the kernel already depends on but the window is
> > > > > > much larger with this patch applied and is worth highlighting.
> > > > >
> > > > > This means we already have a (hard to hit?) data corruption
> > > > > issue in the kernel. We can lose data if we unmap a writable
> > > > > but not dirty pte from a file page, and the task writes before
> > > > > we flush the TLB.
> > > >
> > > > I don't think so. IIRC, when the CPU needs to set the dirty bit,
> > > > it doesn't just do that in its TLB entry, but has to fetch and update
> > > > the actual pte entry - and at that point discovers it's no longer
> > > > valid so traps, as Mel says.
> > > >
> > >
> > > This is what I'm expecting i.e. clean->dirty transition is write-through
> > > to the PTE which is now unmapped and it traps. I'm assuming there is an
> > > architectural guarantee that it happens but could not find an explicit
> > > statement in the docs. I'm hoping Dave or Andi can check with the relevant
> > > people on my behalf.
> >
> > A dumb question. It's not related to your patch but MADV_FREE.
> >
> > clean->dirty transition is *atomic* as well as write-through?
>
> This is the TLB cache clean->dirty transition so it's not 100% clear what you
> are asking. It both needs to be write-through and the TLB updates must happen
> before the actual data write to cache or memory and it must be ordered.

Sorry for not being clear. I will try again.

In try_to_unmap_one,


pteval = ptep_clear_flush(vma, address, pte);
{
        pte = ptep_get_and_clear(mm, address, ptep);
        <-------------- An application writes on another CPU.
        flush_tlb_page(vma, address);
}

/* Move the dirty bit to the physical page now the pte is gone. */
dirty = pte_dirty(pteval);
if (dirty)
        set_page_dirty(page);
...


In the code above, at the marked point ptep_clear_flush has only done the
xchg in ptep_get_and_clear, which zeroes the pte and returns the old pte
value; the TLB has not been flushed yet.
Let's assume the old pte value does not have the dirty bit set (i.e., it
was clean). If an application on another CPU writes to that memory at the
same time, what happens?
I mean that (clearing the pte and returning the old value) and (the CPU
setting the dirty bit by itself) should be mutually exclusive, so either
the application on the other CPU takes a page fault or we see the dirty
bit in the returned value.
Is that guaranteed?
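
To make the two outcomes concrete, here is a small user-space model of
the guarantee I am asking about. It is only a sketch in C11 atomics, not
kernel code: the MODEL_* bit values and the model_* function names are
made up for illustration, and model_hw_write() encodes the *assumed*
hardware behaviour (dirty bit set by an atomic read-modify-write on the
in-memory pte), which is exactly the part I would like confirmed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for this model only, not a real architecture's. */
#define MODEL_PTE_PRESENT ((uint64_t)1 << 0)
#define MODEL_PTE_DIRTY   ((uint64_t)1 << 6)

/* Software side: models an xchg-based ptep_get_and_clear(). */
static uint64_t model_get_and_clear(_Atomic uint64_t *ptep)
{
        return atomic_exchange(ptep, 0);
}

/*
 * Hardware side (assumption): a write through a stale, clean TLB entry
 * must set the dirty bit in the in-memory pte atomically.
 * Returns true if the write may proceed, false if it must fault.
 */
static bool model_hw_write(_Atomic uint64_t *ptep)
{
        uint64_t old = atomic_load(ptep);

        do {
                if (!(old & MODEL_PTE_PRESENT))
                        return false;   /* pte already cleared: fault */
                /* On failure, 'old' is reloaded and the check repeats. */
        } while (!atomic_compare_exchange_weak(ptep, &old,
                                               old | MODEL_PTE_DIRTY));
        return true;
}
```

If the hardware behaves like this model, only two orderings are possible:
the write lands first and model_get_and_clear() returns a value with the
dirty bit set, or the clear lands first and model_hw_write() faults. A
clean old value together with a successful write cannot happen.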

>
> > I'm really confused.
> > It seems most arches use xchg for ptep_get_and_clear, so it's
> > atomic, but arches that do not define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> > will use the non-atomic version in include/asm-generic/pgtable.h.
> >
> > #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
> > static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> >                                        unsigned long address,
> >                                        pte_t *ptep)
> > {
> >         pte_t pte = *ptep;
> >         pte_clear(mm, address, ptep);
> >         return pte;
> > }
> > #endif
> >
>
> And if they are using this, they need to be ok that it's not atomic but
> it's not clear what you are asking.
>
> > I hope they have their own lock or something to protect against a race
> > between software and hardware (i.e., the CPU setting the dirty bit by itself).
> >
>
> Or they're UP.

Yep.

>
> --
> Mel Gorman
> SUSE Labs

--
Kind regards,
Minchan Kim