Re: [PATCH v11 07/14] mm: multi-gen LRU: exploit locality in rmap

From: Yu Zhao
Date: Tue Jun 07 2022 - 21:08:54 EST


On Tue, Jun 7, 2022 at 1:37 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Mon, Jun 6, 2022 at 9:25 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:

...

> I can't really explain why we are getting a random app/java vm crash in monkey
> test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> armv8-a machine without hardware PTE young support.
>
> Moving to ptep_clear_flush_young() in look_around can make the random
> hang disappear according to zhanyuan(Cc-ed).

This sounds too familiar -- let me ask again: was the following commit
included during the test?

07509e10dcc7 arm64: pgtable: Fix pte_accessible()

If not, it will cause exactly the problem you described. And what
about this one?

e914d8f00391 mm: fix unexpected zeroed page mapping with zram swap

Missing it also causes userspace memory corruption on Android, i.e.,
random app crashes.

> On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> after
> 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
>
> But on arm64, they are different. according to Will's comments in this
> thread which
> tried to make arm64 same with x86,
> https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1793881.html
>
> "
> This is blindly copied from x86 and isn't true for us: we don't invalidate
> the TLB on context switch. That means our window for keeping the stale
> entries around is potentially much bigger and might not be a great idea.
>
> If we roll a TLB invalidation routine without the trailing DSB, what sort of
> performance does that get you?
> "
> We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> clear PTE young? Any comments from Will?
>
> >
> > btw, lru_gen_look_around() has already included 'address', are we doing
> > pte check for 'address' twice here?

Explained in the previous reply. Hope that clarifies things.