Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

From: Sean Christopherson
Date: Thu Feb 23 2023 - 14:12:24 EST


On Thu, Feb 23, 2023, Yu Zhao wrote:
> On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > >   kswapd (MGLRU before)
> > >     100.00%  balance_pgdat
> > >       100.00%  shrink_node
> > >         100.00%  shrink_one
> > >           99.97%  try_to_shrink_lruvec
> > >             99.06%  evict_folios
> > >               97.41%  shrink_folio_list
> > >                 31.33%  folio_referenced
> > >                   31.06%  rmap_walk_file
> > >                     30.89%  folio_referenced_one
> > >                       20.83%  __mmu_notifier_clear_flush_young
> > >                         20.54%  kvm_mmu_notifier_clear_flush_young
> > >   =>                      19.34%  _raw_write_lock
> > >
> > >   kswapd (MGLRU after)
> > >     100.00%  balance_pgdat
> > >       100.00%  shrink_node
> > >         100.00%  shrink_one
> > >           99.97%  try_to_shrink_lruvec
> > >             99.51%  evict_folios
> > >               71.70%  shrink_folio_list
> > >                 7.08%  folio_referenced
> > >                   6.78%  rmap_walk_file
> > >                     6.72%  folio_referenced_one
> > >                       5.60%  lru_gen_look_around
> > >   =>                    1.53%  __mmu_notifier_test_clear_young
> >
> > Do you happen to know how much of the improvement is due to batching, and how
> > much is due to using a lockless walk?
>
> No. I have three benchmarks running at the moment:
> 1. Windows SQL server guest on x86 host,
> 2. Apache Spark guest on arm64 host, and
> 3. Memcached guest on ppc64 host.
>
> If you are really interested in that, I can reprioritize -- I need to
> stop 1) and use that machine to get the number for you.

After looking at the "MGLRU before" stack again, it's definitely worth getting
those numbers. The "before" isn't just taking mmu_lock, it's taking mmu_lock for
write _and_ flushing remote TLBs on _every_ PTE. I suspect the batching is a
tiny percentage of the overall win (might be larger with RETPOLINE and friends),
and that the bulk of the improvement comes from avoiding the insanity of
kvm_mmu_notifier_clear_flush_young().
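
To make the two factors above concrete, here is a toy userspace cost model. It
is only a sketch and not kernel code: age_ptes(), struct aging_cost, and the
"accessed" byte array are hypothetical names invented for illustration. It
counts lock acquisitions and remote flushes under the simplification described
above, i.e. one write-lock acquisition and one remote TLB flush per PTE when
there is no batching, and lets the batch size, the locking, and the flushing be
toggled independently.

```c
#include <stddef.h>

/* Toy counters for illustration only; these are not kernel structures. */
struct aging_cost {
	unsigned long lock_acquisitions; /* times the modelled lock was taken */
	unsigned long remote_flushes;    /* modelled remote TLB flushes */
	unsigned long young_found;       /* entries whose accessed flag was set */
};

/*
 * Test and clear the accessed flag of n entries, 'batch' entries at a time.
 * take_lock: model one lock acquisition per batch.
 * flush:     model one remote flush per entry visited (an assumption that
 *            mirrors the "flush on every PTE" description, not measured data).
 */
static struct aging_cost age_ptes(unsigned char *accessed, size_t n,
				  size_t batch, int take_lock, int flush)
{
	struct aging_cost cost = {0, 0, 0};

	if (batch == 0)
		batch = 1;

	for (size_t start = 0; start < n; start += batch) {
		size_t end = (n - start < batch) ? n : start + batch;

		if (take_lock)
			cost.lock_acquisitions++;

		for (size_t i = start; i < end; i++) {
			if (accessed[i]) {
				accessed[i] = 0;
				cost.young_found++;
			}
			if (flush)
				cost.remote_flushes++;
		}
	}
	return cost;
}
```

In this model, age_ptes(a, 128, 1, 1, 1) stands in for the "before" path,
age_ptes(a, 128, 64, 1, 0) for batching while still taking the lock, and
age_ptes(a, 128, 64, 0, 0) for a batched lockless walk. The counters say
nothing about real cycle costs; they only show which term each change removes.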

Speaking of which, what would it take to drop mmu_notifier_clear_flush_young()
entirely? I.e. why can MGLRU tolerate stale information but !MGLRU cannot? If
we simply deleted mmu_notifier_clear_flush_young() and used mmu_notifier_clear_young()
instead, would anyone notice, let alone care?

> > > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> > >  	if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> > >  		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> > >
> > > +	if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > > +		caps |= BIT(LRU_GEN_SPTE_WALK);
> >
> > As alluded to in patch 1, unless batching the walks is somehow _worse_ than
> > using the existing mmu_notifier_clear_flush_young() even when KVM does _not_
> > support a lockless walk, I think batching the calls should be conditional only
> > on LRU_GEN_SPTE_WALK. Or if we want to avoid batching when there are no
> > mmu_notifier listeners, probe mmu_notifiers. But don't call into KVM directly.
>
> I'm not sure I fully understand. Let's present the problem on the MM
> side: assuming KVM supports lockless walks, batching can still be
> worse (very unlikely), because GFNs can exhibit no memory locality at
> all. So this option allows userspace to disable batching.

I'm asking the opposite. Is there a scenario where batching+lock is worse than
!batching+lock? If not, then don't make batching depend on lockless walks.
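
For reference, the three policies under discussion can be written down as
predicates. This is a standalone sketch, not a proposed patch: the function
names are hypothetical, the bit position chosen for LRU_GEN_SPTE_WALK is an
assumption, and the two bool parameters stand in for whatever the real checks
would be (KVM lockless-walk support and the presence of mmu_notifier
listeners).

```c
#include <stdbool.h>

#define BIT(n)			(1UL << (n))
#define LRU_GEN_SPTE_WALK	3	/* assumed bit position, illustration only */

/* As in the quoted hunk: the cap and KVM lockless-walk support are both required. */
static bool batch_as_posted(unsigned long enabled_caps, bool kvm_lockless)
{
	return kvm_lockless && (enabled_caps & BIT(LRU_GEN_SPTE_WALK)) != 0;
}

/* Batch whenever the cap is enabled, regardless of lockless support. */
static bool batch_on_cap_only(unsigned long enabled_caps)
{
	return (enabled_caps & BIT(LRU_GEN_SPTE_WALK)) != 0;
}

/* Variant: also skip batching when nothing is listening on mmu_notifiers. */
static bool batch_if_listeners(unsigned long enabled_caps, bool has_listeners)
{
	return has_listeners && (enabled_caps & BIT(LRU_GEN_SPTE_WALK)) != 0;
}
```

The only input on which the first two differ is "cap enabled, no lockless
support", which is exactly the batching+lock case asked about above.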

> I fully understand why you don't want MM to call into KVM directly. Are
> there no acceptable ways to set up a clear interface between MM and KVM
> other than the MMU notifier?

There are several options I can think of, but before we go spend time designing
the best API, I'd rather figure out if we care in the first place.