Re: [PATCH] mm: vmscan: split khugepaged stats from direct reclaim stats

From: Johannes Weiner
Date: Mon Oct 31 2022 - 12:00:12 EST


On Fri, Oct 28, 2022 at 10:41:17AM -0700, Yosry Ahmed wrote:
> On Fri, Oct 28, 2022 at 7:39 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > pgscan_user: User-requested reclaim. Could be confusing if we ever
> > have an in-kernel proactive reclaim driver - unless that would then go
> > to another counter (new or kswapd).
> >
> > pgscan_ext: Reclaim activity from extraordinary/external
> > requests. External as in: outside the allocation context.
>
> I imagine if the kernel is doing proactive reclaim on its own, we
> might want a separate counter for that anyway to monitor what the
> kernel is doing. So maybe pgscan_user sounds nice for now, but I also
> like that the latter explicitly says "this is external to the
> allocation context". But we can just go with pgscan_user and document
> it properly.

Yes, I think you're right. pgscan_user sounds good to me.
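
Concretely, that would just add one line next to the existing scan
counters in /proc/vmstat (values below are made up for illustration):

    pgscan_kswapd 430238
    pgscan_direct 27419
    pgscan_user 1043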

> How would khugepaged fit in this story? Seems like it would be part of
> pgscan_ext but not pgscan_user. I imagine we also don't want to
> pollute proactive reclaim counters with khugepaged reclaim (or other
> non-direct reclaim).
>
> Maybe pgscan_user and pgscan_kernel/pgscan_indirect for things like khugepaged?
> The problem with pgscan_kernel/indirect is that if we add a proactive
> reclaim kthread in the future it would technically fit there but we
> would want a separate counter for it.
>
> I am honestly not sure where to put khugepaged. The reasons I don't
> like a dedicated counter for khugepaged are:
> - What if other kthreads like khugepaged start doing the same, do we
> add one counter per-thread?

It's unlikely there will be more.

The reason khugepaged doesn't rely on kswapd is unique to THP
allocations: they can require an exorbitant amount of work to
assemble, but due to fragmentation those requests may fail
permanently. We don't want to burden a shared facility like kswapd
with large amounts of speculative work on behalf of what are (still*)
corner-case requests.

This isn't true for other allocations. We do have __GFP_NORETRY sites
here and there that would rather fall back early than put in the full
amount of work; but overall we expect allocations to succeed - and
kswapd to be able to balance for them!** - because the alternative
tends to be OOMs, or drivers and workloads aborting on -ENOMEM.
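
For illustration, such a fallback site typically looks like this (a
sketch modeled on the kvmalloc() pattern; buf and size are just
placeholders):

    void *buf;

    /* Try physically contiguous memory, but bail out early rather
     * than invest the full reclaim/compaction effort. */
    buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
    if (!buf)
        /* Fall back to order-0 pages, which we do expect kswapd
         * to keep available. */
        buf = vmalloc(size);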

(* As we evolve the allocator and normalize huge page requests
(folios), kswapd may also eventually balance for THPs again. IOW,
it's more likely that this exception will disappear again than
that we'll see more like it.)

(** This is also why it's no big deal if other kthreads that rely on
kswapd contribute to direct reclaim stats. First, it's highly
error-prone to determine on a case-by-case basis whether userspace
could be waiting behind that direct reclaim - as Yang Shi's
writeback example demonstrates. Second, if kswapd is overwhelmed,
it's likely to impact userspace *anyway*! The benefit of this
classification work is questionable.)

> - What if we deprecate khugepaged (or such threads)? Seems more likely
> than deprecating kswapd.

If that happens, we can remove the counter again. The bar isn't as
high for vmstat as it is for other ABI, and we've updated it plenty
of times to reflect changes in the MM implementation.

> Looks like we want a stat that would group all of this reclaim coming
> from non-direct kthreads, but would not include a future proactive
> reclaim kthread.

I think the desire to generalize overcomplicates things here in a way
that isn't actually meaningful.

Think of direct reclaim stats as a signal that either a) kswapd is
broken or b) memory pressure is high enough to cause latencies in the
class of requests that are of interest to userspace. This is true for
all cases but khugepaged.
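
Concretely, the split could look something like this in the scan
accounting of mm/vmscan.c (a sketch; PGSCAN_KHUGEPAGED and
current_is_khugepaged() are assumed to be what the patch adds, next
to the existing current_is_kswapd()):

    enum vm_event_item item;

    if (current_is_kswapd())
        item = PGSCAN_KSWAPD;
    else if (current_is_khugepaged())
        /* Carve khugepaged out of the direct reclaim bucket... */
        item = PGSCAN_KHUGEPAGED;
    else
        /* ...so this one remains a clean "userspace could be
         * stalling" signal. */
        item = PGSCAN_DIRECT;

    __count_vm_events(item, nr_scanned);

That keeps pgscan_direct meaningful as exactly the a)/b) signal
above, without inventing a generic bucket for all kernel threads.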