Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
From: Tejun Heo
Date: Mon Aug 14 2023 - 20:15:07 EST
Hello,
On Sat, Aug 12, 2023 at 04:04:32AM -0700, Yosry Ahmed wrote:
> Taking a step back though, and considering there have been other
> complaints about unified flushing causing expensive reads from
> memory.stat [1], I am wondering if we should tackle the fundamental
> problem.
>
> We have a single global rstat lock for flushing, which protects the
> global per-cgroup counters as far as I understand. A single lock means
> a lot of contention, which is why we implemented unified flushing on
> the memcg side in the first place, where we only let one flusher
> operate and everyone else skip, but that flusher needs to flush the
> entire tree.
>
> This can be unnecessarily expensive (see [1]), and to avoid how
> expensive it is we sacrifice accuracy (what this patch is about). I am
> exploring breaking down that lock into per-cgroup locks, where a
> flusher acquires locks in a top down fashion. This allows for some
> concurrency in flushing, and makes unified flushing unnecessary. If we
> retire unified flushing we fix both accuracy and expensive reads at
> the same time, while not sacrificing performance for concurrent
> in-kernel flushers.
>
> What do you think? I am prototyping something now and running some
> tests, it seems promising and simple-ish (unless I am missing a big
> correctness issue).
So, the original design used mutex for synchronize flushing with the idea
being that updates are high freq but reads are low freq and can be
relatively slow. Using rstats for mm internal operations changed this
assumption quite a bit and we ended up switching that mutex with a lock.
Here are some suggestions:
* Update-side, per-cpu lock should be fine. I don't think splitting them
would buy us anything meaningful.
* Flush-side, maybe we can break flushing into per-cpu or whatnot but
there's no avoiding the fact that flushing can take quite a while if there
are a lot to flush whether locks are split or not. I wonder whether it'd
be possible to go back to mutex for flushing and update the users to
either consume the cached values or operate in a sleepable context if
synchronous read is necessary, which is the right thing to do anyway given
how long flushes can take.
Thanks.
--
tejun