Re: [PATCH] mm: memcg: provide accurate stats for userspace reads

From: Yosry Ahmed
Date: Mon Aug 14 2023 - 20:40:47 EST


On Mon, Aug 14, 2023 at 5:35 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Mon, Aug 14, 2023 at 05:28:22PM -0700, Yosry Ahmed wrote:
> > > So, the original design used a mutex to synchronize flushing, with the idea
> > > being that updates are high freq but reads are low freq and can be
> > > relatively slow. Using rstat for mm internal operations changed this
> > > assumption quite a bit and we ended up replacing that mutex with a spinlock.
> >
> > Naive question: do mutexes handle thundering herd problems better than
> > spinlocks? I would assume so, but I am not sure.
>
> I don't know. We can ask Waiman if that becomes a problem.
>
> > > * Flush-side, maybe we can break flushing into per-cpu pieces or whatnot, but
> > > there's no avoiding the fact that flushing can take quite a while if there
> > > is a lot to flush, whether locks are split or not. I wonder whether it'd
> > > be possible to go back to a mutex for flushing and update the users to
> > > either consume the cached values or operate in a sleepable context if a
> > > synchronous read is necessary, which is the right thing to do anyway given
> > > how long flushes can take.
> >
> > Unfortunately, flushing cannot be broken down per-cpu, as all flushers
> > update the same per-cgroup counters, so we need a bigger locking
> > scope. Switching to atomics really hurts performance. Breaking down
> > the lock to be per-cgroup is doable, but since we need to lock both
> > the parent and the cgroup, flushing top-level cgroups (which I assume
> > is most common) will lock the root anyway.
>
> Plus, there's not much point in flushing in parallel, so I don't feel too
> enthusiastic about splitting flush locking.
>
> > All flushers right now operate in a sleepable context, so we can go
> > back to the mutex if you think this will make things better. The
>
> Yes, I think that'd be more sane.
>
> > slowness problem reported recently is in a sleepable context; it's
> > just too slow for userspace, if I understand correctly.
>
> I mean, there's a certain amount of work to do. There's no way around it if
> you wanna read the counters synchronously. The only solution there would be
> using a cached value or having some sort of auto-flushing mechanism so that
> the amount to flush doesn't build up too much - e.g. keep a count of the
> number of entries to flush and trigger a flush if it goes over some threshold.

I really hoped you'd continue reading past this point :)

My proposed solution was to flush only the needed subtree rather than
flushing the entire tree all the time, which is what we currently do on
the memcg side. We already have an asynchronous flusher on the memcg
side that runs every 2s to try to keep the size of the pending update
tree bounded, and we already track the magnitude of updates and only
flush when it is significant.
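
Roughly, the gating we already have looks like the following. This is a
simplified sketch, not the actual memcontrol.c code: stats_updates,
STATS_FLUSH_THRESHOLD and do_stats_flush() are made-up names, and the
real threshold/period values differ.

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

static atomic_long_t stats_updates;
#define STATS_FLUSH_THRESHOLD	(64 * num_online_cpus())

/* Hypothetical: would walk the rstat tree, e.g. via cgroup_rstat_flush(). */
static void do_stats_flush(void)
{
}

/* Updaters only bump a counter; no flushing or locking on this path. */
static void memcg_stat_updated(int delta)
{
	atomic_long_add(abs(delta), &stats_updates);
}

/* Flush only when enough updates have accumulated to matter. */
static void maybe_flush(void)
{
	if (atomic_long_read(&stats_updates) > STATS_FLUSH_THRESHOLD) {
		do_stats_flush();
		atomic_long_set(&stats_updates, 0);
	}
}

/* Periodic (2s) flusher that keeps the pending update set bounded. */
static void flush_work_fn(struct work_struct *w);
static DECLARE_DELAYED_WORK(flush_dwork, flush_work_fn);

static void flush_work_fn(struct work_struct *w)
{
	maybe_flush();
	queue_delayed_work(system_unbound_wq, &flush_dwork,
			   msecs_to_jiffies(2000));
}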

The problems in this thread and the other one are:
(a) Sometimes reading from userspace is slow because we needlessly
flush the entire tree.
(b) Sometimes reading from userspace is inaccurate because we skip
flushing if someone else is already flushing, even though we don't know
whether they have flushed the subtree we care about yet.

I believe dropping unified flushing, if possible of course, may fix
both problems.
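
Concretely, the difference I am proposing is roughly the following.
Again a sketch only: stats_flush_lock and the read_stats_*() helpers
are hypothetical and don't match the current code, cgroup_rstat_flush()
is the real rstat entry point.

#include <linux/mutex.h>
#include <linux/cgroup.h>
#include <linux/memcontrol.h>

static DEFINE_MUTEX(stats_flush_lock);	/* hypothetical */

/*
 * Unified flushing (roughly what we do today): every reader flushes the
 * whole tree from the root, and skips the flush entirely if someone else
 * is already flushing -- even if that someone hasn't reached the subtree
 * this reader cares about yet. That's where (a) and (b) come from.
 */
static void read_stats_unified(struct mem_cgroup *memcg)
{
	if (mutex_trylock(&stats_flush_lock)) {
		cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
		mutex_unlock(&stats_flush_lock);
	}
	/* ... read memcg's counters, possibly stale if we skipped ... */
}

/*
 * Non-unified flushing (proposed): flush only the subtree being read,
 * and always do it, so the numbers reported for memcg are fresh and the
 * amount of work scales with the size of that subtree.
 */
static void read_stats_subtree(struct mem_cgroup *memcg)
{
	cgroup_rstat_flush(memcg->css.cgroup);
	/* ... read memcg's counters ... */
}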

>
> Thanks.
>
> --
> tejun