Re: [PATCH V10] cgroup/rstat: Avoid flushing if there is an ongoing root flush

From: Yosry Ahmed
Date: Thu Sep 12 2024 - 12:35:44 EST


>
> [..]
> >>
> >> My userspace readers (of /sys/fs/cgroup/*/*/memory.stat) are primarily
> >> cadvisor and systemd, which doesn't need this accuracy.
> >>
> >> Can you explain your userspace use-case that need this accuracy?
> >
> >
> > Please look at the commit log of this patch [1] that removed
> > stats_flush_ongoing (the one in Fixes).
> >
> > [1]https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@xxxxxxxxxx/
> >
>
> I think I understand the inaccuracy concern now. So, it is the 2 second
> periodic flush that is the concern. That can cause up to 2 seconds old
> data to be read by userspace, when a read contends with an ongoing root
> flusher (if it doesn't wait for the flush). I agree that 2 sec old
> data is too inaccurate.

Yeah and it's not just the delay, it's also the fact that you may read
the stats *after* an event (e.g. proactive reclaim or memory usage
spike), and get the stats from *before* the event. This leads to wrong
decisions being made by userspace.

>
>
> I'm coding V11 that will "wait_for_flush" in the userspace call paths.

I'd rather we keep the flushers consistent to avoid complexity if
possible (i.e. if Shakeel's patch works for you). But if that's the
only way forward, maybe that's what we need for now. I will wait to
see the code.
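
Just so we are picturing the same thing, here is roughly what I imagine
"wait_for_flush" looking like for userspace readers. This is only a
sketch under my assumptions, not your V11 code: ongoing_root_flush,
root_flush_done and mem_cgroup_flush_stats_wait() are names I made up,
and the real thing still has to handle re-arming the completion and
races with a new flusher becoming ongoing.

#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/memcontrol.h>

/* All names here are illustrative, not taken from the actual patch. */
static atomic_t ongoing_root_flush = ATOMIC_INIT(0);
static DECLARE_COMPLETION(root_flush_done);

static void mem_cgroup_flush_stats_wait(struct mem_cgroup *memcg)
{
	if (atomic_read(&ongoing_root_flush)) {
		/*
		 * A root-level flush is in flight; userspace readers can
		 * sleep until it finishes instead of contending on the
		 * rstat lock and redoing part of the work.
		 */
		wait_for_completion(&root_flush_done);
		return;
	}

	/* No flush in flight: do a normal (accurate) flush. */
	mem_cgroup_flush_stats(memcg);
}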

[..]
>
>
> I like the idea of eliminating all in-kernel flushers.
> Until someone works on that, I will use my patch in production, as I
> need to fix the production issues ASAP.

I took a brief look at the ones in the reclaim path and I don't think
they are easy to remove tbh. Maybe MGLRU will become the default soon
and they will naturally go away (MGLRU does not have these flushes).

[..]
>
> >
> > If it takes one kswapd thread 24 ms to flush the stats, then that's
> > the raw flush time. If all 12 kswapd threads started at different
> > times they would all spend 24 ms flushing anyway, so waiting for the
> > ongoing flusher is not a regression or a newly introduced delay. The
> > ongoing flusher mechanism rather tries to optimize this by avoiding
> > the lock contention and waiting for the ongoing flusher rather than
> > competing on the lock and redoing some of the work.
> >
>
> We are observing kswapd isn't running "fast enough" in production (e.g.
> packet processing using GFP_ATOMIC allocations starts to fail). Thus,
> I really don't want to delay the 11 other kswapd threads by waiting on
> a flush; I really prefer to skip the flush, such that they can do the
> much needed memory reclaim work.

That's one more reason I think maybe this needs to be handled in the
reclaim path. I do not think other flushers relate to this situation.
It would be nice if we could just skip the flush in the reclaim path
under these circumstances.
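
Roughly something like the sketch below is what I have in mind, purely
as an illustration: root_flush_in_progress() is a hypothetical
predicate, and the real check would sit next to the existing flush call
in the reclaim code rather than in a new helper.

#include <linux/memcontrol.h>
#include <linux/swap.h>

/*
 * Illustrative sketch only: let reclaim skip the flush when a
 * root-level flush is already in flight, since slightly stale stats
 * are tolerable there and kswapd must not stall on the rstat lock.
 */
static void reclaim_flush_stats(struct mem_cgroup *memcg)
{
	if (current_is_kswapd() && root_flush_in_progress())
		return;	/* use slightly stale stats, keep reclaiming */

	mem_cgroup_flush_stats_ratelimited(memcg);
}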

>
> >>
> >>
> >> I need to remind people that "completely skipping the flush" due to an
> >> ongoing flusher worked well for us before kernel v6.8 (before commit
> >> 7d7ef0a4686a). So, I really don't see how skipping the flush when there
> >> is an ongoing flusher is that controversial.
> >
> >
> > Skipping the flush was introduced in v5.15 as part of aa48e47e3906
> > ("memcg: infrastructure to flush memcg stats"). Before then, reading
> > the stats from userspace was as accurate as possible. When we moved to
> > a kernel with that commit we noticed the regression. So it wasn't
> > always the case that userspace reads were inaccurate or did not flush.
> >>
> >> I think it is controversial to *wait* for the ongoing flusher as that
> >> IMHO defeats the whole purpose of having an ongoing flusher...
> >
> >
> > The point of having an ongoing flusher is to avoid reacquiring the
> > lock after they are done, and checking all the percpu trees again for
> > updates, which would be a waste of work and unnecessary contention on
> > the lock. It's definitely an improvement over directly competing over
> > the lock, yet it doesn't sacrifice accuracy.
> >
> >>
> >> then we could just have a normal mutex lock if we want to wait.
> >
> >
> > I am not against using a mutex as I mentioned before. If there are
> > concerns about priority inversions we can add a timeout as we
> > discussed. The ongoing flusher mechanism is similar in principle to a
> > mutex, the advantage is that whoever holds the lock does not sleep, so
> > it gets the flush done faster and waiters wake up faster.
> >
>
> My plan is to lower contention on this rstat lock "enough" (e.g. with
> this patch), which should make it safer to switch to a mutex.

Do we know that switching to a mutex in the current state is
problematic? If the only concern is priority inversion, then a timeout
should work.

Also, I don't think a mutex will help your case of kswapd not running
fast enough, right?
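
To spell out what I mean by a timeout, something like the following is
what I picture (rough sketch only: ongoing_flush_done and the 50 ms
bound are invented for illustration, and re-arming the completion /
racing with a new flusher is glossed over):

#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/memcontrol.h>

/*
 * Arbitrary illustrative bound on how long a reader waits behind an
 * ongoing flusher before giving up and flushing by itself.
 */
#define RSTAT_FLUSH_WAIT_MAX	msecs_to_jiffies(50)

static void flush_or_wait_bounded(struct mem_cgroup *memcg,
				  struct completion *ongoing_flush_done)
{
	if (!wait_for_completion_timeout(ongoing_flush_done,
					 RSTAT_FLUSH_WAIT_MAX)) {
		/* The ongoing flusher took too long: flush ourselves. */
		mem_cgroup_flush_stats(memcg);
	}
}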

[..]
>
>
> >>
> >> It is not only the flush in the kswapd path that concerns me.
> >> My other concern is the userspace cadvisor that periodically reads ALL
> >> the .stat files on the system and creates flush spikes (every minute).
> >> When cadvisor collides with a root-cgroup flush (either the 2 sec
> >> periodic flush or kswapd), bad interactions happen in prod.
> >
> >
> > I believe the problem here is the kswapd flushers competing with
> > cadvisor userspace read. I don't think the periodic flusher that runs
> > every 2s colliding with the cadvisor reader that runs every minute
> > would really cause a problem. Also both of these paths should not be
> > latency sensitive anyway. So again, Shakeel's patch should help here.
> >
> > Did you check if Shakeel's patch fixes your problem?
> >
>
> I will also try out Shakeel's patch. This will hide the specific
> contention issue until something starves the kthread that does the
> periodic 2 second flush (for 2 periods). In production we are seeing
> kthreads getting starved for longer than 20 seconds. This often happens
> in connection with the OOM killer. This recreates the kswapd lock
> contention situation at a very unfortunate point in time. Thus, it
> makes sense to have this ongoing flusher lock contention protection in
> place.

Yeah we may need to completely skip the flush under dire circumstances
in the reclaim path, as I mentioned above. This feels more like a
reclaim problem than an rstat problem at this point.

>
>
> BTW, there is still a mem_cgroup_flush_stats() call remaining in
> zswap_shrinker_count() that we might still hit after Shakeel's patch,
> and a direct call to do_flush_stats() in obj_cgroup_may_zswap().

Yeah there is a plan to remove these.

Nhat, are you currently working on this? If not I can try to find a
few cycles to address this.
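
For context, one possible direction there (not necessarily what we will
end up doing, and assuming the shrinker only needs the memcg zswap
page-state counters) is to simply tolerate slightly stale counters in
the zswap shrinker instead of forcing a flush from reclaim context:

#include <linux/memcontrol.h>

/*
 * Hypothetical direction only: drop the flush in the zswap shrinker
 * and read the possibly stale counter directly, accepting some
 * imprecision in the shrinker's estimate.
 */
static unsigned long zswap_backing_pages_estimate(struct mem_cgroup *memcg)
{
	/* Deliberately no mem_cgroup_flush_stats() call here. */
	return memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
}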

[..]
>
> >>>
> >>> I thought we agreed to wait for the ongoing flusher to complete, but
> >>> only allow root cgroups to become the ongoing flusher (at least
> >>> initially). Not sure what changed.
> >>>
> >>
> >> Practical implementation (with completions) and production experiments
> >> are what changed my mind. Thus, I no longer agree that waiting for the
> >> ongoing flusher to complete is the right solution.
> >
> >
> > My understanding based on [1] was that the ongoing flusher mechanism
> > with only root flushers fixed the problem, but maybe you got more data
> > afterward.
> >
>
> [1] is V8 (which allowed an ongoing flusher below level 2); production
> experience from that shows that we need to confine the ongoing flusher
> to the root cgroup only. (This already worked in V2.) The V9
> production experience shows that using 'completions' caused issues and
> that implementing this race-free is very hard.


So you observed production problems with completions even when only
root cgroups are allowed to be the ongoing flusher, after you thought
it fixed the problem initially. Did I get that right?