Re: [PATCH] mm: memcg: provide accurate stats for userspace reads

From: Shakeel Butt
Date: Tue Aug 15 2023 - 21:15:32 EST


On Tue, Aug 15, 2023 at 5:29 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
[...]
> >
> > I thought we already reached the decision on how to proceed here. Let
> > me summarize what I think we should do:
> >
> > 1. Completely remove the sync flush from stat files read from userspace.
> > 2. Provide a separate way/interface to explicitly flush stats for
> > users who want more accurate stats and can pay the cost. This is
> > similar to the stat_refresh interface.
> > 3. Keep the 2 sec periodic stats flusher.
>
> I think this solution is suboptimal to be honest, I think we can do better.
>
> With recent improvements to spinlocks/mutexes, and flushers becoming
> sleepable, I think a better solution would be to remove unified
> flushing and let everyone only flush the subtree they care about. Sync
> flushing becomes much better (unless you're flushing root ofc), and
> concurrent flushing wouldn't cause too many problems (ideally no
> thundering herd, and rstat lock can be dropped at cpu boundaries in
> cgroup_rstat_flush_locked()).
>
> If we do this, stat reads can be much faster as Ivan demonstrated with
> his patch that only flushes the cgroup being read, and we do not
> sacrifice accuracy as we never skip flushing. We also do not need a
> separate interface for explicit refresh.
>
> In all cases, we need to keep the 2 sec periodic flusher. What we need
> to figure out if we remove unified flushing is:
>
> 1. Handling stats_flush_threshold.
> 2. Handling flush_next_time.
>
> Both of these are global now, and will need to be adapted to
> non-unified non-global flushing.

The only thing we disagree on is whether to (1) completely remove the
sync flush and add an explicit flush interface, or (2) keep doing the
sync flush, but only of the subtree being read.

To me (1) seems more optimal, particularly for the server use-case
where a node controller reads the stats of root as well as of the
cgroups in the top couple of levels (we actually do this internally).
Flushing once explicitly and then reading the stats for all such
cgroups seems better to me.
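
To make the comparison concrete, here is a toy Python model of the two
options. It is not kernel code: the Cgroup class, the flush() walk and
both read helpers are hypothetical names invented for illustration. The
model assumes no updates arrive between reads, and it walks every node
in a subtree, whereas real rstat only visits cgroups with pending
updates. It only counts flush invocations and nodes walked, and says
nothing about lock contention or latency.

```python
class Cgroup:
    """Toy cgroup: 'pending' holds un-flushed deltas, 'stat' the flushed value."""

    def __init__(self, name, parent=None, pending=0):
        self.name = name
        self.parent = parent
        self.children = []
        self.pending = pending
        self.stat = 0
        if parent is not None:
            parent.children.append(self)


def flush(cg, counters):
    """Flush cg's subtree bottom-up and push cg's delta into its parent."""
    counters["visited"] += 1
    for child in cg.children:
        flush(child, counters)
    # Children have already added their deltas to cg.pending at this point.
    delta, cg.pending = cg.pending, 0
    cg.stat += delta
    if cg.parent is not None:
        cg.parent.pending += delta


def read_after_explicit_flush(root, targets):
    """Option (1): one explicit flush of root, then plain reads."""
    counters = {"flushes": 1, "visited": 0}
    flush(root, counters)
    return {cg.name: cg.stat for cg in targets}, counters


def read_with_per_read_flush(targets):
    """Option (2): every read first flushes the subtree being read."""
    counters = {"flushes": 0, "visited": 0}
    values = {}
    for cg in targets:
        counters["flushes"] += 1
        flush(cg, counters)
        values[cg.name] = cg.stat
    return values, counters


def build_tree():
    """Root with two top-level cgroups, each with two leaves."""
    root = Cgroup("root")
    a = Cgroup("a", root, pending=1)
    b = Cgroup("b", root, pending=2)
    Cgroup("a1", a, pending=10)
    Cgroup("a2", a, pending=20)
    Cgroup("b1", b, pending=30)
    Cgroup("b2", b, pending=40)
    # The node controller reads root and the top-level cgroups.
    return root, [root, a, b]


if __name__ == "__main__":
    root, targets = build_tree()
    print(read_after_explicit_flush(root, targets))
    root, targets = build_tree()
    print(read_with_per_read_flush(targets))
```

In this model both options return the same values for root, a and b,
but option (1) issues a single flush while option (2) issues one flush
per read. How much that difference costs on a real machine depends on
the rstat lock behaviour discussed above, which the model does not
capture.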