Re: [PATCH V11] cgroup/rstat: Avoid flushing if there is an ongoing root flush

From: Yosry Ahmed
Date: Thu Sep 12 2024 - 17:42:11 EST


On Thu, Sep 12, 2024 at 10:07 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
>
> This patch reintroduces and generalizes the "stats_flush_ongoing" concept
> to avoid redundant flushes if there is an ongoing flush at cgroup root
> level, addressing production lock contention issues on the global cgroup
> rstat lock.
>
> In this revision userspace readers will wait for the ongoing flusher to
> complete before returning, to avoid reading outdated stats just before
> they get updated. In-kernel users will generally attempt to skip the
> flush in order to get out of the lock contention state. Some in-kernel
> users of the cgroup_rstat_flush() API depend on waiting for the flush to
> complete before continuing. This patch introduces the call
> cgroup_rstat_flush_relaxed() with a wait_for_flush option to satisfy both
> use cases.
>
> At Cloudflare, we observed significant performance degradation due to
> lock contention on the rstat lock, primarily caused by kswapd. The
> specific mem_cgroup_flush_stats() call inlined in shrink_node, which
> takes the rstat lock, is particularly problematic.
>
> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, we
> noted severe lock contention on the rstat lock, causing 12 CPUs to waste
> cycles spinning every time kswapd runs. Fleet-wide stats (/proc/N/schedstat)
> for kthreads revealed that we are burning an average of 20,000 CPU cores
> fleet-wide on kswapd, primarily due to spinning on the rstat lock.
>
> Here's a brief overview of the issue:
> - __alloc_pages_slowpath calls wake_all_kswapds, causing all kswapdN threads
> to wake up simultaneously.
> - The kswapd thread invokes shrink_node (via balance_pgdat), triggering the
> cgroup rstat flush operation as part of its work.
> - balance_pgdat() has a NULL value in target_mem_cgroup, causing
> mem_cgroup_flush_stats() to flush with root_mem_cgroup.
>
> The kernel previously addressed this with a "stats_flush_ongoing" concept,
> which was removed in commit 7d7ef0a4686a ("mm: memcg: restore subtree stats
> flushing"). This patch reintroduces and generalizes the concept to apply to
> all users of cgroup rstat, not just memcg.
>
> It has been a general theme to replace mem_cgroup_flush_stats() with
> mem_cgroup_flush_stats_ratelimited() every time we see a new case of this
> issue. This only hides the contention issue until something starves the
> kthread that does the periodic 2 second flush (for 2 periods). In
> production we are seeing kthreads getting starved for longer than 20
> seconds. This often happens in connection with the OOM killer, and it
> recreates the kswapd lock contention situation at a very unfortunate
> point in time. Thus, it makes sense to have this ongoing-flusher lock
> contention protection in place.
>
> In this patch only a root cgroup can become the ongoing flusher, as this solves
> the production issue. Letting other levels become the ongoing flusher causes the
> root cgroup to contend on the lock again.
>
> This change significantly reduces lock contention, especially in
> environments with multiple NUMA nodes, thereby improving overall system
> performance.
>
> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
> ---

To reiterate my response on v10, I would prefer that this problem be
handled at the reclaim flushing site. This started as a nice, simple,
and generic approach, but ended up being complex and tailored to handle
the reclaim flushing case.

I do have some comments on the current implementation nonetheless.

> v11:
> - Address Yosry request to wait-for-flush for userspace readers
>
> V10: https://lore.kernel.org/all/172547884995.206112.808619042206173396.stgit@firesoul/
>
> block/blk-cgroup.c | 2 -
> include/linux/cgroup.h | 1
> include/linux/memcontrol.h | 1
> kernel/cgroup/rstat.c | 133 ++++++++++++++++++++++++++++++++++++++++++--
> mm/memcontrol.c | 40 +++++++++----
> mm/vmscan.c | 2 -
> mm/zswap.c | 2 -
> 7 files changed, 157 insertions(+), 24 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 37e6cc91d576..058393e7665a 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1200,7 +1200,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
> if (!seq_css(sf)->parent)
> blkcg_fill_root_iostats();
> else
> - cgroup_rstat_flush(blkcg->css.cgroup);
> + cgroup_rstat_flush_relaxed(blkcg->css.cgroup, true);
>
> rcu_read_lock();
> hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 2150ca60394b..ff65bc100ca5 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -691,6 +691,7 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
> void cgroup_rstat_flush(struct cgroup *cgrp);
> void cgroup_rstat_flush_hold(struct cgroup *cgrp);
> void cgroup_rstat_flush_release(struct cgroup *cgrp);
> +int cgroup_rstat_flush_relaxed(struct cgroup *cgrp, bool wait_for_flush);

We now have 4 different flavors of rstat flushing:

1. cgroup_rstat_flush() -> normal flush: lock, flush, unlock
2. cgroup_rstat_flush_hold() -> same, but keep the lock held
3. cgroup_rstat_flush_relaxed(wait_for_flush=true) -> if someone is
already flushing us, wait for them to finish, otherwise normal
4. cgroup_rstat_flush_relaxed(wait_for_flush=false) -> if someone is
already flushing us, skip the flush, otherwise normal

Why do we need the third one? Can't we just keep using (1) for now for
userspace readers?

>
> /*
> * Basic resource stats.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 030d34e9d117..7e24c5e1327f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1026,6 +1026,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> enum node_stat_item idx);
>
> void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
> +void mem_cgroup_flush_stats_relaxed(struct mem_cgroup *memcg);
> void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
>
> void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index a06b45272411..80a4b949138f 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -11,6 +11,9 @@
>
> static DEFINE_SPINLOCK(cgroup_rstat_lock);
> static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> +static struct cgroup *cgrp_rstat_ongoing_flusher = NULL;
> +static struct task_struct *cgrp_rstat_ongoing_flusher_ID = NULL;

rstat_ongoing_flush_cgrp and rstat_ongoing_flush_task are probably
clearer names.

> +static DEFINE_MUTEX(cgrp_rstat_ongoing_flusher_serialize);
>
> static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>
> @@ -299,6 +302,68 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> spin_unlock_irq(&cgroup_rstat_lock);
> }
>
> +static inline bool cgroup_is_root(struct cgroup *cgrp)
> +{
> + return cgroup_parent(cgrp) == NULL;
> +}
> +
> +/**
> + * cgroup_rstat_trylock_flusher - Trylock that checks for an ongoing flusher
> + * @cgrp: target cgroup
> + * @strict: always lock and ignore/skip ongoing flusher checks
> + *
> + * The return value follows trylock semantics. It returns true when the lock is
> + * obtained, and false when the lock was not taken because flushing can be
> + * skipped, as another ongoing flusher is taking care of the flush.
> + *
> + * For callers that depend on the flush completing before returning, a strict
> + * option is provided.
> + */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp, bool strict)
> +{
> + struct cgroup *ongoing;
> +
> + if (strict)
> + goto lock;

Might as well just have a cgroup_rstat_lock_flusher() function instead
of the boolean parameter, which is also consistent with the
lock/trylock semantics you are following.
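
Something along these lines (untested sketch, just to show the split):

static void cgroup_rstat_lock_flusher(struct cgroup *cgrp)
{
        /* Strict callers always take the lock, ignoring any ongoing flusher */
        __cgroup_rstat_lock(cgrp, -1);
}

and then __cgroup_rstat_flush() picks one or the other:

        if (strict)
                cgroup_rstat_lock_flusher(cgrp);
        else if (!cgroup_rstat_trylock_flusher(cgrp))
                return false;

with cgroup_rstat_trylock_flusher() keeping only the ongoing-flusher checks.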

> +
> + /*
> + * Check if ongoing flusher is already taking care of this. Descendant
> + * check is necessary due to cgroup v1 supporting multiple roots.
> + */
> + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> + if (ongoing && cgroup_is_descendant(cgrp, ongoing))

The ongoing flusher may be going away, so cgroup_is_descendant() may
be a UAF as I pointed out a few times. I actually think even taking a
ref here may not work as it may be a flush from cgroup_rstat_exit().

One thing we can do is get the root of cgrp (probably
cgrp->root->cgrp?) and compare it to cgrp_rstat_ongoing_flusher
without ever dereferencing it. If the ongoing flusher is in fact the
root of cgrp, this implies a ref on it anyway so we can dereference it
if needed.

That would obviously need a comment to explain it.
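
IOW, something like this (untested, and assuming &cgrp->root->cgrp is the
right way to get at the root):

        /*
         * Never dereference the ongoing flusher, it may be going away.
         * Only roots can become the ongoing flusher, so comparing it
         * against our own root by pointer is enough, and if they match
         * we effectively hold a ref on it through @cgrp anyway.
         */
        ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
        if (ongoing && ongoing == &cgrp->root->cgrp)
                return false;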

> + return false;
> +
> + /* Grab right to be ongoing flusher */
> + if (!ongoing && cgroup_is_root(cgrp)) {
> + struct cgroup *old;
> +
> + old = cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, cgrp);
> + if (old) {
> + /* Lost race for being ongoing flusher */
> + if (cgroup_is_descendant(cgrp, old))
> + return false;
> + }
> + /* Due to lock yield combined with strict mode record ID */

This needs a more detailed comment.
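
Presumably something along the lines of (my guess at what it is trying
to say):

        /*
         * cgroup_rstat_flush_locked() can yield the lock, and strict
         * callers take the lock even when a flusher is ongoing. Record
         * which task claimed the ongoing flusher role so that
         * cgroup_rstat_unlock_flusher() only clears it when that same
         * task releases the lock.
         */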

> + WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, current);

This will overwrite the ID if we lost the race but are not a
descendant of the flusher that won the race, right?
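
If that is not intended, maybe only the winner of the cmpxchg should
record its task, e.g. (untested):

                old = cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, cgrp);
                if (old) {
                        /* Lost race for being ongoing flusher */
                        if (cgroup_is_descendant(cgrp, old))
                                return false;
                } else {
                        /* We won the race, record the claiming task */
                        WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, current);
                }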

> + }
> +lock:
> + __cgroup_rstat_lock(cgrp, -1);
> +
> + return true;
> +}
> +
> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
> +{
> + if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher) &&
> + READ_ONCE(cgrp_rstat_ongoing_flusher_ID) == current) {
> + WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, NULL);
> + WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
> + }
> +
> + __cgroup_rstat_unlock(cgrp, -1);
> +}
> +
> /* see cgroup_rstat_flush() */
> static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
> @@ -333,6 +398,19 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> }
> }
>
> +static int __cgroup_rstat_flush(struct cgroup *cgrp, bool strict)
> +{
> + might_sleep();
> +
> + if (!cgroup_rstat_trylock_flusher(cgrp, strict))
> + return false;
> +
> + cgroup_rstat_flush_locked(cgrp);
> + cgroup_rstat_unlock_flusher(cgrp);
> +
> + return true;
> +}
> +
> /**
> * cgroup_rstat_flush - flush stats in @cgrp's subtree
> * @cgrp: target cgroup
> @@ -348,11 +426,49 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> */
> __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
> {
> - might_sleep();
> + __cgroup_rstat_flush(cgrp, true);
> +}
>
> - __cgroup_rstat_lock(cgrp, -1);
> - cgroup_rstat_flush_locked(cgrp);
> - __cgroup_rstat_unlock(cgrp, -1);
> +int cgroup_rstat_flush_relaxed(struct cgroup *cgrp, bool wait_for_flush)
> +{
> + bool flushed = __cgroup_rstat_flush(cgrp, false);
> +
> + if (!flushed && wait_for_flush) {

Isn't the code below essentially open-coding completions in a less
efficient way? Anyway, as I mentioned above, I don't believe we really
need wait_for_flush for now. I was hoping we could make this work for
both in-kernel and userspace flushers, but I don't think we want to
add it just for userspace flushers.
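
Roughly what I have in mind with completions (untested, names are just
placeholders, and the completion lifetime/races would need more thought):

static DECLARE_COMPLETION(cgrp_rstat_flush_done);

        /* when grabbing the right to be the ongoing flusher */
        reinit_completion(&cgrp_rstat_flush_done);

        /* in cgroup_rstat_unlock_flusher(), after clearing the ongoing flusher */
        complete_all(&cgrp_rstat_flush_done);

        /* waiter side, replacing the retry loop below */
        if (!flushed && wait_for_flush)
                wait_for_completion_timeout(&cgrp_rstat_flush_done,
                                            msecs_to_jiffies(50));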

> + /*
> + * Reaching here we know an ongoing flusher is running, that
> + * will take care of flushing for us, but for caller to read
> + * accurate stats we want to wait for this ongoing flusher.
> + *
> + * TODO: When the lock becomes a mutex and no longer yields, this
> + * code can be simplified as we can just sleep on the mutex lock.
> + */
> + struct task_struct *id, *cur_id;
> + u64 timeout;
> +
> + id = READ_ONCE(cgrp_rstat_ongoing_flusher_ID);
> + timeout = jiffies_64 + msecs_to_jiffies(50);
> +
> + if (!id)
> + return false;
> +
> + cond_resched();
> + /* We might get lucky and the flush already completed */
> + cur_id = READ_ONCE(cgrp_rstat_ongoing_flusher_ID);
> +
> + /* Due to lock yield, make sure "id" flusher completes */
> + while (cur_id == id && time_before64(jiffies_64, timeout)) {
> + cond_resched();
> + /* Use mutex to reduce stress on global lock */
> + mutex_lock(&cgrp_rstat_ongoing_flusher_serialize);
> + __cgroup_rstat_lock(cgrp, -1);
> + /* Getting the lock while a flush is ongoing can happen due to yielding */
> + cur_id = READ_ONCE(cgrp_rstat_ongoing_flusher_ID);
> + __cgroup_rstat_unlock(cgrp, -1);
> + mutex_unlock(&cgrp_rstat_ongoing_flusher_serialize);
> + }
> + }
> +
> + return flushed;
> }
>
> /**
> @@ -368,8 +484,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
> __acquires(&cgroup_rstat_lock)
> {
> might_sleep();
> - __cgroup_rstat_lock(cgrp, -1);
> - cgroup_rstat_flush_locked(cgrp);
> +
> + if (cgroup_rstat_trylock_flusher(cgrp, false))
> + cgroup_rstat_flush_locked(cgrp);
> + else
> + __cgroup_rstat_lock(cgrp, -1);
> }
>
> /**
> @@ -379,7 +498,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
> void cgroup_rstat_flush_release(struct cgroup *cgrp)
> __releases(&cgroup_rstat_lock)
> {
> - __cgroup_rstat_unlock(cgrp, -1);
> + cgroup_rstat_unlock_flusher(cgrp);
> }
>
> int cgroup_rstat_init(struct cgroup *cgrp)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 71fe2a95b8bd..6694f7a859b5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -871,12 +871,26 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> }
> }
>
> -static void do_flush_stats(struct mem_cgroup *memcg)
> +static void do_flush_stats(struct mem_cgroup *memcg, bool wait_for_flush)
> {
> - if (mem_cgroup_is_root(memcg))
> - WRITE_ONCE(flush_last_time, jiffies_64);
> + bool flushed = cgroup_rstat_flush_relaxed(memcg->css.cgroup,
> + wait_for_flush);
>
> - cgroup_rstat_flush(memcg->css.cgroup);
> + if (mem_cgroup_is_root(memcg) && flushed)
> + WRITE_ONCE(flush_last_time, jiffies_64);
> +}
> +
> +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg,
> + bool wait_for_flush)
> +{
> + if (mem_cgroup_disabled())
> + return;
> +
> + if (!memcg)
> + memcg = root_mem_cgroup;
> +
> + if (memcg_vmstats_needs_flush(memcg->vmstats))
> + do_flush_stats(memcg, wait_for_flush);
> }
>
> /*
> @@ -890,21 +904,19 @@ static void do_flush_stats(struct mem_cgroup *memcg)
> */
> void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> {
> - if (mem_cgroup_disabled())
> - return;
> -
> - if (!memcg)
> - memcg = root_mem_cgroup;
> + __mem_cgroup_flush_stats(memcg, true);
> +}
>
> - if (memcg_vmstats_needs_flush(memcg->vmstats))
> - do_flush_stats(memcg);
> +void mem_cgroup_flush_stats_relaxed(struct mem_cgroup *memcg)
> +{
> + __mem_cgroup_flush_stats(memcg, false);
> }
>
> void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
> {
> /* Only flush if the periodic flusher is one full cycle late */
> if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
> - mem_cgroup_flush_stats(memcg);
> + mem_cgroup_flush_stats_relaxed(memcg);

mem_cgroup_flush_stats_ratelimited() now basically says: only flush if
the periodic flusher is late but no one else is flushing. If we end up
pursuing this, we probably want to start documenting these variants.
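
For example, something along these lines on top of
mem_cgroup_flush_stats_ratelimited() (wording is just a suggestion):

/*
 * mem_cgroup_flush_stats_ratelimited - best-effort, opportunistic flush
 *
 * Only flushes if the periodic flusher is one full cycle late (i.e. more
 * than 2*FLUSH_TIME since the last flush), and even then the flush is
 * skipped if an ongoing root flusher already covers this subtree.
 * Callers must tolerate reading stale stats.
 */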