Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex

From: Jesper Dangaard Brouer
Date: Thu Apr 18 2024 - 05:02:47 EST




On 18/04/2024 04.19, Yosry Ahmed wrote:
On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:

Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
with a spinlock").

Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
necessary during the collection of per-CPU stats, this approach has led
to several scaling issues observed in production environments. Holding
this IRQ lock has caused starvation of other critical kernel functions,
such as softirq (e.g., timers and netstack). Although kernel v6.8
introduced optimizations in this area, we continue to observe instances
where the spin_lock is held for 64-128 ms in production.

This patch converts cgroup_rstat_lock back to being a mutex lock. This
change is made possible thanks to the significant effort by Yosry Ahmed
to eliminate all atomic context use-cases through multiple commits,
ending in 0a2dc6ac3329 ("cgroup: removecgroup_rstat_flush_atomic()"),
included in kernel v6.5.

After this patch lock contention will be less obvious, as converting this
to a mutex avoids multiple CPUs spinning while waiting for the lock, but
it doesn't remove the lock contention. It is recommended to use the
tracepoints to diagnose this.

I will keep the high-level conversation about using the mutex here in
the cover letter thread, but I am wondering why we are keeping the
lock dropping logic here with the mutex?


I agree that yielding the mutex in the loop makes less sense.
Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
will be a preemption point for my softirq. But I kept it because, we
are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
there was no sched point for other userspace processes while holding the
mutex, but I don't fully know the sched implication when holding a mutex.


If this is to reduce lock contention, why does it depend on
need_resched()? spin_needbreak() is a good indicator for lock
contention, but need_resched() isn't, right?


As I said, I'm unsure of the semantics of holding a mutex.


Also, how was this tested?


I tested this in a testlab, prior to posting upstream, with parallel
reader of the stat files. As I said in other mail, I plan to experiment
with these patches(2+3) in production, as micro-benchmarking will not
reveal the corner cases we care about. With BPF based measurements of
the lock congestion time, I hope we can catch production issues at a
time scale that is happens prior to user visible impacts.


When I did previous changes to the flushing logic I used to make sure
that userspace read latency was not impacted, as well as in-kernel
flushers (e.g. reclaim). We should make sure there are no regressions
on both fronts.


Agree, we should consider both userspace readers and in-kernel flushers.
Maybe these needed separate handing as they have separate needs.


Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
---
kernel/cgroup/rstat.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index ff68c904e647..a90d68a7c27f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,7 +9,7 @@

#include <trace/events/cgroup.h>

-static DEFINE_SPINLOCK(cgroup_rstat_lock);
+static DEFINE_MUTEX(cgroup_rstat_lock);
static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);

static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
{
bool contended;

- contended = !spin_trylock_irq(&cgroup_rstat_lock);
+ contended = !mutex_trylock(&cgroup_rstat_lock);
if (contended) {
trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
- spin_lock_irq(&cgroup_rstat_lock);
+ mutex_lock(&cgroup_rstat_lock);
}
trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
}
@@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
__releases(&cgroup_rstat_lock)
{
trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
- spin_unlock_irq(&cgroup_rstat_lock);
+ mutex_unlock(&cgroup_rstat_lock);
}

/* see cgroup_rstat_flush() */
@@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
}

/* play nice and yield if necessary */
- if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+ if (need_resched()) {
__cgroup_rstat_unlock(cgrp, cpu);
if (!cond_resched())
cpu_relax();