Re: [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu)
From: Mateusz Guzik
Date: Thu Mar 27 2025 - 10:40:47 EST
On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
> >
> > Would your problem go away toggling this every -- say -- 8 cpus?
>
> I was concerned about this too, and about more lock bouncing, but the
> testing suggests that this actually overall improves the latency of
> cgroup_rstat_flush_locked() (at least on tested HW).
>
> So I don't think we need to do something like this unless a regression
> is observed.
>
To my reading it reduces the max time spent with irqs disabled, which of
course it does -- after all, it toggles them for every CPU.

Per my other e-mail in the thread, the irq + lock trips remain costly,
at least on Sapphire Rapids.

In my testing, outlined below, I see an 11% increase in total execution
time when taking the irq + lock trip for every CPU in a 24-way vm.

So I stand by instead doing this every n CPUs -- call it 8 or whatever.
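To make this concrete, here is a rough sketch of what I mean (untested;
"batched" would be a new local in cgroup_rstat_flush_locked, initialized
to 0, and the batch size of 8 is arbitrary):

	/*
	 * Drop the lock only every 8 CPUs, unless someone is
	 * already waiting on us.
	 */
	if (++batched >= 8 || need_resched() ||
	    spin_needbreak(&cgroup_rstat_lock)) {
		batched = 0;
		__cgroup_rstat_unlock(cgrp, cpu);
		if (!cond_resched())
			cpu_relax();
		__cgroup_rstat_lock(cgrp, cpu);
	}

This caps the irq-off window at 8 CPUs worth of flushing while paying
for the unlock/lock cycle only once per batch.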
How to repro:
I employed a poor-man's profiler like so:
bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'
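The same probes spelled out, purely for readability (functionally
identical to the one-liner):

kprobe:cgroup_rstat_flush_locked
{
	@start[tid] = nsecs;
}

kretprobe:cgroup_rstat_flush_locked
/@start[tid]/
{
	print(nsecs - @start[tid]);
	delete(@start[tid]);
}

interval:s:60
{
	exit();
}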
This patch or not, execution time varies wildly even while the box is idle.
The above runs for a minute, collecting 23 samples (you may get
"lucky" and get one extra; if so, drop it for comparison).
A sysctl was added to toggle the new behavior vs the old one; the patch
is at the end.
"enabled" (1) means the new behavior, "disabled" (0) means the old one.
Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
disabled: 903610
enabled: 1006833 (+11.4%)
Toggle at runtime with:
sysctl fs.magic_tunable=0 # disabled, no mandatory relocks
sysctl fs.magic_tunable=1 # enabled, relock for every CPU
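For completeness, a full comparison run boils down to the following
(file names are just examples; the "Attaching N probes..." header that
bpftrace emits sums as 0 in awk, so it is harmless):

PROG='kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'

sysctl fs.magic_tunable=0
bpftrace -e "$PROG" > disabled.txt

sysctl fs.magic_tunable=1
bpftrace -e "$PROG" > enabled.txt

awk '{ sum += $1 } END { print sum }' disabled.txt
awk '{ sum += $1 } END { print sum }' enabled.txt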
The stats I got are included at the end for reference.
I patched v6.14 with the following:
diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..441f89421413 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -106,6 +106,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
}
+unsigned long magic_tunable;
+
static const struct ctl_table fs_stat_sysctls[] = {
{
.procname = "file-nr",
@@ -123,6 +125,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
.extra1 = SYSCTL_LONG_ZERO,
.extra2 = SYSCTL_LONG_MAX,
},
+ {
+ .procname = "magic_tunable",
+ .data = &magic_tunable,
+ .maxlen = sizeof(magic_tunable),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra1 = SYSCTL_LONG_ZERO,
+ .extra2 = SYSCTL_LONG_MAX,
+ },
+
{
.procname = "nr_open",
.data = &sysctl_nr_open,
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 3e01781aeb7b..f6444bf25b2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,6 +299,8 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
spin_unlock_irq(&cgroup_rstat_lock);
}
+extern unsigned long magic_tunable;
+
/* see cgroup_rstat_flush() */
static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -323,12 +325,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
rcu_read_unlock();
}
- /* play nice and yield if necessary */
- if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+ if (READ_ONCE(magic_tunable)) {
__cgroup_rstat_unlock(cgrp, cpu);
if (!cond_resched())
cpu_relax();
__cgroup_rstat_lock(cgrp, cpu);
+ } else {
+ if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+ __cgroup_rstat_unlock(cgrp, cpu);
+ if (!cond_resched())
+ cpu_relax();
+ __cgroup_rstat_lock(cgrp, cpu);
+ }
}
}
}
disabled (fs.magic_tunable=0; sum 903610):
69869
30473
64670
30544
30950
36445
36235
29920
51179
35760
33424
42426
30177
31211
44974
34450
37871
72642
33016
29518
31800
35730
30326

enabled (fs.magic_tunable=1; sum 1006833):
63507
50113
36280
35148
63329
41232
51265
41341
41418
42824
35200
35550
54684
41597
55325
36120
48675
41179
39339
35794
38826
37411
40676