Re: [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu)
From: Mateusz Guzik
Date: Wed Mar 26 2025 - 19:57:51 EST
On Wed, Mar 19, 2025 at 6:26 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> ...
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
>
> Lock/unlock when the cacheline is already on the cpu is pretty cheap and on
> modern x86 cpus at least irq on/off are also only a few cycles, so you
> probably wouldn't be able to tell the difference.
>
This does not line up with what I have seen on x86-64 uarchs up to this
point, including Sapphire Rapids, which I think still counts as
reasonably new.
Most notably, smp_mb() shows up as a factor in my profiles; on x86-64 it
compiles to a lock-prefixed add of $0 to a location on the stack -- very
much a cache-hot line.
However, the cost of an atomic op has a lot to do with the state
accumulated around it -- the more atomics in short succession, the
smaller the per-op impact. That is to say, a given piece of code may be
slow enough that adding one lock-prefixed instruction does not add
notable overhead.
So as not to just handwave, here is the overhead of __legitimize_mnt()
while performing a path lookup for access() on my test box. smp_mb()
is the stock variant. irq() merely toggles interrupts; it does not
provide any correctness for the routine, but it lets me see the cost if
something of that sort got planted there for some reason. Finally, the
last variant is what I expect the final routine to look like -- merely a
branch to account for bad mounts (taking advantage of synchronize_rcu()
on unmount).
smp_mb: 3.31%
irq: 1.44%
nothing: 0.63%
These would be higher if it were not for the fact that memory
allocation (for the path buffer) is dog slow.
And indeed I get over a 3% speedup in access() throughput by not
suffering that smp_mb(). [Note I don't have a viable patch for this;
rather, I whacked the barrier for benchmarking purposes to see whether
pursuing proper removal is worth it.]
I'll note though that profiling open()+close() shows virtually no
difference (less than 0.5%) -- but that's not an argument for atomics
being cheap; rather, it is an argument for open() in particular being
slow (which it is; I'm working on it, though).
If you want I can ship you the test case, diffs and kernel config to
reproduce on your own.
--
Mateusz Guzik <mjguzik gmail.com>