Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

From: Thomas Gleixner
Date: Wed Mar 27 2024 - 16:42:28 EST


On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
> On 22/03/24 14:22, Frederic Weisbecker wrote:
>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> Now, on top of the above, there's one more thing worth noting:
> cpu_up_down_serialize_trainwrecks()
>
> This just flushes the cpuset work, so after that the sched_domain topology
> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> around this vs. lock ordering aren't very reassuring however, so I need to
> look into this more.

commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
the change log.

TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
cpu_hotplug_lock in the hotplug path for whatever silly reason. So
trying to flush the work from within the CPU hotplug lock write-held
region is guaranteed to deadlock.
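
The ABBA nature of that ordering problem can be illustrated with a
tiny, hypothetical lock-order checker in the spirit of lockdep. All
names here are made up for the sketch; this is a generic illustration
of the inversion, not kernel code:

```c
#include <assert.h>
#include <string.h>

/* Minimal lock-order checker: once the order A -> B has been
 * observed, any attempt to take A while holding B is flagged as a
 * potential ABBA deadlock. Here A = cpuset_mutex and
 * B = cpu_hotplug_lock, matching the order described above. */
#define MAX_DEPS 16

struct dep { const char *first, *second; };
static struct dep deps[MAX_DEPS];
static int ndeps;

static void record_order(const char *first, const char *second)
{
    if (ndeps < MAX_DEPS) {
        deps[ndeps].first = first;
        deps[ndeps].second = second;
        ndeps++;
    }
}

/* Returns 1 if acquiring 'second' while holding 'first' inverts a
 * previously recorded order, i.e. would allow two tasks to deadlock
 * against each other. */
static int order_violated(const char *first, const char *second)
{
    for (int i = 0; i < ndeps; i++)
        if (!strcmp(deps[i].first, second) &&
            !strcmp(deps[i].second, first))
            return 1;
    return 0;
}
```

Flushing the cpuset work while write-holding the hotplug lock is
exactly the inverted acquisition the checker flags.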

That's why all of this got deferred to a work queue. The flush at the
end of the hotplug code, after dropping the hotplug lock, is there to
prevent user space from seeing the CPU hotplug uevent before the work
is done. But of course between taking the CPU offline and the flush,
the kernel-internal state is inconsistent.
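
A minimal sketch of that deferral, with hypothetical names (the "work"
here is a single pending callback, not the kernel workqueue API): the
hotplug path cannot run the cpuset update while the hotplug lock is
write-held, so it queues the update and flushes it only after the lock
is dropped.

```c
#include <assert.h>

struct fake_work {
    void (*fn)(void);
    int pending;
};

static int cpu_online_state = 1;   /* the kernel's view of the CPU   */
static int cpuset_view = 1;        /* cpuset's (deferred) view of it */

/* The deferred update: brings cpuset's view in sync. */
static void cpuset_update_work(void)
{
    cpuset_view = cpu_online_state;
}

/* Queue the callback instead of running it under the hotplug lock. */
static void queue_fake_work(struct fake_work *w, void (*fn)(void))
{
    w->fn = fn;
    w->pending = 1;
}

/* Run the pending callback; called only after the lock is dropped. */
static void flush_fake_work(struct fake_work *w)
{
    if (w->pending) {
        w->fn();
        w->pending = 0;
    }
}
```

Between the queue and the flush, cpuset_view still claims the CPU is
online even though it is gone: that window is the inconsistency
described above.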

There was an attempt to make the CPU hotplug path synchronous in commit
a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
reverted with commit 2b729fe7f3 for the very wrong reason:

https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@xxxxxx/T/#u

(cpu_hotplug_lock){++++}-{0:0}:
lock_acquire+0xe4/0x25c
cpus_read_lock+0x50/0x154
static_key_slow_inc+0x18/0x30
mem_cgroup_css_alloc+0x824/0x8b0
cgroup_apply_control_enable+0x1d8/0x56c
cgroup_apply_control+0x40/0x344
cgroup_subtree_control_write+0x664/0x69c
cgroup_file_write+0x130/0x2e8
kernfs_fop_write+0x228/0x32c
__vfs_write+0x84/0x1d8
vfs_write+0x13c/0x1b4
ksys_write+0xb0/0x120

Instead of the revert this should have been fixed by doing:

+ cpus_read_lock();
mutex_lock();
mem_cgroup_css_alloc();
- static_key_slow_inc();
+ static_key_slow_inc_cpuslocked();
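
The pattern behind that diff can be sketched as follows, with made-up
names standing in for the real kernel APIs: functions that take
cpu_hotplug_lock internally grow a *_cpuslocked variant for callers
that already hold it, so the caller can establish the lock order once
at the outermost level instead of acquiring the lock inside out.

```c
#include <assert.h>

static int fake_hotplug_lock_readers;  /* stand-in for cpu_hotplug_lock */

static void fake_cpus_read_lock(void)   { fake_hotplug_lock_readers++; }
static void fake_cpus_read_unlock(void) { fake_hotplug_lock_readers--; }

static int key_count;                   /* stand-in for a static key */

/* Variant for callers that already hold cpus_read_lock(); the real
 * kernel code would assert that with lockdep_assert_cpus_held(). */
static void fake_static_key_inc_cpuslocked(void)
{
    key_count++;
}

/* Plain variant: acquires the lock itself, for callers that do not
 * hold it. Calling this while already inside a write-held hotplug
 * lock region is what deadlocks in the real kernel. */
static void fake_static_key_inc(void)
{
    fake_cpus_read_lock();
    fake_static_key_inc_cpuslocked();
    fake_cpus_read_unlock();
}

/* The caller as reordered in the diff above: take cpus_read_lock()
 * first, then use the _cpuslocked variant instead of the plain one. */
static void fixed_caller(void)
{
    fake_cpus_read_lock();
    /* mutex_lock(); mem_cgroup_css_alloc(); ... */
    fake_static_key_inc_cpuslocked();
    fake_cpus_read_unlock();
}
```

Both paths end up with the same semantics; the _cpuslocked variant
just lets the caller fix the lock order.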

Sorry that I did not notice this back then because I was too focused
on fixing that uevent nonsense in a halfway sane way.

After that revert cpuset locking became completely insane.

cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
advanced part of that insanity. Seriously, this commit is just a
tasteless and disgusting demonstration of how to paper over a problem
which is not fully understood.

After staring at it some more (including the history which led up to
these insanities) I really think that the CPU hotplug path can be made
truly synchronous and the memory hotplug path should just get the lock
ordering correct.

Can we please fix this for real and get all of this insanity out of the
way including the nonsensical comments in the cpuset code:

* Call with cpuset_mutex held. Takes cpus_read_lock().

...

lockdep_assert_cpus_held();
lockdep_assert_held(&cpuset_mutex);

Oh well...

Thanks,

tglx