Re: CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch

From: Paul E. McKenney
Date: Tue Oct 01 2024 - 07:37:36 EST


On Tue, Oct 01, 2024 at 11:07:49AM +0200, Michal Hocko wrote:
> On Tue 01-10-24 10:22:51, Greg KH wrote:
> > On Tue, Oct 01, 2024 at 10:02:02AM +0200, Petr Mladek wrote:
> > > On Fri 2024-09-27 14:40:07, Greg Kroah-Hartman wrote:
> > > > Description
> > > > ===========
> > > >
> > > > In the Linux kernel, the following vulnerability has been resolved:
> > > >
> > > > workqueue: Improve scalability of workqueue watchdog touch
> > > >
> > > > On a ~2000 CPU powerpc system, hard lockups have been observed in the
> > > > workqueue code when stop_machine runs (in this case due to CPU hotplug).
> > >
> > > I believe that this does not qualify as a security vulnerability.
> > > Any hotplug is a privileged operation.
> >
> > Really? I see that happen on many embedded systems all the time, they
> > add/remove CPUs while the device runs/sleeps constantly.
>
> This is a powerpc specific fix. Other architectures are not affected.
>
> > Now to be fair, right now an "embedded system" usually doesn't have 2000
> > cpus, but what's wrong with marking this real bugfix as a vulnerability
> > resolution?
>
> Yes, this is indeed a scalability fix for huge systems with a lot of
> CPUs. Anybody owning those systems was simply not able to use memory
> hotplug without seeing those hard lockup messages. The system is not
> really locked up. The progress of the hotplug operation is just utterly
> slow. Calling this a vulnerability is a stretch IMHO.
>
> The only potential attack vector is to have a machine configured to
> panic on hard lockups on those huge ppc systems and allow cpu hotremove
> to an adversary, which in itself seems like a very bad idea anyway
> because availability of such a system is then effectively compromised.

If the attacker can do CPU hotplug, then an effective (though admittedly
non-CVE) attack is to simply offline all but one of the CPUs. Whatever
that system was doing with its 2,000 CPUs, it is unlikely to be doing
with only one of them.

And taking Michal's point further, if the load rises high enough, you
might get various types of lockups, and the system might be configured
to panic. For example, the load resulting from dumping 2,000 CPUs' worth of
workload onto a single CPU could easily starve RCU's grace-period kthread
for the 21 seconds required to result in an RCU CPU stall warning. And if
the system has sysctl_panic_on_rcu_stall set, then the system will panic.
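
For anyone wanting to check whether a given box is exposed to that
stall-then-panic sequence, here is a minimal sketch in Python. It assumes
the usual Linux locations for the kernel.panic_on_rcu_stall sysctl and the
rcupdate.rcu_cpu_stall_timeout module parameter; either file may be absent
depending on kernel version and configuration, and the helper names
(read_int, stall_would_panic) are made up for illustration.

```python
from pathlib import Path

# Assumed procfs/sysfs locations; either may be missing on a given kernel.
PANIC_ON_RCU_STALL = Path("/proc/sys/kernel/panic_on_rcu_stall")
RCU_STALL_TIMEOUT = Path("/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout")


def read_int(path, default=None):
    """Return the integer stored in `path`, or `default` if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return default


def stall_would_panic(panic_path=PANIC_ON_RCU_STALL):
    """True if the kernel is set to panic when an RCU CPU stall is reported."""
    return read_int(panic_path, default=0) != 0


if __name__ == "__main__":
    timeout = read_int(RCU_STALL_TIMEOUT)
    print("RCU CPU stall timeout (s):", timeout if timeout is not None else "unknown")
    print("panic on RCU stall:", stall_would_panic())
```

This only reports configuration. It says nothing about whether a stall
will actually occur under a given load.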

But this really should be considered to be expected behavior given
privileged abuse rather than a vulnerability, correct?

Thanx, Paul