Re: Hibernate stuck after recent kernel/workqueue.c changes in Stable 6.6.23

From: Greg KH
Date: Wed Apr 03 2024 - 01:11:18 EST


On Wed, Apr 03, 2024 at 06:26:24AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 03.04.24 02:38, Tejun Heo wrote:
> > On Tue, Apr 02, 2024 at 10:08:11AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> >> Hi stable team, there is a report that the recent backport of
> >> 5797b1c18919cd ("workqueue: Implement system-wide nr_active enforcement
> >> for unbound workqueues") [from Tejun] to 6.6.y (as 5a70baec2294) broke
> >> hibernate for a user. 6.6.24-rc1 did not fix this problem; reverting the
> >> culprit does.
> >>
> >>> With kernel 6.6.23 hibernating usually hangs here: the display stays
> >>> on but the mouse pointer does not move and the keyboard does not work.
> >>> But SysRq REISUB does reboot. Sometimes it seems to hibernate: the
> >>> computer powers down and can be waked up and the previous display comes
> >>> visible, but it is stuck there.
> >>
> >> See https://bugzilla.kernel.org/show_bug.cgi?id=218658 for details.
> >> Note, you have to use bugzilla to reach the reporter, as I sadly[1] can
> >> not CCed them in mails like this.
> >>
> >> Side note: there is a mainline report about problems due to
> >> 5797b1c18919cd ("workqueue: Implement system-wide nr_active enforcement
> >> for unbound workqueues") as well, but it's about "nohz_full=0 prevents
> >> kernel from booting":
> >> https://bugzilla.kernel.org/show_bug.cgi?id=218665; will forward that
> >> separately to Tejun.
> >
> > Sorry about late reply
>
> No problem at all, really, thx for your reply!
>
> > but that commit is not for -stable. It does fix an
> > undesirable behavior from an earlier commit, so it has the Fixes tag but it
> > has a series of dependencies that need to be backported together
>
> Not sure if you know, but the stable team apparently recently backported a
> bunch of other workqueue commits as well; I see 10 from a quick look here:
> https://lore.kernel.org/all/20240326223803.2647796-1-sashal@xxxxxxxxxx/
>
> $ git log --grep='workqueue:' --oneline v6.6.22..stable/linux-6.6.y -- include/linux/workqueue.h kernel/workqueue*
> 7df62b8cca38aa workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()
> 5a70baec2294e8 workqueue: Implement system-wide nr_active enforcement for unbound workqueues
> b522229a56941a workqueue: Introduce struct wq_node_nr_active
> bd31fb926dfa02 workqueue: RCU protect wq->dfl_pwq and implement accessors for it
> 5f99fee6f2dea1 workqueue: Make wq_adjust_max_active() round-robin pwqs while activating
> 4023a2d9507691 workqueue: Move nr_active handling into helpers
> 6c592f0bb96815 workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()
> bad184d26a4f68 workqueue: Factor out pwq_is_empty()
> 82e098f5bed1ff workqueue: Move pwq->max_active to wq->max_active
> 43a181f8f41aca workqueue.c: Increase workqueue name length
>
> And there is also "workqueue: Shorten events_freezable_power_efficient name"
> in the queue for the next 6.6.y release:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-6.6
>
> That leads to the question: what is the better patch forward here: pick
> up what's is missing or back out?
>
> Side note: I have no idea why the stable team backported those patches
> and no option on whether that was wise, just trying to help finding the best
> solution forward from the current state of things.

The Fixes: tag triggered it, that's why they were backported.

> > which would
> > be far too invasive for -stable, thus no Cc: stable.
> >
> > I didn't know a Fixes
> > tag automatically triggers backport to -stable. I will keep that in mind for
> > future.
>
> /me fears that more and more developers due to situations like this will
> avoid Fixes: tags and wonders what consequences that might have for the
> kernel as a whole

The problem is that we have subsystems that only use Fixes: and not cc:
stable which is why we need to pick these up as well. Fixes: is great,
but if everyone were to do this "properly" then we wouldn't need to pick
these other ones up, but instead, it's about 1/3 of our volume :(

I'll gladly revert the above series if they shouldn't have been
backported to stable, but from reading them, it seemed like they were
fixing an issue that was serious and should have been added to stable,
which is why they were.

This is also why we have review cycles, so maintainers can let us know
if they want us to drop them :)

thanks,

greg k-h