[PATCH 0/1] sched: Use printk_deferred() in __schedule()

From: Stephen Brennan
Date: Wed Feb 09 2022 - 17:14:08 EST


Hello all,

I have a customer whose systems are encountering a deadlock due to a
warning in the perf_cgroup_switch function, which is called during
__schedule:

#5 [ffffa82fd8fd3700] native_queued_spin_lock_slowpath at ffffffff8ef0ef2b
#6 [ffffa82fd8fd3708] queued_spin_lock_slowpath at ffffffff8ef0eafb
#7 [ffffa82fd8fd3718] _raw_spin_lock at ffffffff8f805a93
#8 [ffffa82fd8fd3728] try_to_wake_up at ffffffff8eee1f8e
#9 [ffffa82fd8fd3790] wake_up_process at ffffffff8eee2425
#10 [ffffa82fd8fd37a0] wake_up_worker at ffffffff8eec67b8
#11 [ffffa82fd8fd37b0] insert_work at ffffffff8eec7997
#12 [ffffa82fd8fd37f0] __queue_work at ffffffff8eec7af2
#13 [ffffa82fd8fd3838] queue_work_on at ffffffff8eec82f8
#14 [ffffa82fd8fd3850] drm_fb_helper_dirty at ffffffffc1d3cea5 [drm_kms_helper]
#15 [ffffa82fd8fd3890] drm_fb_helper_sys_imageblit at ffffffffc1d3d010 [drm_kms_helper]
#16 [ffffa82fd8fd38b0] soft_cursor at ffffffff8f316906
#17 [ffffa82fd8fd3900] bit_cursor at ffffffff8f3166da
#18 [ffffa82fd8fd39d0] fbcon_cursor at ffffffff8f311e3f
#19 [ffffa82fd8fd3a18] hide_cursor at ffffffff8f3c6213
#20 [ffffa82fd8fd3a30] vt_console_print at ffffffff8f3c7e42
#21 [ffffa82fd8fd3a98] console_unlock at ffffffff8ef1d4ac
#22 [ffffa82fd8fd3ae0] vprintk_emit at ffffffff8ef1ecf5
#23 [ffffa82fd8fd3b40] vprintk_default at ffffffff8ef1ee29
#24 [ffffa82fd8fd3b50] vprintk_func at ffffffff8ef1f8a4
#25 [ffffa82fd8fd3b70] printk at ffffffff8ef1bf58
#26 [ffffa82fd8fd3bd0] report_bug at ffffffff8f7dbbc5
#27 [ffffa82fd8fd3c10] fixup_bug at ffffffff8ee3592c
#28 [ffffa82fd8fd3c30] do_error_trap at ffffffff8ee35989
#29 [ffffa82fd8fd3c78] do_invalid_op at ffffffff8ee35ac2
#30 [ffffa82fd8fd3ca0] invalid_op at ffffffff8fa013b7
[exception RIP: perf_cgroup_switch+330]
RIP: ffffffff8f01c26a RSP: ffffa82fd8fd3d58 RFLAGS: 00010086
RAX: 0000000000000002 RBX: ffff99013f7b2708 RCX: 000000000000038f
RDX: 00000000000105a0 RSI: 0000000000000000 RDI: 000000000000038f
RBP: ffffa82fd8fd3d98 R8: 000000000000016e R9: 0000000000000039
R10: 0000000000000000 R11: 0000000000000000 R12: ffff99013f7a8010
R13: 0000000000000000 R14: 0000000000000000 R15: ffff99013f7b2700
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#31 [ffffa82fd8fd3d50] perf_cgroup_switch at ffffffff8f01c223
#32 [ffffa82fd8fd3da0] __perf_event_task_sched_in at ffffffff8f01c538
#33 [ffffa82fd8fd3e18] finish_task_switch at ffffffff8eedd460
#34 [ffffa82fd8fd3e50] __schedule at ffffffff8f800794
#35 [ffffa82fd8fd3ea0] schedule_idle at ffffffff8f800ebc
#36 [ffffa82fd8fd3eb8] do_idle at ffffffff8eee794d
#37 [ffffa82fd8fd3f10] cpu_startup_entry at ffffffff8eee7bfd
#38 [ffffa82fd8fd3f28] start_secondary at ffffffff8ee6f6e9
#39 [ffffa82fd8fd3f50] secondary_startup_64 at ffffffff8ee000e6

While certainly, the perf bug which results in this warning should be
fixed, and I am working on tracking it down, it's also clear that this
sort of warning should not deadlock the kernel. My proposed change is to
enter a "printk_deferred" section at the beginning of __schedule as the
rq lock is taken, and then end it as the lock is dropped. This style of
fix has already been proposed before [1], but was NAK'd.

I think enough has changed that we should reconsider this change.

First, one reason for the NAK was that a better printk implementation
was forthcoming, which would only write to atomic-safe console
drivers ("sane" drivers as they were called in that discussion),
delegating the rest to a kthread. This implementation is still not
upstream, but in the meantime we've had several kernel releases with
this sort of issue present.

Second, another reason for the NAK was that WARN statements require
immediate printing, since the system may not be stable enough to print
the warning later. Though this is true, warnings are not meant to be
fatal, and we should hope that the system can limp along long enough to
exit a single context switch. Even if that's not possible, the printk
infrastructure contains several ways to allow the buffered messages to
reach the console even if the system crashes later:

1) the log buffer can be written to several places via kmsg_dump()
2) panic tries very hard to empty the log buffer to console drivers
3) crash dumps contain the full log buffer data, and debug tools can
readily reconstruct these records

Third, since the posting of the previous series, commit
57116ce17b04 ("workqueue: fix state-dump console deadlock") was
introduced with no objection, to address the same deadlock in a
workqueue function.

Of course, I'm not bound to this particular solution -- I'd be glad to
discuss alternatives and other ideas. Thanks for reading and
considering!

[1]: https://lore.kernel.org/lkml/20200927161130.33172-1-zhouchengming@xxxxxxxxxxxxx/


Stephen Brennan (1):
sched: Use printk_deferred during __schedule()

kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

--
2.30.2