Re: [PATCH v2] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

From: Waiman Long
Date: Tue Mar 30 2021 - 13:44:30 EST


On 3/30/21 6:42 AM, Daniel Thompson wrote:
On Mon, Mar 29, 2021 at 03:32:35PM -0400, Waiman Long wrote:
The handling of sysrq keys should normally be done in a user context
except when MAGIC_SYSRQ_SERIAL is set and the magic sequence is typed
in a serial console.
This seems to be a poor summary of the typical calling context for
handle_sysrq() except in the trivial case of using
/proc/sysrq-trigger.

For example, on my system the backtrace when I do sysrq-h on a USB
keyboard shows us running from a softirq handler with interrupts
locked. Note also that the interrupt lock is present even on systems that
handle keyboard input from a kthread, due to the interrupt lock in
report_input_key().
I will reword this part of the patch. I don't have a deep understanding of how the different ways of keyboard input work. Thanks for showing me that there are other ways of getting keyboard input.

Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken
with interrupts disabled for the whole duration of the calls to print_*_stats()
and print_rq(), which could last for quite some time if the information dump
happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic, like
<snip>

The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rest of the printk() calls
don't need serialization from sched_debug_lock.

Calling printk() with interrupts disabled can still be
problematic. Allocating a stack buffer of PATH_MAX bytes is not
feasible. So a compromise solution is used where a small stack buffer
is allocated for the pathname. If the actual pathname is short enough, it
is copied to the stack buffer, with sched_debug_lock released afterward,
before printk(). Otherwise, the global cgroup_path[] buffer will be
used with sched_debug_lock held until after printk().
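
The compromise described above can be sketched in userspace roughly as follows. This is only an illustration, not the code from the patch: a pthread mutex stands in for sched_debug_lock (the kernel lock is taken with interrupts disabled), fprintf() stands in for printk(), and the names SHORT_PATH_LEN, shared_path, debug_lock and print_group_path() are all hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

/* Hypothetical size of the small on-stack buffer. */
#define SHORT_PATH_LEN 128

/* Userspace stand-ins for sched_debug_lock and the global path buffer. */
static pthread_mutex_t debug_lock = PTHREAD_MUTEX_INITIALIZER;
static char shared_path[PATH_MAX];

/*
 * Print "<prefix><path>". Returns 1 if the lock was dropped before
 * printing (short path), 0 if the print ran with the lock held.
 */
static int print_group_path(FILE *out, const char *prefix, const char *src)
{
	char small[SHORT_PATH_LEN];
	size_t len;

	assert(out != NULL && prefix != NULL && src != NULL);

	pthread_mutex_lock(&debug_lock);

	/* Stand-in for the kernel writing the path into the shared buffer. */
	snprintf(shared_path, sizeof(shared_path), "%s", src);
	len = strlen(shared_path);

	if (len < sizeof(small)) {
		/* Short path: copy it out, drop the lock, then print. */
		memcpy(small, shared_path, len + 1);
		pthread_mutex_unlock(&debug_lock);
		fprintf(out, "%s%s\n", prefix, small);
		return 1;
	}

	/* Long path: print from the shared buffer with the lock held. */
	fprintf(out, "%s%s\n", prefix, shared_path);
	pthread_mutex_unlock(&debug_lock);
	return 0;
}
```

In this sketch the slow output call runs outside the lock for any path shorter than SHORT_PATH_LEN, so only unusually long paths keep the lock held across the print.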
Does this actually fix the problem in any circumstance except when the
sysrq is triggered using /proc/sysrq-trigger?

I have a reproducer that generates a hard lockup panic when there are multiple instances of sysrq-t via /proc/sysrq-trigger. This is probably less of a problem on the console, as I don't think we can do multiple simultaneous sysrq-t there. Anyway, my goal is to limit the amount of time that irqs are disabled. Doing a printk() can take a while, depending on whether there is contention in the underlying locks or resources. Even if I limit the critical sections to just those printk() calls that output the cgroup path, I can still cause the panic.

The approach used by this patch should minimize the chance of a panic happening. However, if there are many tasks with very long cgroup paths, I suppose that a panic may still happen under some extreme conditions. So I won't say this will completely fix the problem until the printk() rework that makes printk() work more like printk_deferred() is merged.

Cheers,
Longman