Re:[PATCH V11] kdb: Fix the deadlock issue in KDB debugging.

From: Liuye
Date: Wed Apr 17 2024 - 08:32:11 EST


>From: LiuYe <liu.yeC@xxxxxxx>
>
>Currently, if CONFIG_KDB_KEYBOARD is enabled, then kgdboc will
>attempt to use schedule_work() to provoke a keyboard reset when
>transitioning out of the debugger and back to normal operation.
>This can cause deadlock because schedule_work() is not NMI-safe.
>
>The stack trace below shows an example of the problem. In this
>case the master cpu is not running from NMI but it has parked
>the slave CPUs using an NMI and the parked CPUs is holding
>spinlocks needed by schedule_work().
>
>Example:
>BUG: spinlock lockup suspected on CPU#0. owner_cpu: 1
>CPU1: Call Trace:
>__schedule
>schedule
>schedule_hrtimeout_range_clock
>mutex_unlock
>ep_scan_ready_list
>schedule_hrtimeout_range
>ep_poll
>wake_up_q
>SyS_epoll_wait
>entry_SYSCALL_64_fastpath
>
>CPU0: Call Trace:
>dump_stack
>spin_dump
>do_raw_spin_lock
>_raw_spin_lock
>try_to_wake_up
>wake_up_process
>insert_work
>__queue_work
>queue_work_on
>kgdboc_post_exp_handler
>kgdb_cpu_enter
>kgdb_handle_exception
>__kgdb_notify
>kgdb_notify
>notifier_call_chain
>notify_die
>do_int3
>int3
>
>We fix the problem by using irq_work to call schedule_work()
>instead of calling it directly. This is because we cannot
>resynchronize the keyboard state from the hardirq context
>provided by irq_work. This must be done from the task context
>in order to call the input subsystem.
>
>Therefore, we have to defer the work twice. First, safely
>switch from the debug trap context (similar to NMI) to the
>hardirq, and then switch from the hardirq to the system work queue.
>
>Signed-off-by: LiuYe <liu.yeC@xxxxxxx>
>Co-developed-by: Daniel Thompson <daniel.thompson@xxxxxxxxxx>
>Signed-off-by: Daniel Thompson <daniel.thompson@xxxxxxxxxx>
>
>---
>V10 -> V11: Revert to V9
>V9 -> V10 : Add Signed-off-by of Greg KH and Andy Shevchenko, Acked-by of Daniel Thompson
>V8 -> V9: Modify call trace format and move irq_work.h before module.h
>V7 -> V8: Update the description information and comments in the code.
> : Submit this patch based on version linux-6.9-rc2.
>V6 -> V7: Add comments in the code.
>V5 -> V6: Replace with a more professional and accurate answer.
>V4 -> V5: Answer why schedule another work in the irq_work and not do the job directly.
>V3 -> V4: Add changelogs
>V2 -> V3: Add description information
>V1 -> V2: using irq_work to solve this properly.
>---
>---
> drivers/tty/serial/kgdboc.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
>index 7ce7bb164..32410fec7 100644
>--- a/drivers/tty/serial/kgdboc.c
>+++ b/drivers/tty/serial/kgdboc.c
>@@ -19,6 +19,7 @@
> #include <linux/console.h>
> #include <linux/vt_kern.h>
> #include <linux/input.h>
>+#include <linux/irq_work.h>
> #include <linux/module.h>
> #include <linux/platform_device.h>
> #include <linux/serial_core.h>
>@@ -82,6 +83,19 @@ static struct input_handler kgdboc_reset_handler = {
>
> static DEFINE_MUTEX(kgdboc_reset_mutex);
>
>+/*
>+ * This code ensures that the keyboard state, which is changed during kdb
>+ * execution, is resynchronized when we leave the debug trap. The resync
>+ * logic calls into the input subsystem to force a reset. The calls into
>+ * the input subsystem must be executed from normal task context.
>+ *
>+ * We need to trigger the resync from the debug trap, which executes in an
>+ * NMI (or similar) context. To make it safe to call into the input
>+ * subsystem we end up having use two deferred execution techniques.
>+ * Firstly, we use irq_work, which is NMI-safe, to provoke a callback from
>+ * hardirq context. Then, from the hardirq callback we use the system
>+ * workqueue to provoke the callback that actually performs the resync.
>+ */
> static void kgdboc_restore_input_helper(struct work_struct *dummy)
> {
> /*
>@@ -99,10 +113,17 @@ static void kgdboc_restore_input_helper(struct work_struct *dummy)
>
> static DECLARE_WORK(kgdboc_restore_input_work, kgdboc_restore_input_helper);
>
>+static void kgdboc_queue_restore_input_helper(struct irq_work *unused)
>+{
>+ schedule_work(&kgdboc_restore_input_work);
>+}
>+
>+static DEFINE_IRQ_WORK(kgdboc_restore_input_irq_work, kgdboc_queue_restore_input_helper);
>+
> static void kgdboc_restore_input(void)
> {
> if (likely(system_state == SYSTEM_RUNNING))
>- schedule_work(&kgdboc_restore_input_work);
>+ irq_work_queue(&kgdboc_restore_input_irq_work);
> }
>
> static int kgdboc_register_kbd(char **cptr)
>@@ -133,6 +154,7 @@ static void kgdboc_unregister_kbd(void)
> i--;
> }
> }
>+ irq_work_sync(&kgdboc_restore_input_irq_work);
> flush_work(&kgdboc_restore_input_work);
> }
> #else /* ! CONFIG_KDB_KEYBOARD */
>--
>2.25.1

What is the current status of PATCH V11? Are there any additional modifications needed?