Re: [PATCH] kgdb: Fix broken handling of printk() in NMI context
From: Daniel Thompson
Date: Tue May 12 2020 - 10:25:42 EST
On Tue, May 12, 2020 at 02:18:34PM +0530, Sumit Garg wrote:
> Since commit 42a0bb3f7138 ("printk/nmi: generic solution for safe printk
> in NMI"), kgdb entry in NMI context defaults to use safe NMI printk()
I didn't see the author on Cc: nor any of the folks whose hands it
passed through. It would definitely be good to involve them in this
discussion.
> which involves CPU specific buffers and deferred printk() until exit from
> NMI context.
>
> But kgdb being a stop-the-world debugger, we don't want to defer printk()
> especially backtrace on corresponding CPUs. So instead switch to normal
> printk() mode in kgdb_cpu_enter() if entry is in NMI context.
So, firstly I should *definitely* take a mea culpa for not shouting
about this at the time (I was on Cc:... twice). Only thing I can say
confidently is that the test suite didn't yell about this and so I
didn't look at this as closely as I should have done (and that it
didn't yell is mostly because I'm still building out the test suite
coverage).
Anyhow...
This feels a little like we are smearing the printk() interception logic
across the kernel in ways that make things hard to read. If we accepted
this patch we would then have the new NMI interception logic, the old kdb
interception logic, and some hacks in the kgdb trap handler to defang the
NMI interception logic and force the kdb logic to kick in.
Wouldn't it be better to migrate the kdb interception logic up a couple of
levels so that it continues to function even when we are in NMI printk
mode? That way *all* the printk() interception code would end up in
one place.
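
To make the ordering concrete, here is a minimal userspace sketch, not
kernel code. It assumes a single decision point that tests the debugger
trap before the NMI check, so kdb still wins when the CPU is in NMI
context. Every name in it (cpu_state, route_message, the SINK_* values)
is hypothetical and only models the idea; none of them are real kernel
symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: where a printk() message ends up. */
enum sink {
	SINK_KDB,        /* handed straight to the debugger */
	SINK_NMI_BUFFER, /* stored per-CPU and flushed after the NMI */
	SINK_CONSOLE     /* normal printk() path */
};

/* Hypothetical per-CPU state, standing in for the real kernel flags. */
struct cpu_state {
	bool in_nmi;       /* models being in NMI printk mode */
	bool kdb_trapping; /* models the debugger having claimed printk() */
};

/*
 * Single decision point: the debugger check comes first, so being in
 * NMI context no longer hides messages from a stopped-world debugger.
 */
enum sink route_message(const struct cpu_state *cpu)
{
	assert(cpu != NULL);

	if (cpu->kdb_trapping)
		return SINK_KDB;
	if (cpu->in_nmi)
		return SINK_NMI_BUFFER;
	return SINK_CONSOLE;
}

/* Human-readable name for a sink, handy when logging the decision. */
const char *sink_name(enum sink s)
{
	switch (s) {
	case SINK_KDB:
		return "kdb";
	case SINK_NMI_BUFFER:
		return "nmi-buffer";
	default:
		return "console";
	}
}
```

With this ordering the trap handler would not need to leave and re-enter
NMI printk mode around the debugger session, because the routing
decision already accounts for the debugger.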
Finally, some description of how to provoke the problem would be
useful... that sort of thing helps me to grow the test suite coverage.
Daniel.
>
> Signed-off-by: Sumit Garg <sumit.garg@xxxxxxxxxx>
> ---
>
> Similar change was posted earlier specific to arm64 here [1]. But after
> discussions it emerged out that this broken handling of printk() in NMI
> context should be a common problem that is relevant to other archs as well.
> So fix this handling in kgdb_cpu_enter() as there can be multiple entry
> points to kgdb in NMI context.
>
> [1] https://lkml.org/lkml/2020/4/24/328
>
> kernel/debug/debug_core.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> index 2b7c9b6..ab2933f 100644
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -567,6 +567,15 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> kgdb_info[ks->cpu].enter_kgdb++;
> kgdb_info[ks->cpu].exception_state |= exception_state;
>
> + /*
> + * kgdb entry in NMI context defaults to use safe NMI printk() which
> + * involves CPU specific buffers and deferred printk() until exit from
> + * NMI context. But kgdb being a stop-the-world debugger, we don't want
> + * to defer printk(). So instead switch to normal printk() mode here.
> + */
> + if (in_nmi())
> + printk_nmi_exit();
> +
> if (exception_state == DCPU_WANT_MASTER)
> atomic_inc(&masters_in_kgdb);
> else
> @@ -635,6 +644,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> atomic_dec(&slaves_in_kgdb);
> dbg_touch_watchdogs();
> local_irq_restore(flags);
> + if (in_nmi())
> + printk_nmi_enter();
> return 0;
> }
> cpu_relax();
> @@ -772,6 +783,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> raw_spin_unlock(&dbg_master_lock);
> dbg_touch_watchdogs();
> local_irq_restore(flags);
> + if (in_nmi())
> + printk_nmi_enter();
>
> return kgdb_info[cpu].ret_state;
> }
> --
> 2.7.4
>