Re: [PATCH v3 4/4] kdb: Don't back trace on a cpu that didn't round up

From: Doug Anderson
Date: Wed Nov 07 2018 - 13:44:49 EST


Hi,

On Wed, Nov 7, 2018 at 4:30 AM Daniel Thompson
<daniel.thompson@xxxxxxxxxx> wrote:
>
> On Tue, Nov 06, 2018 at 05:00:28PM -0800, Douglas Anderson wrote:
> > If you have a CPU that fails to round up and then run 'btc' you'll end
> > up crashing in kdb becaue we dereferenced NULL. Let's add a check.
> > It's wise to also set the task to NULL when leaving the debugger so
> > that if we fail to round up on a later entry into the debugger we
> > won't backtrace a stale task.
> >
> > Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> > ---
> >
> > Changes in v3:
> > - New for v3.
> >
> > Changes in v2: None
> >
> > kernel/debug/debug_core.c | 1 +
> > kernel/debug/kdb/kdb_bt.c | 11 ++++++++++-
> > 2 files changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> > index 324cba8917f1..08851077c20a 100644
> > --- a/kernel/debug/debug_core.c
> > +++ b/kernel/debug/debug_core.c
> > @@ -587,6 +587,7 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> > kgdb_info[cpu].exception_state &=
> > ~(DCPU_WANT_MASTER | DCPU_IS_SLAVE);
> > kgdb_info[cpu].enter_kgdb--;
> > + kgdb_info[cpu].task = NULL;
>
> This code isn't quite right.
>
> In particular there are two exit paths from kgdb_cpu_enter() and this
> code path is for slave exit only (and master may change the next time we
> re-enter kgdb).

Ah, good point! Right that the master could be the the one that later
fails to round up.


> It also looks very odd to have an unconditional clear
> next to a decrement that takes account of debugger re-entrancy.

Would it look less odd if I write it like this?

if (kgdb_info[cpu].enter_kgdb == 0)
kgdb_info[cpu].task = NULL;

In general, though, "enter_kgdb" looks to be more of a boolean value
then a true counter. The only place that modifies "enter_kgdb" is
kgdb_cpu_enter(). It increments it upon entry to the function and
decrements it upon exit. All the callers to kgdb_cpu_enter() confirm
that "enter_kgdb" is 0 before calling it. I'll further note that
kgdb_cpu_enter() unconditionally clobbers things like
"kgdb_info[cpu].debuggerinfo" upon entry.

I could add a patch that converts "enter_kgdb" to a true boolean if
that helps, though at this point I'm getting pretty far afield of my
original purpose of trying to fix the problem lockdep yelled about.
:-P ...and I probably need to get back to my day job.

I'd also note that the other bits of code around here look pretty
unconditional to me, but my kdb/kgdb knowledge is not very strong (as
you've seen) so I could be wrong. We are unconditionally clearing
bits from the "exception_state", unconditionally turning tracing on,
unconditionally calling correct_hw_break(), etc. ...hmmmm, I suppose
it could look less odd if I moved my unconditional bit to be up a
little higher. I could put it up above to the unconditional clearing
of "exception_state" ;-P


> Note also that there is similar code in kdb_debugger.c (search for "zero
> out any offline cpu data") which should be tidied up if we decide to do
> the same clean up in a different way.
>
> I'll leave it to you whether to fix the existing code or add new code
> here but if you do add it in kgdb_cpu_enter() it must cover both return
> paths, include debuggerinfo as well, and kdb_debugger.c needs to be
> tidied up.

OK, so I _think_ the answer here is to just get rid of the code from
kdb_debugger.c and rely on my new code.

Specifically I'd rather CPUs clear their own "kgdb_info" so we don't
need to worry about races where a CPU rounds up really late at some
time when we're not expecting it. ...and once CPUs always clear its
own "kgdb_info" when exiting we don't need any special case code to
have the master try to clear state of offline CPUs. Thanks for
pointing me to this code to get rid of.

---

So the tl;dr:

1. Though I'm 99% certain that "enter_kgdb" is truly a bool that is
either 0 or 1, I won't write a patch to change this myself in the
hopes that I can wrap up this patch series. I'll add a note in my CL
description indicating that I believe things are safe because
"enter_kgdb" is really a bool.

2. I will add clearing of "debuggerinfo".

3. I will add the same clearing of "task" and "debuggerinfo" to when
the master exits. I won't try to unify the code in my patch and leave
that for a future person working on this code.

4. To make it look less odd, I'll move my clearing to above the
"kgdb_info[cpu].exception_state &=" line. It doesn't really matter
and why not make it look less odd?

5. I will remove the clearing of "debuggerinfo" and "task" from
kdb_stub() since it will no longer be needed.


...and again, thank you for all your timely and awesome reviews and
advice here. It is certainly appreciated.


-Doug