Re: [PATCH] kgdb: Resolve races during kgdb_io_register/unregister_module

From: Doug Anderson
Date: Tue Jun 30 2020 - 18:27:02 EST


Hi,

On Tue, Jun 30, 2020 at 8:05 AM Daniel Thompson
<daniel.thompson@xxxxxxxxxx> wrote:
>
> On Mon, Jun 29, 2020 at 02:03:52PM -0700, Doug Anderson wrote:
> > Hi,
> >
> > On Mon, Jun 29, 2020 at 10:15 AM Daniel Thompson
> > <daniel.thompson@xxxxxxxxxx> wrote:
> > >
> > > Currently kgdb_register_callbacks() and kgdb_unregister_callbacks()
> > > are called outside the scope of the kgdb_registration_lock. This
> > > allows them to race with each other. This could do all sorts of crazy
> > > things up to and including dbg_io_ops becoming NULL partway through the
> > > execution of the kgdb trap handler (which isn't allowed and would be
> > > fatal).
> > >
> > > Fix this by bringing the trap handler setup and teardown into the scope
> > > of the registration lock.
> > >
> > > Signed-off-by: Daniel Thompson <daniel.thompson@xxxxxxxxxx>
> > > ---
> > > kernel/debug/debug_core.c | 8 +++++---
> > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> > > index 9e5934780f41..9799f2c6dc94 100644
> > > --- a/kernel/debug/debug_core.c
> > > +++ b/kernel/debug/debug_core.c
> > > @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)
> > >
> > > dbg_io_ops = new_dbg_io_ops;
> > >
> > > - spin_unlock(&kgdb_registration_lock);
> > > -
> > > if (old_dbg_io_ops) {
> > > + spin_unlock(&kgdb_registration_lock);
> > > old_dbg_io_ops->deinit();
> > > return 0;
> > > }
> > > @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)
> > > /* Arm KGDB now. */
> > > kgdb_register_callbacks();
> > >
> > > + spin_unlock(&kgdb_registration_lock);
> >
> > From looking at code paths, I think this is illegal, isn't it? You're
> > now calling kgdb_register_callbacks() while holding a spinlock, but:
> >
> > kgdb_register_callbacks()
> > -> register_console()
> > --> console_lock()
> > ---> might_sleep()
> > ----> <boom!>
>
> Thanks.
>
> I very nearly didn't press "Send" yesterday because I was worried I was
> rushing it too much (in order to avoid forgetting it ;-) ). Should have
> listened to myself!
>
>
> > I'm a little curious about the exact race we're trying to solve.
> > Calling unregister on an IO module before register even finished seems
> > like an error on the caller, so I guess it would be calling register
> > from a 2nd thread for a different IO module while the first thread was
> > partway through unregistering? Even that seems awfully sketchy since
> > you're risking registering a 2nd IO ops while the first is still there
> > and that's illegal enough that we do a pr_err() for it (though we
> > don't crash), but let's say we're trying to solve that one.
>
> I didn't follow all the possible paths. Utlimately the
> (un)register_callbacks() functions use a flag variable without a lock
> and that can interact in lots of different ways.
>
> To be honest none are especially likely because the normal case is to
> register once during boot and never unregister. However we can trigger
> register/unregister from userspace so I think they can happen
> in parallel.

This is for kgdboc or one of the other IO modules? I do know that, at
least for kgdboc, we have the "config_mutex". I won't promise that
there are no bugs there but in the very least it should mostly prevent
a host of these types of issues. ...so I guess you'd have to in
parallel be spamming a register of a non kgdboc IO module together
with an unregister of kgdboc?


> Double unregister can lead to some especially nasty schedules...
> although they still remain pretty unlikely since we need the double
> unregister to coincide with a breakpoint:
>
>
> kgdb_unregister_callbacks() kgdb_unregister_callbacks()
> . .
> test flag .
> set flag to 0 .
> . test flag
> . spin_lock()
> *** kgdb trap *** .
> . paranoid dbg_io_ops check .
> . dbg_io_ops = NULL
> . stop other CPUs
> . try to use NULL dbg_io_ops
>
>
> I have drawn the kgdb trap in the first column because otherwise things
> get too wide but the trap could trigger on any CPU in the system and
> provoke the problem.
>
>
> >
> > Looking at it closely, I _think_ the only race in this case is if the
> > one we're trying to unregister had a deinit() function and we going to
> > replace it? If it didn't have a deinit function:
> >
> > cpu1 (unregister) cpu2 (register):
> > ----------------- ----------------------
> > kgdb_unregister_callbacks()
> > spin_lock() <got>
> > spin_lock() <blocked>
> > if (old_dbg_io_ops) <true>
> > if (has dinit) <false>
> > print error
> > spin_unlock()
> > return -EBUSY
> > <finish unregister>
> >
> > The above is fine and is the same thing that would happen if the
> > whole register function ran before the unregister even started, right?
> >
> > Also: if the unregister won the race that should also be fine.
> >
> > So really the problem is this:
> >
> > cpu1 (unregister) cpu2 (register):
> > ----------------- ----------------------
> > kgdb_unregister_callbacks()
> > spin_lock() <got>
> > spin_lock() <blocked>
> > if (old_dbg_io_ops) <true>
> > if (has dinit) <true>
> > print Replacing
> > init new IO ops
> > spin_unlock()
> > if (old_dbg_io_ops) <true>
> > finish deinit of old
> > return true
> > WARN_ON() <hits and shouts!>
> > dbg_io_ops = NULL
> > spin_unlock()
> > if (deinit) <true>
> > double-call to deinit of old
> >
> > So in this case we'll hit a WARN_ON(), incorrectly unregister the new
> > IO ops, and call deinit twice.
>
> To be honest I was simply working on "it is racy" and "there's not a
> good reason to allow that", especially as we start to develop tools to
> bring races to the surfaces someone will yell at us about it sooner or
> later ;-).
>
> Of course, implementing it correctly would have been better...

Yeah, still wouldn't hurt to try to figure out how to make it cleaner. :-)

-Doug