Re: kmod: add a sanity check on module loading

From: Luis R. Rodriguez
Date: Mon Jan 09 2017 - 15:27:37 EST


On Fri, Jan 06, 2017 at 04:53:54PM -0500, Jessica Yu wrote:
> +++ Luis R. Rodriguez [06/01/17 21:36 +0100]:
> > On Tue, Jan 03, 2017 at 10:34:53AM +1030, Rusty Russell wrote:
> > > "Luis R. Rodriguez" <mcgrof@xxxxxxxxxx> writes:
> > > > Right, out of ~350 request_module() calls (not included try requests)
> > > > only ~46 check the return value. Hence a validation check, and come to
> > > > think of it, *this* was the issue that originally had me believing
> > > > that in some places we might end up in a null deref --if those open
> > > > coded request_module() calls assume the driver is loaded there could
> > > > be many places where a NULL is inevitable.
> > >
> > > Yes, assuming success == module loade is simply a bug. I wrote
> > > try_then_request_module() to attempt to encapsulate the correct logic
> > > into a single place; maybe we need other helpers to cover (most of?) the
> > > remaining cases?
> >
> > I see...
> >
> > OK so indeed we have a few possible changes to kernel given the above:
> >
> > a) Add SmPL rule to nag about incorrect uses of request_module() which
> > never check for the return value, and fix 86% of calls (304 call sites)
> > which are buggy
> >
> > b) Add a new API call, perhaps request_module_assert() which would
> > BUG_ON() if the requested module didn't load, and change the callers
> > which do not check for the return value to this.
>
> It is probably not a good idea to panic/BUG() because a requested
> module didn't load. IMO callers should already be accounting for the
> fact that request_module() doesn't provide these guarantees. I haven't
> looked yet to see if the majority of these callers actually do the the
> responsible thing, though.

It seems proper form is hard to vet for, and the return value actually
doesn't really give us much useful information.

> > Make request_module() do the assert and changing all proper callers of
> > request_module() to a new API call which *does* let you check for the
> > return value is another option but tasteless.
> >
> > b) seems to be what you allude to, and while it may seem also of bad taste,
> > in practice it may be hard to get callers to properly check for the return
> > value. I actually just favor a) even though its more work.
> >
> > > > Granted, I agree they
> > > > should be fixed, we could add a grammar rule to start nagging at
> > > > driver developers for started, but it does beg the question also of
> > > > what a tightly knit validation for modprobe might look like, and hence
> > > > this patch and now the completed not-yet-posted alias work.
> > >
> > > I really think aliases-in-kernel is too heavy a hammer, but a warning
> > > when modprobe "succeeds" and the module still isn't found would be
> > > a Good Thing.
> >
> > OK -- such a warning can really only happen if we had alias support though.
> > So one option is to add this and alias parsing support as a debug option.
>
> Hm, I see what you're saying..
>
> To clarify the problem (if anyone was confused, as I was..): we can
> verify a module is loaded by using find_module_all() and looking at
> its state. However, find_module_all() operates on real module names,
> and we can't verify a module has successfully loaded if all we have is
> the name of the alias (eg, "fs-*" aliases in get_fs_type), because we
> have no alias->real_module_name mappings in the kernel.

Yup!

> However, in Rusty's sample get_fs_type WARN() code, we indirectly
> validated request_module()'s work by verifying that the
> file_system_type has actually registered, which is what should happen
> if a filesystem module successfully loads. So in this case, the caller
> (get_fs_type) indirectly checks if the service it requested is now
> available, which is what I *thought* callers were supposed to do in
> the first place (and we didn't need the help of aliases to do that).
> I think the main question we have to answer is, should the burden of
> validation be on the callers, or on request_module? I am currently
> leaning towards the former, but I'm still thinking.

Validation check on the caller makes sense *but* what makes this a bit hard
is as I have found, request_module() *call* can fail for some reasons
other than the module not being available on the system -- races, and inherent
design decisions (kmod concurrent). In my patch series I address kmod concurrent
limit to be more graceful, the clutch I mentioned is another addition
to help make failures be less aggressive.

Because some issues can creep up with request_module() -- checking its return
value seems desirable -- but as Rusty notes its currently only seen as
an optimization to check for the return value. Its not really clear what
the best path forward is. Here are a bit of my current thoughts:

o The debug check Rusty suggested seems fair for upstream get_fs_type() in
retropsect.

o Although callers should validate a module was loaded and that should
in theory suffice to cover most API failures on request_module(),
once an issue does creep up its rather hard to confirm where an
issue came from exactly, adding some debug code to aid review on
issues seems fair and useful.

o We should stress test the module loader further with more tests and
fix any other pending issues

If you agree with this the validation code I proposed would just be folded
under a debug Kconfig entry, that would also mean the alias stuff is kept
only under that debug Kconfig.

The kmod stress test driver and small fixes would be sent as the first series.
I'd split off the validation stuff into a separate series to make it clearer.

Not sure on why the return value for request_module() is kept then still though.

Luis