Re: [PATCH 5/5] async, kmod: warn on synchronous request_module() from async workers

From: Saravana Kannan
Date: Thu Jun 23 2022 - 01:26:34 EST


On Fri, Jan 18, 2013 at 2:12 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> From 4983f3b51e18d008956dd113e0ea2f252774cefc Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@xxxxxxxxxx>
> Date: Fri, 18 Jan 2013 14:05:57 -0800
>
> Synchronous request_module() from an async worker can lead to deadlock
> because module init path may invoke async_synchronize_full(). The
> async worker waits for request_module() to complete and the module
> loading waits for the async task to finish. This bug happened in the
> block layer because of default elevator auto-loading.
>
> Block layer has been updated not to do default elevator auto-loading
> and it has been decided to disallow synchronous request_module() from
> async workers.
>
> Trigger WARN_ON_ONCE() on synchronous request_module() from async
> workers.
>
> For more details, please refer to the following thread.
>
> http://thread.gmane.org/gmane.linux.kernel/1420814
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Reported-by: Alex Riesen <raa.lkml@xxxxxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
> ---
> kernel/kmod.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/kernel/kmod.c b/kernel/kmod.c
> index 1c317e3..ecd42b4 100644
> --- a/kernel/kmod.c
> +++ b/kernel/kmod.c
> @@ -38,6 +38,7 @@
> #include <linux/suspend.h>
> #include <linux/rwsem.h>
> #include <linux/ptrace.h>
> +#include <linux/async.h>
> #include <asm/uaccess.h>
>
> #include <trace/events/module.h>
> @@ -130,6 +131,14 @@ int __request_module(bool wait, const char *fmt, ...)
> #define MAX_KMOD_CONCURRENT 50 /* Completely arbitrary value - KAO */
> static int kmod_loop_msg;
>
> + /*
> + * We don't allow synchronous module loading from async. Module
> + * init may invoke async_synchronize_full() which will end up
> + * waiting for this task which already is waiting for the module
> + * loading to complete, leading to a deadlock.
> + */
> + WARN_ON_ONCE(wait && current_is_async());
> +

If a builtin driver does async probing before module loading is even
possible, this check produces a spurious warning splat.

Here's a report by Marek [1]. I took a stab at suppressing the warning,
at least for drivers that do async probing before the initcalls are
done, but then I got confused [2] about the earliest point during boot
at which request_module() can succeed. If someone can clear up my
confusion, I can try to avoid this warning for request_module() calls
made before we can load any modules. Any other ideas for making this
warning less trigger-happy about false positives?
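
To make the gating idea concrete, here is a minimal userspace sketch,
not kernel code. It assumes a hypothetical flag, modules_loadable, that
flips once request_module() can actually succeed; where such a flag
would be set is exactly the open question above. The struct and both
function names are made up for illustration, and current_is_async() is
passed in as a plain bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical boot state: true once module loading can succeed. */
struct boot_state {
	bool modules_loadable;
};

/* Mirrors the condition in the patch: wait && current_is_async(). */
bool warns_today(bool wait, bool current_is_async)
{
	return wait && current_is_async;
}

/*
 * Gated variant (an assumption, not an existing kernel check): stay
 * quiet while no module could be loaded anyway, since a request that
 * cannot start a module init cannot end up in async_synchronize_full().
 */
bool warns_when_gated(const struct boot_state *bs, bool wait,
		      bool current_is_async)
{
	assert(bs != NULL);
	return bs->modules_loadable && wait && current_is_async;
}
```

With modules_loadable false, a synchronous request from an async worker
warns under the current condition but not under the gated one; once the
flag is true the two conditions agree.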

[1] - https://lore.kernel.org/lkml/d5796286-ec24-511a-5910-5673f8ea8b10@xxxxxxxxxxx/
[2] - https://lore.kernel.org/lkml/CAGETcx-MHwex8tHLB1d71MAP01-3OPDZSNCUBb3iT+BtrugJmQ@xxxxxxxxxxxxxx/

Another question (pardon my ignorance): do we need the
async_synchronize_full() at the end of do_init_module(), or can we
limit it to a smaller domain? Looking at the history, I see that this
call was added by Linus in commit d6de2c80e9d7 ("async: Fix
module loading async-work regression"). Are we doing the blanket
async_synchronize_full() only because we are not keeping proper track
of the async domains? If so, what if we had an async domain per
module, and every use of async_schedule*() triggered by that module
were tied to the module's async domain? Then we'd only need to sync
that module's domain and we wouldn't hit any deadlock issues.
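
A toy userspace model of why a narrower sync would avoid the deadlock
(a sketch under simplifying assumptions, not kernel code). A domain is
reduced to a count of unfinished entries, and all toy_* names are
hypothetical. In the kernel, the corresponding calls would be the
existing async_schedule_domain() and async_synchronize_full_domain().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an async domain: a count of unfinished entries. */
struct toy_domain {
	int pending;
};

/* Analogue of scheduling one async entry into a domain. */
void toy_schedule(struct toy_domain *d)
{
	assert(d != NULL);
	d->pending++;
}

/*
 * A synchronize over the n domains in `waited` is issued on behalf of
 * a worker whose own unfinished entry lives in `caller` (the worker is
 * blocked in a synchronous request_module()). If `caller` is among the
 * waited domains, the sync waits on an entry that cannot finish until
 * the sync itself returns, so it never completes.
 */
bool toy_sync_would_deadlock(struct toy_domain *const *waited, size_t n,
			     const struct toy_domain *caller)
{
	assert(waited != NULL && caller != NULL);
	for (size_t i = 0; i < n; i++) {
		if (waited[i] == caller && caller->pending > 0)
			return true;
	}
	return false;
}
```

Waiting on every domain includes the caller's own entry and deadlocks in
this model; waiting only on the new module's domain does not.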

Grepping for async_schedule*() calls, I see only about 30 instances.
At a glance, most of them fall into these cases:
1. There is a device/driver from which we can find the related module
and tie the async_schedule*() call to that module's domain.
2. Direct async_schedule*() calls from module_init() -- we can tie
these directly to the module's domain.
3. Other?

Is this idea worth pursuing? Or am I going in a completely wrong direction?

Btw, I did see Linus's suggestion in one of the emails in this thread
(?) about just doing a full synchronize on device open. That seems
like it would work too, but I'm afraid to touch any file open code
path because I expect it to be a hot path.

-Saravana