Re: [PATCH v3] kernel/module: Reschedule while waiting for modules to finish loading

From: Barret Rhoden
Date: Fri May 10 2019 - 14:41:57 EST


Hi -

On 5/2/19 1:46 PM, Prarit Bhargava wrote:
On 5/2/19 8:41 AM, Prarit Bhargava wrote:
On 5/2/19 5:48 AM, Jessica Yu wrote:
+++ Prarit Bhargava [01/05/19 17:26 -0400]:
On 4/30/19 6:22 PM, Prarit Bhargava wrote:
On a s390 z14 LAR with 2 cpus about stalls about 3% of the time while
loading the s390_trng.ko module.

Add a reschedule point to the loop that waits for modules to complete
loading.

v3: cleanup Fixes line.

Jessica, even with this additional patch there appears to be some other issues
in the module code that are causing significant delays in boot up on large
systems.

Is this limited to only s390? Or are you seeing this on other arches
as well? And is it limited to specific modules (like s390_trng)?

Other arches. We're seeing a hang on a new 192 CPU x86_64 box & the
acpi_cpufreq driver. The system is MUCH faster than any other x86_64 box I've
seen and that's likely why I'm seeing a problem.


FWIW, the logic in the original patch is correct. It's just that there's, as
Heiko discovered, some poor scheduling, etc., that is impacting the module
loading code after these changes.

I am really curious to see what these performance regressions look
like :/ Please update us when you find out more.


I sent Heiko a private v4 RFC last night with this patch (sorry for the
cut-and-paste)

diff --git a/kernel/module.c b/kernel/module.c
index 1c429d8d2d74..a4ef8628f26f 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3568,12 +3568,12 @@ static int add_unformed_module(struct module *mod)
mutex_lock(&module_mutex);
old = find_module_all(mod->name, strlen(mod->name), true);
if (old != NULL) {
- if (old->state == MODULE_STATE_COMING
- || old->state == MODULE_STATE_UNFORMED) {
+ if (old->state != MODULE_STATE_LIVE) {
/* Wait in case it fails to load. */
mutex_unlock(&module_mutex);
- err = wait_event_interruptible(module_wq,
- finished_loading(mod->name));
+ err = wait_event_interruptible_timeout(module_wq,
+ finished_loading(mod->name),
+ HZ / 10000);
if (err)
goto out_unlocked;
goto again;

The original module dependency race issue is fixed simply by changing the
conditional to checking !MODULE_STATE_LIVE. This, unfortunately, exposed some
other problems within the code.

The module_wq is only run when a module fails to load. It's possible that
the time between the module's failed init() call and running module_wq
(kernel/module.c:3455) takes a while. Any thread entering the
add_unformed_module() code while the old module is unloading is put to sleep
waiting for the module_wq to execute.

On the 192 thread box I have noticed that the acpi_cpufreq module attempts
to load 392 times (that is not a typo and I am going to try to figure that
problem out after this one). This means 191 cpus are put to sleep, and one
cpu is executing the acpi_cpufreq module unload which is executing
do_init_module() and is now at

fail_free_freeinit:
kfree(freeinit);
fail:
/* Try to protect us from buggy refcounters. */
mod->state = MODULE_STATE_GOING;
synchronize_rcu();
module_put(mod);
blocking_notifier_call_chain(&module_notify_list,
MODULE_STATE_GOING, mod);
klp_module_going(mod);
ftrace_release_mod(mod);
free_module(mod);
wake_up_all(&module_wq);
return ret;
}

The 191 threads cannot schedule and the system is effectively stuck. It *does*
eventually free itself but in some cases it takes minutes to do so.

A simple fix for this is to, as I've done above, to add a timeout so that
the threads can be scheduled which allows other processes to run.

After taking a much closer look the above patch appears to be correct. I am not
seeing any boot failures associated with it anywhere. I would like to hear from
Heiko as to whether or not this works for him though.

I think I found the issue here. The original patch changed the state check to "not LIVE", which made it include GOING. However, the wake condition was not changed. That could lead to a livelock, which I experienced.

I have a patch that fixes it, which I'll send out shortly. With my change, I think you won't need any of the scheduler functions (cond_resched(), wait timeouts, etc). Those were probably just papering over the issue.

Barret