Re: [PATCH v2] block: genhd: don't call probe function with major_names_lock held

From: Tetsuo Handa
Date: Sat Jun 19 2021 - 04:47:36 EST


On 2021/06/19 15:44, Greg KH wrote:
>> Note that regardless of this patch, it is up to probe function to
>> serialize module's __init function and probe function in that module
>> by using e.g. a mutex. This patch simply makes sure that module's __exit
>> function won't be called when probe function is about to be called.
>
> module init functions ARE serialized.

The __init functions between module foo and module bar are serialized.
But the __init function for module foo and the probe function for module
foo are not serialized.

> The module owner should not matter here.

The __exit functions between module foo and module bar are serialized.
But the __exit function for module bar and the probe function for module
bar are not serialized.

You can observe a deadlock via the following steps.

(1) Build loop.ko with CONFIG_BLK_DEV_LOOP_MIN_COUNT=0 and
a delay injection patch shown below.

----------
diff --git a/block/genhd.c b/block/genhd.c
index 9f8cb7beaad1..fe0360dc8c5d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -680,6 +680,11 @@ void blk_request_module(dev_t devt)
mutex_lock(&major_names_lock);
for (n = &major_names[major_to_index(major)]; *n; n = &(*n)->next) {
if ((*n)->major == major && (*n)->probe) {
+ if (!strcmp(current->comm, "bash")) {
+ pr_info("sleep injection start\n");
+ schedule_timeout_killable(10 * HZ); // Call "rmmod" here.
+ pr_info("sleep injection end\n");
+ }
(*n)->probe(devt);
mutex_unlock(&major_names_lock);
return;
----------

(2) Run the following commands from bash shell.

# modprobe loop
# mknod /dev/loop0 b 7 0
# exec 100<>/dev/loop0 & sleep 1; rmmod loop

(3) See dmesg output.

----------
[ 32.260467][ T2873] loop: module loaded
[ 32.289961][ T2880] sleep injection start
[ 42.484039][ T2880] sleep injection end
[ 42.484218][ T2880]
[ 42.484248][ T2880] ======================================================
[ 42.484381][ T2880] WARNING: possible circular locking dependency detected
[ 42.484455][ T2880] 5.13.0-rc6+ #867 Not tainted
[ 42.484508][ T2880] ------------------------------------------------------
[ 42.484579][ T2880] bash/2880 is trying to acquire lock:
[ 42.484638][ T2880] ffffffffc075b468 (loop_ctl_mutex){+.+.}-{3:3}, at: loop_probe+0x44/0x90 [loop]
[ 42.484760][ T2880]
[ 42.484760][ T2880] but task is already holding lock:
[ 42.484836][ T2880] ffffffff873ffe28 (major_names_lock){+.+.}-{3:3}, at: blk_request_module+0x1f/0x100
[ 42.484950][ T2880]
[ 42.484950][ T2880] which lock already depends on the new lock.
[ 42.484950][ T2880]
[ 42.485053][ T2880]
[ 42.485053][ T2880] the existing dependency chain (in reverse order) is:
[ 42.485144][ T2880]
[ 42.485144][ T2880] -> #1 (major_names_lock){+.+.}-{3:3}:
[ 42.485230][ T2880] lock_acquire+0xb3/0x380
[ 42.485292][ T2880] __mutex_lock+0x89/0x8f0
[ 42.485350][ T2880] mutex_lock_nested+0x16/0x20
[ 42.485410][ T2880] unregister_blkdev+0x38/0xb0
[ 42.485469][ T2880] loop_exit+0x44/0xd84 [loop]
[ 42.485534][ T2880] __x64_sys_delete_module+0x135/0x210
[ 42.485602][ T2880] do_syscall_64+0x36/0x70
[ 42.485660][ T2880] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 42.485733][ T2880]
[ 42.485733][ T2880] -> #0 (loop_ctl_mutex){+.+.}-{3:3}:
[ 42.485817][ T2880] check_prevs_add+0x16a/0x1040
[ 42.487091][ T2880] __lock_acquire+0x118b/0x1580
[ 42.488427][ T2880] lock_acquire+0xb3/0x380
[ 42.489701][ T2880] __mutex_lock+0x89/0x8f0
[ 42.490960][ T2880] mutex_lock_nested+0x16/0x20
[ 42.492374][ T2880] loop_probe+0x44/0x90 [loop]
[ 42.493756][ T2880] blk_request_module+0xa3/0x100
[ 42.495207][ T2880] blkdev_get_no_open+0x93/0xc0
[ 42.496516][ T2880] blkdev_get_by_dev+0x56/0x200
[ 42.497735][ T2880] blkdev_open+0x5d/0xa0
[ 42.498919][ T2880] do_dentry_open+0x163/0x3b0
[ 42.500157][ T2880] vfs_open+0x28/0x30
[ 42.501312][ T2880] path_openat+0x7e6/0xad0
[ 42.502443][ T2880] do_filp_open+0x9f/0x110
[ 42.503592][ T2880] do_sys_openat2+0x245/0x310
[ 42.504647][ T2880] do_sys_open+0x48/0x80
[ 42.505689][ T2880] __x64_sys_open+0x1c/0x20
[ 42.506730][ T2880] do_syscall_64+0x36/0x70
[ 42.507795][ T2880] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 42.508890][ T2880]
[ 42.508890][ T2880] other info that might help us debug this:
[ 42.508890][ T2880]
[ 42.511436][ T2880] Possible unsafe locking scenario:
[ 42.511436][ T2880]
[ 42.512303][ T2880] CPU0 CPU1
[ 42.512727][ T2880] ---- ----
[ 42.513143][ T2880] lock(major_names_lock);
[ 42.513557][ T2880] lock(loop_ctl_mutex);
[ 42.513987][ T2880] lock(major_names_lock);
[ 42.514417][ T2880] lock(loop_ctl_mutex);
[ 42.514841][ T2880]
[ 42.514841][ T2880] *** DEADLOCK ***
[ 42.514841][ T2880]
[ 42.516112][ T2880] 1 lock held by bash/2880:
[ 42.516518][ T2880] #0: ffffffff873ffe28 (major_names_lock){+.+.}-{3:3}, at: blk_request_module+0x1f/0x100
[ 42.517383][ T2880]
[ 42.517383][ T2880] stack backtrace:
[ 42.518228][ T2880] CPU: 3 PID: 2880 Comm: bash Not tainted 5.13.0-rc6+ #867
[ 42.518679][ T2880] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
[ 42.519650][ T2880] Call Trace:
[ 42.520128][ T2880] dump_stack+0x76/0x95
[ 42.520608][ T2880] print_circular_bug.isra.50.cold.71+0x13c/0x141
[ 42.521105][ T2880] check_noncircular+0xfe/0x110
[ 42.521607][ T2880] check_prevs_add+0x16a/0x1040
[ 42.522106][ T2880] __lock_acquire+0x118b/0x1580
[ 42.522606][ T2880] lock_acquire+0xb3/0x380
[ 42.523244][ T2880] ? loop_probe+0x44/0x90 [loop]
[ 42.523753][ T2880] __mutex_lock+0x89/0x8f0
[ 42.524250][ T2880] ? loop_probe+0x44/0x90 [loop]
[ 42.524749][ T2880] ? loop_probe+0x44/0x90 [loop]
[ 42.525290][ T2880] ? blkdev_get_by_dev+0x200/0x200
[ 42.525790][ T2880] ? vprintk_default+0x18/0x20
[ 42.526286][ T2880] ? vprintk+0x56/0x130
[ 42.526797][ T2880] ? blkdev_get_by_dev+0x200/0x200
[ 42.527286][ T2880] mutex_lock_nested+0x16/0x20
[ 42.527768][ T2880] ? mutex_lock_nested+0x16/0x20
[ 42.528250][ T2880] loop_probe+0x44/0x90 [loop]
[ 42.528730][ T2880] blk_request_module+0xa3/0x100
[ 42.529209][ T2880] blkdev_get_no_open+0x93/0xc0
[ 42.529691][ T2880] blkdev_get_by_dev+0x56/0x200
[ 42.530176][ T2880] ? blkdev_get_by_dev+0x200/0x200
[ 42.530688][ T2880] blkdev_open+0x5d/0xa0
[ 42.531170][ T2880] do_dentry_open+0x163/0x3b0
[ 42.531656][ T2880] vfs_open+0x28/0x30
[ 42.532132][ T2880] path_openat+0x7e6/0xad0
[ 42.532612][ T2880] do_filp_open+0x9f/0x110
[ 42.533094][ T2880] ? _raw_spin_unlock+0x1d/0x30
[ 42.533582][ T2880] ? alloc_fd+0x116/0x200
[ 42.534069][ T2880] do_sys_openat2+0x245/0x310
[ 42.534566][ T2880] do_sys_open+0x48/0x80
[ 42.535015][ T2880] __x64_sys_open+0x1c/0x20
[ 42.535462][ T2880] do_syscall_64+0x36/0x70
[ 42.535913][ T2880] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 42.536364][ T2880] RIP: 0033:0x7f16f15737f0
[ 42.536797][ T2880] Code: 48 8b 15 83 76 2d 00 f7 d8 64 89 02 48 83 c8 ff c3 66 0f 1f 84 00 00 00 00 00 83 3d cd d7 2d 00 00 75 10 b8 02 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e cf 01 00 48 89 04 24
[ 42.538148][ T2880] RSP: 002b:00007ffde9e2df58 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[ 42.538671][ T2880] RAX: ffffffffffffffda RBX: 00000000021b5020 RCX: 00007f16f15737f0
[ 42.539173][ T2880] RDX: 00000000000001b6 RSI: 0000000000000042 RDI: 00000000021bd310
[ 42.539681][ T2880] RBP: 00007ffde9e2dfe0 R08: 0000000000000020 R09: 00000000021bd310
[ 42.540187][ T2880] R10: 00000000fffffff0 R11: 0000000000000246 R12: 000000000000000b
[ 42.540696][ T2880] R13: 0000000000000001 R14: 0000000000000064 R15: 0000000000000000
[ 246.772906][ T35] INFO: task bash:2880 blocked for more than 122 seconds.
[ 246.774289][ T35] Not tainted 5.13.0-rc6+ #867
[ 246.775478][ T35] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.776734][ T35] task:bash state:D stack: 0 pid: 2880 ppid: 2856 flags:0x00004000
[ 246.779550][ T35] Call Trace:
[ 246.781054][ T35] __schedule+0x271/0x9e0
[ 246.782381][ T35] schedule+0x50/0xc0
[ 246.783671][ T35] schedule_preempt_disabled+0x10/0x20
[ 246.784978][ T35] __mutex_lock+0x467/0x8f0
[ 246.786277][ T35] ? loop_probe+0x44/0x90 [loop]
[ 246.787582][ T35] ? blkdev_get_by_dev+0x200/0x200
[ 246.789093][ T35] ? blkdev_get_by_dev+0x200/0x200
[ 246.790409][ T35] mutex_lock_nested+0x16/0x20
[ 246.791700][ T35] ? mutex_lock_nested+0x16/0x20
[ 246.792992][ T35] loop_probe+0x44/0x90 [loop]
[ 246.794286][ T35] blk_request_module+0xa3/0x100
[ 246.795757][ T35] blkdev_get_no_open+0x93/0xc0
[ 246.797066][ T35] blkdev_get_by_dev+0x56/0x200
[ 246.798360][ T35] ? blkdev_get_by_dev+0x200/0x200
[ 246.799661][ T35] blkdev_open+0x5d/0xa0
[ 246.800953][ T35] do_dentry_open+0x163/0x3b0
[ 246.802258][ T35] vfs_open+0x28/0x30
[ 246.803559][ T35] path_openat+0x7e6/0xad0
[ 246.805201][ T35] do_filp_open+0x9f/0x110
[ 246.807128][ T35] ? _raw_spin_unlock+0x1d/0x30
[ 246.808461][ T35] ? alloc_fd+0x116/0x200
[ 246.809763][ T35] do_sys_openat2+0x245/0x310
[ 246.811034][ T35] do_sys_open+0x48/0x80
[ 246.812257][ T35] __x64_sys_open+0x1c/0x20
[ 246.813416][ T35] do_syscall_64+0x36/0x70
[ 246.814522][ T35] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 246.815624][ T35] RIP: 0033:0x7f16f15737f0
[ 246.816707][ T35] RSP: 002b:00007ffde9e2df58 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[ 246.817863][ T35] RAX: ffffffffffffffda RBX: 00000000021b5020 RCX: 00007f16f15737f0
[ 246.819033][ T35] RDX: 00000000000001b6 RSI: 0000000000000042 RDI: 00000000021bd310
[ 246.820059][ T35] RBP: 00007ffde9e2dfe0 R08: 0000000000000020 R09: 00000000021bd310
[ 246.820687][ T35] R10: 00000000fffffff0 R11: 0000000000000246 R12: 000000000000000b
[ 246.821292][ T35] R13: 0000000000000001 R14: 0000000000000064 R15: 0000000000000000
[ 246.821882][ T35] INFO: task rmmod:2882 blocked for more than 122 seconds.
[ 246.822456][ T35] Not tainted 5.13.0-rc6+ #867
[ 246.823011][ T35] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.823599][ T35] task:rmmod state:D stack: 0 pid: 2882 ppid: 2856 flags:0x00000000
[ 246.824799][ T35] Call Trace:
[ 246.825405][ T35] __schedule+0x271/0x9e0
[ 246.826018][ T35] schedule+0x50/0xc0
[ 246.826621][ T35] schedule_preempt_disabled+0x10/0x20
[ 246.827232][ T35] __mutex_lock+0x467/0x8f0
[ 246.827854][ T35] ? unregister_blkdev+0x38/0xb0
[ 246.828608][ T35] mutex_lock_nested+0x16/0x20
[ 246.829237][ T35] ? mutex_lock_nested+0x16/0x20
[ 246.829858][ T35] unregister_blkdev+0x38/0xb0
[ 246.830478][ T35] loop_exit+0x44/0xd84 [loop]
[ 246.831112][ T35] __x64_sys_delete_module+0x135/0x210
[ 246.831749][ T35] do_syscall_64+0x36/0x70
[ 246.832379][ T35] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 246.833019][ T35] RIP: 0033:0x7ff4990fcee7
[ 246.833655][ T35] RSP: 002b:00007ffe445ecab8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 246.834326][ T35] RAX: ffffffffffffffda RBX: 0000000000a21220 RCX: 00007ff4990fcee7
[ 246.835011][ T35] RDX: 00007ff499171a80 RSI: 0000000000000800 RDI: 0000000000a21288
[ 246.835794][ T35] RBP: 0000000000000000 R08: 00007ff4993c6060 R09: 00007ff499171a80
[ 246.836501][ T35] R10: 00007ffe445ec520 R11: 0000000000000206 R12: 00007ffe445ee840
[ 246.837206][ T35] R13: 0000000000000000 R14: 0000000000a21220 R15: 0000000000a21010
[ 246.837922][ T35] INFO: lockdep is turned off.
----------

Christoph and Desmond commented

> > And now you can all probe after it has been freed and/or the module has
> > been unloaded. The obviously correct fix is to only hold mtd_table_mutex
> > for the actually required critical section:
> >
>
> Thank you for the correction, Christoph. I hadn't thought of the
> scenario where the module is unloaded. I'll be more conscientious in the
> future.

at https://lkml.kernel.org/r/YMs3O/ce1567cf-bc94-790c-cfc0-e4e429e1a86a@xxxxxxxxx
but the fact that AB-BA deadlock is possible was forgotten. Therefore, I propose
this patch for fixing commit a160c6159d4a0cf8 ("block: add an optional probe callback
to major_names") which started calling the probe function without making sure that
the module will not go away.

>> @@ -676,14 +679,29 @@ void blk_request_module(dev_t devt)
>> {
>> unsigned int major = MAJOR(devt);
>> struct blk_major_name **n;
>> + void (*probe_fn)(dev_t devt);
>>
>> mutex_lock(&major_names_lock);
>> for (n = &major_names[major_to_index(major)]; *n; n = &(*n)->next) {
>> - if ((*n)->major == major && (*n)->probe) {
>> - (*n)->probe(devt);
>> - mutex_unlock(&major_names_lock);
>> - return;
>> - }
>> + if ((*n)->major != major || !(*n)->probe)
>> + continue;
>> + if (!try_module_get((*n)->owner))
>> + break;
>
> So you just fail? Are you sure that is ok?

After "break;", the control reaches request_module() (which waits for module
unloading to complete and then tries to load that module again).

I think failing open() for once due to racing with module unloading is acceptable
(as long as there is no leak/deadlock/oops etc.), but do we want to immediately
retry this loop after returning from request_module() ?

Also, to make sure that the __init function for module foo and the probe
function for module foo are serialized, should we also verify that
(*n)->owner && (*n)->owner->state == MODULE_STATE_LIVE (which indicates that
the __init function for that module has completed) because module_is_live()
from try_module_get() is rather weak?

----------
/* FIXME: It'd be nice to isolate modules during init, too, so they
aren't used before they (may) fail. But presently too much code
(IDE & SCSI) require entry into the module during init.*/
static inline bool module_is_live(struct module *mod)
{
return mod->state != MODULE_STATE_GOING;
}
----------

>> + /*
>> + * Assuming that unregister_blkdev() is called from module's
>> + * __exit function, a module refcount taken above allows us
>> + * to safely call probe function without major_names_lock held.
>> + */
>> + probe_fn(devt);
>> + module_put((*n)->owner);
>
> So you are trying to keep the module in memory while the probe is call,
> ok. But module removal is not an issue, you remove modules at your own
> risk. As syzbot isn't even testing this, why is this an issue?

That's a crazy comment. A module removal bug (unless forced unloading is used)
is an issue. Why not fix bugs where syzbot cannot test?