circular locking splat in fs/proc/vmcore.c

From: Sven Schnelle
Date: Mon Feb 14 2022 - 10:22:38 EST


Hi David,

I've seen the following lockdep splat in CI on one of our systems:

[ 25.964518] kdump[727]: saving vmcore-dmesg.txt complete
[ 26.049877]
[ 26.049879] ======================================================
[ 26.049881] WARNING: possible circular locking dependency detected
[ 26.049883] 5.17.0-20220211.rc3.git2.2636bbc7cadf.300.fc35.s390x+debug #1 Tainted: G W
[ 26.049885] ------------------------------------------------------
[ 26.049886] makedumpfile/730 is trying to acquire lock:
[ 26.049887] 0000000001a25720 (vmcore_cb_rwsem){.+.+}-{3:3}, at: mmap_vmcore+0x148/0x458
[ 26.049896]
[ 26.049896] but task is already holding lock:
[ 26.049897] 0000000013539d28 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x8e/0x170
[ 26.049904]
[ 26.049904] which lock already depends on the new lock.
[ 26.049904]
[ 26.049906]
[ 26.049906] the existing dependency chain (in reverse order) is:
[ 26.049907]
[ 26.049907] -> #1 (&mm->mmap_lock){++++}-{3:3}:
[ 26.049910] __lock_acquire+0x604/0xbd8
[ 26.049914] lock_acquire.part.0+0xe2/0x250
[ 26.049916] lock_acquire+0xb0/0x200
[ 26.049918] __might_fault+0x70/0xa0
[ 26.049921] copy_to_user_real+0x8e/0xf8
[ 26.049925] copy_oldmem_page+0xc0/0x158
[ 26.049930] read_from_oldmem.part.0+0x14c/0x1b8
[ 26.049932] __read_vmcore+0x116/0x1f8
[ 26.049933] proc_reg_read+0x9a/0xf0
[ 26.049938] vfs_read+0x94/0x1a8
[ 25.973256] kdump[729]: saving vmcore
[ 26.049941] __s390x_sys_pread64+0x90/0xc8
[ 26.049958] __do_syscall+0x1da/0x208
[ 26.049963] system_call+0x82/0xb0
[ 26.049967]
[ 26.049967] -> #0 (vmcore_cb_rwsem){.+.+}-{3:3}:
[ 26.049971] check_prev_add+0xe0/0xed8
[ 26.049972] validate_chain+0x736/0xb20
[ 26.049974] __lock_acquire+0x604/0xbd8
[ 26.049976] lock_acquire.part.0+0xe2/0x250
[ 26.049978] lock_acquire+0xb0/0x200
[ 26.049980] down_read+0x5e/0x180
[ 26.049982] mmap_vmcore+0x148/0x458
[ 26.049983] proc_reg_mmap+0x8e/0xe0
[ 26.049985] mmap_region+0x412/0x668
[ 26.049988] do_mmap+0x3ec/0x4d0
[ 26.049989] vm_mmap_pgoff+0xd4/0x170
[ 26.049992] ksys_mmap_pgoff+0x1d8/0x228
[ 26.049994] __s390x_sys_old_mmap+0xa4/0xb8
[ 26.049995] __do_syscall+0x1da/0x208
[ 26.049997] system_call+0x82/0xb0
[ 26.049999]
[ 26.049999] other info that might help us debug this:
[ 26.049999]
[ 26.050001] Possible unsafe locking scenario:
[ 26.050001]
[ 26.050002]        CPU0                    CPU1
[ 26.050003]        ----                    ----
[ 26.050004]   lock(&mm->mmap_lock);
[ 26.050006]                                lock(vmcore_cb_rwsem);
[ 26.050008]                                lock(&mm->mmap_lock);
[ 26.050010]   lock(vmcore_cb_rwsem);
[ 26.050012]
[ 26.050012] *** DEADLOCK ***
[ 26.050012]
[ 26.050013] 1 lock held by makedumpfile/730:
[ 26.050015] #0: 0000000013539d28 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x8e/0x170

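I think this was introduced with cc5f2704c934 ("proc/vmcore: convert
oldmem_pfn_is_ram callback to more generic vmcore callbacks").

One fix might be to take vmcore_cb_rwsem inside the loop, only around
the pfn_is_ram() call, so that it is no longer held across
copy_oldmem_page(), which may fault and take mmap_lock. Taking and
releasing the lock for every page would likely slow things down, though.
The diff would look like this (UNTESTED):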

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 702754dd1daf..4acd91507d21 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -133,6 +133,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 	unsigned long pfn, offset;
 	size_t nr_bytes;
 	ssize_t read = 0, tmp;
+	int is_ram;

 	if (!count)
 		return 0;
@@ -140,7 +141,6 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 	offset = (unsigned long)(*ppos % PAGE_SIZE);
 	pfn = (unsigned long)(*ppos / PAGE_SIZE);

-	down_read(&vmcore_cb_rwsem);
 	do {
 		if (count > (PAGE_SIZE - offset))
 			nr_bytes = PAGE_SIZE - offset;
@@ -148,7 +148,10 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 			nr_bytes = count;

 		/* If pfn is not ram, return zeros for sparse dump files */
-		if (!pfn_is_ram(pfn)) {
+		down_read(&vmcore_cb_rwsem);
+		is_ram = pfn_is_ram(pfn);
+		up_read(&vmcore_cb_rwsem);
+		if (!is_ram) {
 			tmp = 0;
 			if (!userbuf)
 				memset(buf, 0, nr_bytes);
@@ -164,10 +167,8 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 				tmp = copy_oldmem_page(pfn, buf, nr_bytes,
 						       offset, userbuf);
 		}
-		if (tmp < 0) {
-			up_read(&vmcore_cb_rwsem);
+		if (tmp < 0)
 			return tmp;
-		}

 		*ppos += nr_bytes;
 		count -= nr_bytes;
@@ -177,7 +178,6 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 		offset = 0;
 	} while (count);

-	up_read(&vmcore_cb_rwsem);
 	return read;
 }

I think we could also switch the list to an RCU-protected list, but I
don't know the code well. Any opinions on how to fix this?

Thanks
Sven