Re: [syzbot] [ntfs3?] INFO: task hung in do_user_addr_fault (3)
From: Linus Torvalds
Date: Sun Jan 01 2023 - 20:55:05 EST
On Sun, Jan 1, 2023 at 4:54 PM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
> > ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
Something holds the ni_lock, so this process has blocked on it, and
this all happens inside mmap():
> > attr_data_get_block+0x4a6/0x2e40 fs/ntfs3/attrib.c:919
> > ntfs_file_mmap+0x4cc/0x780 fs/ntfs3/file.c:296
> > call_mmap include/linux/fs.h:2191 [inline]
> > mmap_region+0x1022/0x1e60 mm/mmap.c:2621
> > do_mmap+0x8d9/0xf30 mm/mmap.c:1411
> > vm_mmap_pgoff+0x1e5/0x2f0 mm/util.c:520
so this code holds the mmap_lock for writing, which is why all those
other processes are hung on getting it for reading for page faults
etc.
End result: ignore all those page fault processes. This mmap_lock ->
ni_lock dependency explains them all, and they aren't the cause.
> > folio_wait_bit_common+0x8ca/0x1390 mm/filemap.c:1297
> > folio_lock include/linux/pagemap.h:938 [inline]
> > truncate_inode_pages_range+0xc8d/0x1650 mm/truncate.c:421
> > truncate_inode_pages mm/truncate.c:448 [inline]
> > truncate_pagecache mm/truncate.c:743 [inline]
> > truncate_setsize+0xcb/0xf0 mm/truncate.c:768
> > ntfs_truncate fs/ntfs3/file.c:395 [inline]
.. and this thread is waiting on the page lock (well, folio, same
thing), and the IO apparently isn't completing.
And that seems to be because this one is busy reading the page, and
blocked on that same ni_lock:
> > task:syz-executor394 state:D stack:24072 pid:6048 ppid:5125 flags:0x00004004
> > Call Trace:
> > <TASK>
> > ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
> > attr_data_get_block+0x4a6/0x2e40 fs/ntfs3/attrib.c:919
> > ntfs_get_block_vbo+0x374/0xd20 fs/ntfs3/inode.c:573
> > do_mpage_readpage+0x98b/0x1bb0 fs/mpage.c:208
> > mpage_read_folio+0x103/0x1d0 fs/mpage.c:379
But our debugging output looks a bit bogus:
> > Showing all locks held in the system:
> > 3 locks held by syz-executor394/5214:
> > #0: ffff88801ee04460 (sb_writers#9){.+.+}-{0:0}, at: do_sendfile+0x61c/0xfd0 fs/read_write.c:1254
> > #1: ffff888073930ca0 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_invalidate_lock_shared include/linux/fs.h:811 [inline]
> > #1: ffff888073930ca0 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_update_page+0x72/0x550 mm/filemap.c:2478
> > #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
> > #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: attr_data_get_block+0x4a6/0x2e40 fs/ntfs3/attrib.c:919
It's showing 394/5214 as "holding" the lock, even though it's just
waiting for it - it's the one doing the readpage.
I think it's just because lockdep ends up adding the lock to the queue
before it actually gets the lock, so anybody pending will be shown as
"holding" it.
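A toy model of that guess, purely for illustration. This is not lockdep
code; ToyLockTracker, acquire() and report() are made-up names, and the
only thing it assumes is that the bookkeeping entry is recorded before
the task actually owns the lock:

```python
class ToyLockTracker:
    """Toy model: a lock is listed for a task before the task owns it."""

    def __init__(self):
        self.owner = {}   # lock name -> task that really owns it
        self.held = {}    # task -> lock names a "locks held" report would print

    def acquire(self, task, lock):
        # Bookkeeping happens first (the assumption being illustrated).
        self.held.setdefault(task, []).append(lock)
        if lock in self.owner:
            # The task would block here; its entry stays in the report.
            return False
        self.owner[lock] = task
        return True

    def report(self):
        return {task: list(locks) for task, locks in self.held.items()}


tracker = ToyLockTracker()
got_truncate = tracker.acquire("truncate", "ni_lock")   # really owns it
got_readpage = tracker.acquire("readpage", "ni_lock")   # only waiting
```

Here report() lists ni_lock for both tasks even though only "truncate"
owns it, which would match the confusing output above.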
And the 5221 one:
> > 2 locks held by syz-executor394/5221:
> > #0: ffff88802c7bc758 (&mm->mmap_lock){++++}-{3:3}, at: mmap_write_lock_killable include/linux/mmap_lock.h:87 [inline]
> > #0: ffff88802c7bc758 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x18f/0x2f0 mm/util.c:518
> > #1: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
> > #1: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: attr_data_get_block+0x4a6/0x2e40 fs/ntfs3/attrib.c:919
is that mmap() one, which is waiting for the ni_lock too (while
holding the mmap_lock, which is why the page faulters are all blocked).
But 5222 is the interesting one: it is the truncate one, it's waiting
for the page lock, and it really does seem to hold the ni_lock:
> > 3 locks held by syz-executor394/5222:
> > #0: ffff88801ee04460 (sb_writers#9){.+.+}-{0:0}, at: mnt_want_write+0x3b/0x80 fs/namespace.c:508
> > #1: ffff888073930b00 (&sb->s_type->i_mutex_key#14){+.+.}-{3:3}, at: inode_lock include/linux/fs.h:756 [inline]
> > #1: ffff888073930b00 (&sb->s_type->i_mutex_key#14){+.+.}-{3:3}, at: do_truncate+0x205/0x300 fs/open.c:63
> > #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
> > #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ntfs_truncate fs/ntfs3/file.c:393 [inline]
> > #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ntfs3_setattr+0x596/0xca0 fs/ntfs3/file.c:696
So I think that we have:
- ntfs_truncate() gets the ni_lock (fs/ntfs3/file.c:393)
- it then - while holding that lock - calls (on line 395):
truncate_setsize ->
truncate_pagecache ->
truncate_inode_pages ->
truncate_inode_pages_range ->
folio_lock
but that deadlocks on another process that wants to read that page,
and that needs ni_lock to do so.
So yes, it does look like an ntfs3 deadlock involving ni_lock.
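The same reading of the traces, written out as a small wait-for graph.
This is only a sketch: the task labels are role names rather than pids,
find_cycle() is a hypothetical helper, and that the readpage task owns
the folio lock is an assumption taken from "busy reading the page"
above, not something the lock dump shows directly:

```python
# Who owns which lock, as read from the traces above.
owners = {
    "ni_lock": "truncate",
    "mmap_lock": "mmap",
    "folio lock": "readpage",   # assumption: the reader holds the page lock
}

# Which lock each blocked task is waiting for.
waits_for = {
    "truncate": "folio lock",
    "readpage": "ni_lock",
    "mmap": "ni_lock",
    "page faulters": "mmap_lock",
}


def find_cycle(start):
    """Follow task -> wanted lock -> owner until a task repeats or the chain ends."""
    path = []
    task = start
    while task in waits_for and task not in path:
        path.append(task)
        task = owners.get(waits_for[task])
    if task in path:
        return path[path.index(task):]   # the tasks that form the cycle
    return None


cycle = find_cycle("page faulters")
```

Starting from the page faulters, the chain runs through mmap into a
truncate <-> readpage cycle. The faulters and the mmap() task are stuck
behind that cycle but are not part of it.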
Anyway, the above is just me trying to make sense of the call traces
and trying to cut out all the noise. I might have mis-read something.
Linus