current->journal_info got nested! (was Re: [syzbot] [xfs?] [ext4?] general protection fault in jbd2__journal_start)
From: Dave Chinner
Date: Tue Jan 30 2024 - 18:37:34 EST
On Tue, Jan 30, 2024 at 06:52:21AM -0800, syzbot wrote:
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit: 861c0981648f Merge tag 'jfs-6.8-rc3' of github.com:kleikam..
> git tree: upstream
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=13ca8d97e80000
> kernel config: https://syzkaller.appspot.com/x/.config?x=b0b9993d7d6d1990
> dashboard link: https://syzkaller.appspot.com/bug?extid=cdee56dbcdf0096ef605
> compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=104393efe80000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1393b90fe80000
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/7c6cc521298d/disk-861c0981.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/6203c94955db/vmlinux-861c0981.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/17e76e12b58c/bzImage-861c0981.xz
> mounted in repro: https://storage.googleapis.com/syzbot-assets/d31d4eed2912/mount_3.gz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+cdee56dbcdf0096ef605@xxxxxxxxxxxxxxxxxxxxxxxxx
>
> general protection fault, probably for non-canonical address 0xdffffc000a8a4829: 0000 [#1] PREEMPT SMP KASAN
> KASAN: probably user-memory-access in range [0x0000000054524148-0x000000005452414f]
> CPU: 1 PID: 5065 Comm: syz-executor260 Not tainted 6.8.0-rc2-syzkaller-00031-g861c0981648f #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
> RIP: 0010:jbd2__journal_start+0x87/0x5d0 fs/jbd2/transaction.c:496
> Code: 74 63 48 8b 1b 48 85 db 74 79 48 89 d8 48 c1 e8 03 42 80 3c 28 00 74 08 48 89 df e8 63 4d 8f ff 48 8b 2b 48 89 e8 48 c1 e8 03 <42> 80 3c 28 00 74 08 48 89 ef e8 4a 4d 8f ff 4c 39 65 00 0f 85 1a
> RSP: 0018:ffffc900043265c8 EFLAGS: 00010203
> RAX: 000000000a8a4829 RBX: ffff8880205fa3a8 RCX: ffff8880235dbb80
> RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88801c1a6000
> RBP: 000000005452414e R08: 0000000000000c40 R09: 0000000000000001
^^^^^^^^
Hmmmm - TRAN. That's looks suspicious, I'll come back to that.
> R10: dffffc0000000000 R11: ffffed1003834871 R12: ffff88801c1a6000
> R13: dffffc0000000000 R14: 0000000000000c40 R15: 0000000000000002
> FS: 0000555556f90380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020020000 CR3: 0000000021fed000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> __ext4_journal_start_sb+0x215/0x5b0 fs/ext4/ext4_jbd2.c:112
> __ext4_journal_start fs/ext4/ext4_jbd2.h:326 [inline]
> ext4_dirty_inode+0x92/0x110 fs/ext4/inode.c:5969
> __mark_inode_dirty+0x305/0xda0 fs/fs-writeback.c:2452
> generic_update_time fs/inode.c:1905 [inline]
> inode_update_time fs/inode.c:1918 [inline]
> __file_update_time fs/inode.c:2106 [inline]
> file_update_time+0x39b/0x3e0 fs/inode.c:2136
> ext4_page_mkwrite+0x207/0xdf0 fs/ext4/inode.c:6090
> do_page_mkwrite+0x197/0x470 mm/memory.c:2966
> wp_page_shared mm/memory.c:3353 [inline]
> do_wp_page+0x20e3/0x4c80 mm/memory.c:3493
> handle_pte_fault mm/memory.c:5160 [inline]
> __handle_mm_fault+0x26a3/0x72b0 mm/memory.c:5285
> handle_mm_fault+0x27e/0x770 mm/memory.c:5450
> do_user_addr_fault arch/x86/mm/fault.c:1415 [inline]
> handle_page_fault arch/x86/mm/fault.c:1507 [inline]
> exc_page_fault+0x2ad/0x870 arch/x86/mm/fault.c:1563
> asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:570
EXt4 is triggering a BUG_ON:
handle_t *handle = journal_current_handle();
int err;
if (!journal)
return ERR_PTR(-EROFS);
if (handle) {
>>>>>>>>> J_ASSERT(handle->h_transaction->t_journal == journal);
handle->h_ref++;
return handle;
}
via a journal assert failure. It appears that current->journal_info
isn't what it is supposed to be. It's finding something with TRAN in
it, I think. I'll come back to this.
What syzbot is doing is creating a file on it's root filesystem and
write()ing 0x208e24b bytes (zeroes from an anonymous mmap() region,
I think) to it to initialise it's contents.
Then it mmap()s the ext4 file for 0xb36000 bytes and copies the raw
test filesystem image in the source code into it. It then creates a
memfd that it decompresses the data in the mapped ext4 file into and
creates a loop device that points to that memfd. It then mounts the
loop device and we get an XFS filesystem which doesn't appear to
contain any corruptions in it at all. It then runs a bulkstat pass
on the the XFS filesystem.
This is where it gets interesting. The user buffer that XFS
copies the inode data into points to a memory address inside the
range of the ext4 file that the filesystem image was copied to.
It does not overlap with the filesystem image.
Hence when XFS goes to copy the inodes into the user buffer, it
triggers write page faults on the ext4 backing file.
That's this part of the trace:
> RIP: 0010:rep_movs_alternative+0x4a/0x70 arch/x86/lib/copy_user_64.S:71
> Code: 75 f1 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 06 48 89 07 48 83 c6 08 48 83 c7 08 83 e9 08 74 df 83 f9 08 73 e8 eb c9 <f3> a4 c3 48 89 c8 48 c1 e9 03 83 e0 07 f3 48 a5 89 c1 85 c9 75 b3
> RSP: 0018:ffffc900043270f8 EFLAGS: 00050202
> RAX: ffffffff848cda01 RBX: 0000000020020040 RCX: 0000000000000040
> RDX: 0000000000000000 RSI: ffff8880131873b0 RDI: 0000000020020000
> RBP: 1ffff92000864f26 R08: ffff8880131873ef R09: 1ffff11002630e7d
> R10: dffffc0000000000 R11: ffffed1002630e7e R12: 00000000000000c0
> R13: dffffc0000000000 R14: 000000002001ff80 R15: ffff888013187330
> copy_user_generic arch/x86/include/asm/uaccess_64.h:112 [inline]
> raw_copy_to_user arch/x86/include/asm/uaccess_64.h:133 [inline]
> _copy_to_user+0x86/0xa0 lib/usercopy.c:41
> copy_to_user include/linux/uaccess.h:191 [inline]
> xfs_bulkstat_fmt+0x4f/0x120 fs/xfs/xfs_ioctl.c:744
> xfs_bulkstat_one_int+0xd8b/0x12e0 fs/xfs/xfs_itable.c:161
> xfs_bulkstat_iwalk+0x72/0xb0 fs/xfs/xfs_itable.c:239
> xfs_iwalk_ag_recs+0x4c3/0x820 fs/xfs/xfs_iwalk.c:220
> xfs_iwalk_run_callbacks+0x25b/0x490 fs/xfs/xfs_iwalk.c:376
> xfs_iwalk_ag+0xad6/0xbd0 fs/xfs/xfs_iwalk.c:482
> xfs_iwalk+0x360/0x6f0 fs/xfs/xfs_iwalk.c:584
> xfs_bulkstat+0x4f8/0x6c0 fs/xfs/xfs_itable.c:308
> xfs_ioc_bulkstat+0x3d0/0x450 fs/xfs/xfs_ioctl.c:867
> xfs_file_ioctl+0x6a5/0x1980 fs/xfs/xfs_ioctl.c:1994
> vfs_ioctl fs/ioctl.c:51 [inline]
> __do_sys_ioctl fs/ioctl.c:871 [inline]
> __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
> entry_SYSCALL_64_after_hwframe+0x63/0x6b
What is interesting here is this is running in an empty XFS
transaction context so that the bulkstat operation garbage collects
all the buffers it accesses without us having to explicit cleanup -
they all get released when we cancel the transaction context at the
end of the process.
But that means copy-out is running inside a transaction context, and
that means current->journal_info contains a pointer to the current
struct xfs_trans the bulkstat operation is using.
Guess what we have as the first item in a struct xfs_trans? That's
right, it's a magic number, and that magic number is:
#define XFS_TRANS_HEADER_MAGIC 0x5452414e /* TRAN */
It should be obvious what has happened now -
current->journal_info is not null, so ext4 thinks it owns the
structure attached there and panics when it finds that it isn't an
ext4 journal handle being held there.
I don't think there are any clear rules as to how filesystems can
and can't use current->journal_info. In general, a task can't jump
from one filesystem to another inside a transaction context like
this, so there's never been a serious concern about nested
current->journal_info assignments like this in the past.
XFS is doing nothing wrong - we're allowed to define transaction
contexts however we want and use current->journal_info in this way.
However, we have to acknowledge that ext4 has also done nothing
wrong by assuming current->journal_info should below to it if it is
not null. Indeed, XFS does the same thing.
The question here is what to do about this? The obvious solution is
to have save/restore semantics in the filesystem code that
sets/clears current->journal_info, and then filesystems can also do
the necessary "recursion into same filesystem" checks they need to
ensure that they aren't nesting transactions in a way that can
deadlock.
Maybe there are other options - should filesystems even be allowed to
trigger page faults when they have set current->journal_info?
What other potential avenues are there that could cause this sort of
transaction context nesting that we haven't realised exist? Does
ext4 data=jounral have problems like this in the data copy-in path?
What other filesystems allow page faults in transaction contexts?
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx