Re: [syzbot] [udf?] BUG: unable to handle kernel NULL pointer dereference in __writepage

From: Jan Kara
Date: Mon Jan 23 2023 - 12:18:51 EST


On Mon 23-01-23 08:36:09, Christoph Hellwig wrote:
> I looked into this and got really confused. We should never end
> up in generic_writepages if ->writepages is set, which this patch
> obviously does.
>
> Then I took a closer look at udf, and it seems to switch a_aops around
> at run time, and it seems like we're hitting just that case, and the
> patch just seems to narrow down that window.
>
> I suspect the right fix is to remove this runtime switching of aops,
> and just do conditionals inside the methods.

Interestingly, for me it crashes like this:

[ 338.085616] general protection fault, probably for non-canonical address 0x4000000000002068: 0000 [#1] PREEMPT SMP PTI
[ 338.086959] CPU: 4 PID: 31292 Comm: syz-repro11 Not tainted 6.1.0-xen+ #705
[ 338.087941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[ 338.089470] RIP: 0010:bio_associate_blkg_from_css+0x31d/0x860
[ 338.092626] RSP: 0018:ffffc90003bb7958 EFLAGS: 00010202
[ 338.093274] RAX: 0000000000000001 RBX: 4000000000002030 RCX: 000000005d6692ad
[ 338.094149] RDX: 0000000092c5763f RSI: ffffffff81eb2e65 RDI: ffffffff81ec3d71
[ 338.095023] RBP: ffff888100c98cc0 R08: 0000000000000001 R09: 0000000000020022
[ 338.095953] R10: 0000000000000000 R11: ffff888108da2fe8 R12: ffffffff831db0e0
[ 338.096884] R13: ffff888100c98cc0 R14: ffffea0004692380 R15: ffffffff831da338
[ 338.098755] FS: 00007f9c59cc0700(0000) GS:ffff888fffd00000(0000) knlGS:0000000000000000
[ 338.098755] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 338.102194] Call Trace:
[ 338.102496] <TASK>
[ 338.102757] ? bio_associate_blkg_from_css+0x2d2/0x860
[ 338.103390] bio_associate_blkg+0x68/0x130
[ 338.103955] ? bio_associate_blkg+0x9/0x130
[ 338.104538] bio_init+0x7f/0xd0
[ 338.104926] bio_alloc_bioset+0x1f5/0x320
[ 338.106364] __mpage_writepage+0x4dc/0x780
[ 338.110045] write_cache_pages+0x113/0x470
[ 338.111635] mpage_writepages+0x5b/0xb0
[ 338.112854] do_writepages+0xd3/0x1a0
[ 338.113782] filemap_fdatawrite_wbc+0x84/0xb0
[ 338.114793] __filemap_fdatawrite_range+0x58/0x80
[ 338.115374] udf_expand_file_adinicb+0xfa/0x420 [udf]
[ 338.116109] udf_file_write_iter+0x1a9/0x1d0 [udf]

which is actually inside:
bio_associate_blkg_from_css+0x31d/0x860:
__ref_is_percpu at include/linux/percpu-refcount.h:174
(inlined by) percpu_ref_get_many at include/linux/percpu-refcount.h:204
(inlined by) percpu_ref_get at include/linux/percpu-refcount.h:222
(inlined by) blkg_get at block/blk-cgroup.h:322
(inlined by) bio_associate_blkg_from_css at block/blk-cgroup.c:1938

so bdev_get_queue(bio->bi_bdev)->root_blkg is bogus (0x4000000000002030).
Likely the request_queue is already dead. Not sure how this could be caused
by any problem in UDF.
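
For reference, a minimal user-space sketch of the arithmetic behind the
"bogus pointer" reading of the oops. It assumes 48-bit virtual addresses
(4-level paging), which the log does not state, and the helper names are
made up for illustration. It only checks that the faulting address and RBX
are non-canonical and that the fault lies 0x38 bytes past RBX, which is
consistent with a field access through the bad pointer; which field that is
cannot be told from the numbers alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses (4-level paging), so bits 63..47
 * must all be equal for an address to be canonical.
 */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* bits 63..47, 17 bits in total */

	return top == 0 || top == 0x1ffff;
}

/* Distance of the faulting access from the base pointer. */
static uint64_t fault_offset(uint64_t fault_addr, uint64_t base)
{
	return fault_addr - base;
}

/* Values copied from the oops above. */
static void check_oops_values(void)
{
	const uint64_t fault_addr = 0x4000000000002068ULL;
	const uint64_t rbx = 0x4000000000002030ULL;

	assert(!is_canonical_48(fault_addr));
	assert(!is_canonical_48(rbx));
	assert(fault_offset(fault_addr, rbx) == 0x38);
	/* An ordinary kernel pointer from the same dump (RBP) is canonical. */
	assert(is_canonical_48(0xffff888100c98cc0ULL));
}
```

Calling check_oops_values() from a small test program passes all four
checks under the stated 48-bit assumption.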

Anyway, I tend to agree with you that switching aops is hairy and we should
probably get rid of it in UDF. But this particular crash seems to be
related to something else...
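
To illustrate the "conditionals inside the methods" idea, here is a toy
user-space model, not UDF code. Every name in it (toy_inode, toy_aops,
in_icb and so on) is hypothetical. The point it sketches is that the ops
table is assigned once and never swapped, and the method itself decides
which path to take based on per-inode state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct toy_inode;

struct toy_aops {
	int (*writepages)(struct toy_inode *inode);
};

struct toy_inode {
	bool in_icb;			/* data currently stored inside the inode */
	const struct toy_aops *aops;	/* set once at setup, never swapped */
	int inline_writes;		/* counts writes taken on the inline path */
	int block_writes;		/* counts writes taken on the block path */
};

static int toy_write_inline(struct toy_inode *inode)
{
	inode->inline_writes++;
	return 0;
}

static int toy_write_blocks(struct toy_inode *inode)
{
	inode->block_writes++;
	return 0;
}

/* Single method: branch on inode state instead of switching ops tables. */
static int toy_writepages(struct toy_inode *inode)
{
	assert(inode != NULL);
	if (inode->in_icb)
		return toy_write_inline(inode);
	return toy_write_blocks(inode);
}

static const struct toy_aops toy_aops = {
	.writepages = toy_writepages,
};

/* Expanding the file only flips state; the ops pointer is untouched. */
static void toy_expand(struct toy_inode *inode)
{
	inode->in_icb = false;
}
```

With this shape there is no window in which a caller can observe a
half-switched ops pointer, although the state flag itself would still need
whatever locking the real code uses.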

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR