Re: [syzbot] general protection fault in wb_timer_fn

From: Sven Schnelle
Date: Thu Aug 19 2021 - 05:04:01 EST


Christoph Hellwig <hch@xxxxxx> writes:

> On Mon, Aug 16, 2021 at 05:10:41PM +0800, Hillf Danton wrote:
>> Remove and free all qos callbacks added, with cb->timer deleted in
>> blk_stat_remove_callback().
>>
>> only for thoughts.
>>
>> +++ x/block/blk-sysfs.c
>> @@ -800,9 +800,7 @@ static void blk_release_queue(struct kob
>>
>>  	might_sleep();
>>
>> -	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
>> -		blk_stat_remove_callback(q, q->poll_cb);
>> -	blk_stat_free_callback(q->poll_cb);
>> +	rq_qos_exit(q);
>
> rq_qos_exit is already called in blk_cleanup_queue, and the blk-mq
> polling doesn't even use the qos framework. So I'm not sure what this
> is supposed to help.

I'm seeing a similar crash in our CI:

[ 464.072042] nbd0: detected capacity change from 0 to 2097152
[ 464.092297] nbd0: p1
[ 464.244242] EXT4-fs (nbd0p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[ 468.266306] block nbd0: NBD_DISCONNECT
[ 468.266318] block nbd0: Disconnected due to user request.
[ 468.266320] block nbd0: shutting down sockets
[ 468.291814] Unable to handle kernel pointer dereference in virtual kernel address space
[ 468.291817] Failing address: 000002aa264a7000 TEID: 000002aa264a7803
[ 468.291819] Fault in home space mode while using kernel ASCE.
[ 468.291822] AS:0000000159c84007 R3:0000000000000024
[ 468.291843] Oops: 003b ilc:3 [#1] SMP
[ 468.291846] Modules linked in: nbd(E-) xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) xt_tcpudp(E) nft_compat(E) nf_nat_tftp(E) nft_objref(E) nf_conntrack_tftp(E) nft_counter(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) dm_service_time(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) nf_tables(E) nfnetlink(E) sunrpc(E) zfcp(E) scsi_transport_fc(E) dm_multipath(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) mlx5_ib(E) ib_uverbs(E) ib_core(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) zcrypt_cex4(E) eadm_sch(E) sch_fq_codel(E) configfs(E) ip_tables(E) x_tables(E) ghash_s390(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha512_s390(E) sha256_s390(E) sha1_s390(E) sha_common(E) mlx5_core(E) nvme(E) nvme_core(E) pkey(E) zcrypt(E) rng_core(E) autofs4(E)
[ 468.291891] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G E 5.14.0-20210819.rc6.git0.f26c3abc432a.300.fc34.s390x+next #1
[ 468.291894] Hardware name: IBM 8561 T01 703 (LPAR)
[ 468.291895] Krnl PSW : 0704c00180000000 0000000158cfe3b6 (wb_timer_fn+0x56/0x538)
[ 468.291902] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 468.291905] Krnl GPRS: 0000000000000200 000002aa264a7018 0000000189fc3400 0000000000000000
[ 468.291907] fffffffffffc0000 0000000000000000 00000002f767c000 0000000158cc9420
[ 468.291909] 0000000000000000 0000000189fc3410 00000001e19622a0 0000000138e9a700
[ 468.291911] 0000000080378000 00000002f767c002 0000038000d43ca0 0000038000d43c40
[ 468.291937] Krnl Code: 0000000158cfe3a4: e380b0280004 lg %r8,40(%r11)
0000000158cfe3aa: e31010900004 lg %r1,144(%r1)
#0000000158cfe3b0: e31012000004 lg %r1,512(%r1)
>0000000158cfe3b6: e36010980004 lg %r6,152(%r1)
0000000158cfe3bc: ec88005e007c cgij %r8,0,8,0000000158cfe478
0000000158cfe3c2: e310b0300002 ltg %r1,48(%r11)
0000000158cfe3c8: a7840058 brc 8,0000000158cfe478
0000000158cfe3cc: c0e5ffce8822 brasl %r14,00000001586cf410
[ 468.291951] Call Trace:
[ 468.291953] [<0000000158cfe3b6>] wb_timer_fn+0x56/0x538
[ 468.291956] [<00000001586ca980>] call_timer_fn+0x38/0x178
[ 468.291960] [<00000001586cad58>] __run_timers.part.0+0x298/0x358
[ 468.291962] [<00000001586cae62>] run_timer_softirq+0x4a/0x88
[ 468.291964] [<0000000159149236>] __do_softirq+0x146/0x3c8
[ 468.291967] [<000000015862cbaa>] irq_exit+0xf2/0x120
[ 468.291970] [<000000015913a334>] do_ext_irq+0xd4/0x160
[ 468.291972] [<000000015914769c>] ext_int_handler+0xdc/0x110
[ 468.291974] [<0000000159147826>] psw_idle_exit+0x0/0xa
[ 468.291976] ([<00000001585dbfe8>] arch_cpu_idle+0x40/0xd0)
[ 468.291978] [<000000015914718a>] default_idle_call+0x42/0x108
[ 468.291980] [<000000015866ab6a>] do_idle+0xd2/0x160
[ 468.291983] [<000000015866adb6>] cpu_startup_entry+0x36/0x40
[ 468.291985] [<00000001585ef74e>] smp_start_secondary+0x86/0x90
[ 468.291987] Last Breaking-Event-Address:
[ 468.291989] [<0000038000d43d30>] 0x38000d43d30
[ 468.291992] Kernel panic - not syncing: Fatal exception in interrupt

The crash is likely triggered by nbd. wb_timer_fn+0x56 is block/blk-wbt.c:237,
as in the syzbot-reported crash. That line was touched just recently,
so I wonder whether that's related?
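
As a quick sanity check on the dump, here is a minimal sketch that recomputes
the address the faulting instruction tried to read. It assumes 4 KiB pages and
that %r1 in "Krnl GPRS" is still the base register of the faulting
"lg %r6,152(%r1)", which should hold because the preceding lg had already
loaded %r1 and the faulting one never completed. The helper names are made up
for illustration; all values are copied from the oops above.

```python
# Values copied from the oops above; nothing here is measured independently.
R1 = 0x000002AA264A7018               # %r1 from "Krnl GPRS"
DISPLACEMENT = 152                    # from "lg %r6,152(%r1)"
FAILING_ADDRESS = 0x000002AA264A7000  # from "Failing address:"
PAGE_MASK = ~0xFFF                    # assumption: 4 KiB pages


def effective_address(base, displacement):
    """Address accessed by a base + displacement load."""
    return base + displacement


def page_of(addr):
    """Round an address down to the start of its page."""
    return addr & PAGE_MASK


ea = effective_address(R1, DISPLACEMENT)
print(hex(ea), hex(page_of(ea)), page_of(ea) == FAILING_ADDRESS)
```

The computed page matches the reported failing address, so the pointer loaded
by "lg %r1,512(%r1)" is what was bad. That would be consistent with the timer
callback reading memory that had already been freed or reused, but this is
only a guess from the dump and not a confirmed root cause.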