Re: [PATCH] bcache: fix panic due to cache_set is null

From: Yi Li
Date: Thu Dec 03 2020 - 01:26:03 EST


> On 12/1/20 12:35 PM, Yi Li wrote:
>> sorry, This patch will cause deadlock, i will check and redo it.
>
> Can you try latest upstream kernel firstly ? Before spending more time
> on the fix.
>

This issue just happened three times (xenserver7.5 dom0 kernel) on the
same machine and cannot reproduce it now. and have not reproduce it
using the lastest uptream kernel.

> If I remember correctly, when cancel_writeback_rate_update_dwork() is
> not timed out, the cache set memory won't be freed before the
> writeback_rate_update worker terminates. It is possible that I miss
> something in the code, but I suggest to test with a kernel after v5.3,
> and better a v5.8+ kernel.
>
> Coly Li
>
Thanks.

it is confused that why writeback_rate_update worker run again after
cancel_delayed_work_sync( kernel log telled).

Maybe i should dig more.

Attach some debug info as below.

Nov 30 18:00:44 host-126 kernel: [ 1697.906137] bcache:
update_writeback_rate() dc = ffff880818580000
Nov 30 18:00:44 host-126 kernel: [ 1697.906145] bcache:
update_writeback_rate() c = ffff8808182c0000
Nov 30 18:00:44 host-126 kernel: [ 1697.906148] bcache:
update_writeback_rate() dc->disk.flags = ffff8808185800a0
Nov 30 18:00:44 host-126 kernel: [ 1697.906150] bcache:
update_writeback_rate() c->flags = ffff8808182c0368
Nov 30 18:00:45 host-126 kernel: [ 1698.578104] bcache:
update_writeback_rate() dc = ffff880839bd0000
Nov 30 18:00:45 host-126 kernel: [ 1698.578108] bcache:
update_writeback_rate() c = ffff8808182c0000
Nov 30 18:00:45 host-126 kernel: [ 1698.578109] bcache:
update_writeback_rate() dc->disk.flags = ffff880839bd00a0
Nov 30 18:00:45 host-126 kernel: [ 1698.578111] bcache:
update_writeback_rate() c->flags = ffff8808182c0368
Nov 30 18:00:49 host-126 kernel: [ 1702.914126] bcache:
update_writeback_rate() dc = ffff880818580000
Nov 30 18:00:49 host-126 kernel: [ 1702.914130] bcache:
update_writeback_rate() c = ffff8808182c0000
Nov 30 18:00:49 host-126 kernel: [ 1702.914132] bcache:
update_writeback_rate() dc->disk.flags = ffff8808185800a0
Nov 30 18:00:49 host-126 kernel: [ 1702.914133] bcache:
update_writeback_rate() c->flags = ffff8808182c0368
Nov 30 18:00:50 host-126 kernel: [ 1703.586182] bcache:
update_writeback_rate() dc = ffff880839bd0000
Nov 30 18:00:50 host-126 kernel: [ 1703.586188] bcache:
update_writeback_rate() c = ffff8808182c0000
Nov 30 18:00:50 host-126 kernel: [ 1703.586191] bcache:
update_writeback_rate() dc->disk.flags = ffff880839bd00a0
Nov 30 18:00:50 host-126 kernel: [ 1703.586193] bcache:
update_writeback_rate() c->flags = ffff8808182c0368
Nov 30 18:00:54 host-126 kernel: [ 1707.922215] bcache:
update_writeback_rate() dc = ffff880818580000
Nov 30 18:00:54 host-126 kernel: [ 1707.922222] bcache:
update_writeback_rate() c = ffff8808182c0000
Nov 30 18:00:54 host-126 kernel: [ 1707.922225] bcache:
update_writeback_rate() dc->disk.flags = ffff8808185800a0
Nov 30 18:00:54 host-126 kernel: [ 1707.922227] bcache:
update_writeback_rate() c->flags = ffff8808182c0368
Nov 30 18:00:55 host-126 kernel: [ 1708.594202] bcache:
update_writeback_rate() dc = ffff880839bd0000
Nov 30 18:00:55 host-126 kernel: [ 1708.594206] bcache:
update_writeback_rate() c = ffff8808182c0000
Nov 30 18:00:55 host-126 kernel: [ 1708.594208] bcache:
update_writeback_rate() dc->disk.flags = ffff880839bd00a0
Nov 30 18:00:55 host-126 kernel: [ 1708.594210] bcache:
update_writeback_rate() c->flags = ffff8808182c0368
Nov 30 18:00:55 host-126 kernel: [ 1709.118221] sd 0:0:1:0:
device_block, handle(0x0009)
Nov 30 18:00:58 host-126 kernel: [ 1711.368197] sd 0:0:1:0:
device_unblock and setting to running, handle(0x0009)
Nov 30 18:00:58 host-126 kernel: [ 1711.368263] sd 0:0:1:0: [sdb]
tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 30 18:00:58 host-126 kernel: [ 1711.368277] sd 0:0:1:0: [sdb]
tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Nov 30 18:00:58 host-126 kernel: [ 1711.368289] blk_update_request:
I/O error, dev sdb, sector 36160
Nov 30 18:00:58 host-126 kernel: [ 1711.368326] bcache: error on
96083de4-6b3e-4ede-81d1-44edc1a93729: journal io error, disabling
caching
Nov 30 18:00:58 host-126 kernel: [ 1711.368372] sd 0:0:1:0: [sdb]
tag#1 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 30 18:00:58 host-126 kernel: [ 1711.368387] sd 0:0:1:0: [sdb]
tag#1 CDB: Write(10) 2a 00 00 81 b0 40 00 00 08 00
Nov 30 18:00:58 host-126 kernel: [ 1711.368392] blk_update_request:
I/O error, dev sdb, sector 8499264
Nov 30 18:00:58 host-126 kernel: [ 1711.368407] bcache:
bch_count_io_errors() sdb2: IO error on writing data to cache.
Nov 30 18:00:58 host-126 kernel: [ 1711.368447] bcache:
bch_cached_dev_detach() bch_cached_dev_detach start
Nov 30 18:00:58 host-126 kernel: [ 1711.368458] bcache:
bch_cached_dev_detach() bch_cached_dev_detach end
Nov 30 18:00:58 host-126 kernel: [ 1711.368463] bcache:
conditional_stop_bcache_device() stop_when_cache_set_failed of bcache1
is "auto" and cache is dirty, stop it to avoid potential data
corruption.
Nov 30 18:00:58 host-126 kernel: [ 1711.368466] bcache:
bch_cached_dev_detach() bch_cached_dev_detach start
Nov 30 18:00:58 host-126 kernel: [ 1711.368474] bcache:
bch_cached_dev_detach() bch_cached_dev_detach end
Nov 30 18:00:58 host-126 kernel: [ 1711.368479] bcache:
conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0
is "auto" and cache is clean, keep it alive.
Nov 30 18:00:58 host-126 kernel: [ 1711.368494] Buffer I/O error on
dev bcache1, logical block 210979209, lost async page write
Nov 30 18:00:58 host-126 kernel: [ 1711.368585] bcache:
cached_dev_detach_finish() cached_dev_detach_finish start
Nov 30 18:00:58 host-126 kernel: [ 1711.368589] bcache:
cached_dev_detach_finish() dc->disk.flags = f
Nov 30 18:00:58 host-126 kernel: [ 1711.368591] bcache:
cancel_writeback_rate_update_dwork()
cancel_writeback_rate_update_dwork start
Nov 30 18:00:58 host-126 kernel: [ 1711.368594] bcache:
cancel_writeback_rate_update_dwork()
cancel_writeback_rate_update_dwork end
Nov 30 18:00:58 host-126 kernel: [ 1711.368617] bcache:
cached_dev_detach_finish() cached_dev_detach_finish start
Nov 30 18:00:58 host-126 kernel: [ 1711.369022] bcache:
cached_dev_detach_finish() Caching disabled for sda2
Nov 30 18:00:58 host-126 kernel: [ 1711.369025] bcache:
cached_dev_detach_finish() cached_dev_detach_finish end
Nov 30 18:00:58 host-126 kernel: [ 1711.369035] bcache:
cached_dev_detach_finish() dc->disk.flags = a
Nov 30 18:00:58 host-126 kernel: [ 1711.369039] bcache:
cancel_writeback_rate_update_dwork()
cancel_writeback_rate_update_dwork start
Nov 30 18:00:58 host-126 kernel: [ 1711.369042] bcache:
cancel_writeback_rate_update_dwork()
cancel_writeback_rate_update_dwork end
Nov 30 18:00:58 host-126 kernel: [ 1711.369460] bcache:
cached_dev_detach_finish() Caching disabled for sdf2
Nov 30 18:00:58 host-126 kernel: [ 1711.369464] bcache:
cached_dev_detach_finish() cached_dev_detach_finish end
Nov 30 18:00:58 host-126 kernel: [ 1711.374327] sd 0:0:1:0: [sdb]
Synchronizing SCSI cache
Nov 30 18:00:58 host-126 kernel: [ 1711.374516] sd 0:0:1:0: [sdb]
Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
Nov 30 18:00:58 host-126 kernel: [ 1711.376907] bcache:
cache_set_free() Cache set 96083de4-6b3e-4ede-81d1-44edc1a93729
unregistered
Nov 30 18:00:58 host-126 kernel: [ 1711.418979] mpt3sas_cm0: removing
handle(0x0009), sas_addr(0x4433221102000000)
Nov 30 18:00:58 host-126 kernel: [ 1711.418984] mpt3sas_cm0: removing
:enclosure logical id(0x5a4bf013514f4000), slot(4)
Nov 30 18:00:58 host-126 kernel: [ 1711.418988] mpt3sas_cm0: removing
:enclosure level(0x0000), connector name( )
Nov 30 18:01:05 host-126 kernel: [ 1718.873725] bcache:
cached_dev_free() cached_dev_free start
Nov 30 18:01:05 host-126 kernel: [ 1718.873776] bcache:
bcache_device_free() bcache1 stopped
Nov 30 18:01:05 host-126 kernel: [ 1718.926598] bcache:
cached_dev_free() cached_dev_free end
Nov 30 18:01:06 host-126 kernel: [ 1719.668540] sda: sda1 sda2
Nov 30 18:01:11 host-126 kernel: [ 1725.010305] bcache:
update_writeback_rate() dc = ffff880818580000
Nov 30 18:01:11 host-126 kernel: [ 1725.010309] bcache:
update_writeback_rate() c = (null)
------------the kernel log show that cache_set is null.
Nov 30 18:01:11 host-126 kernel: [ 1725.010311] bcache:
update_writeback_rate() dc->disk.flags = ffff8808185800a0
Nov 30 18:01:12 host-126 kernel: [ 1725.369791] mpt3sas_cm0:
detecting: handle(0x0009), sas_address(0x4433221102000000), phy(2)
Nov 30 18:01:12 host-126 kernel: [ 1725.369802] mpt3sas_cm0:
REPORT_LUNS: handle(0x0009), retries(0)
Nov 30 18:01:12 host-126 kernel: [ 1725.618138] mpt3sas_cm0:
TEST_UNIT_READY: handle(0x0009), lun(0)
Nov 30 18:01:13 host-126 kernel: [ 1726.619148] mpt3sas_cm0:
detecting: handle(0x0009), sas_address(0x4433221102000000), phy(2)
Nov 30 18:01:13 host-126 kernel: [ 1726.619156] mpt3sas_cm0:
REPORT_LUNS: handle(0x0009), retries(0)
Nov 30 18:01:13 host-126 kernel: [ 1726.619300] mpt3sas_cm0:
TEST_UNIT_READY: handle(0x0009), lun(0)
Nov 30 18:01:13 host-126 kernel: [ 1726.622323] scsi 0:0:6:0:
Direct-Access ATA INTEL SSDSC2BB48 0112 PQ: 0 ANSI: 6
Nov 30 18:01:13 host-126 kernel: [ 1726.622839] scsi 0:0:6:0: SATA:
handle(0x0009), sas_addr(0x4433221102000000), phy(2),
device_name(0x55cd2e414d6dcd4b)
Nov 30 18:01:13 host-126 kernel: [ 1726.622845] scsi 0:0:6:0: SATA:
enclosure logical id(0x5a4bf013514f4000), slot(4)
Nov 30 18:01:13 host-126 kernel: [ 1726.622847] scsi 0:0:6:0: SATA:
enclosure level(0x0000), connector name( )
Nov 30 18:01:13 host-126 kernel: [ 1726.622990] scsi 0:0:6:0:
atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Nov 30 18:01:13 host-126 kernel: [ 1726.623642] scsi 0:0:6:0:
serial_number(PHDV7102003Q480BGN )
Nov 30 18:01:13 host-126 kernel: [ 1726.626116] sd 0:0:6:0: Attached
scsi generic sg1 type 0
Nov 30 18:01:13 host-126 kernel: [ 1726.626784] sd 0:0:6:0: [sdb]
937703088 512-byte logical blocks: (480 GB/447 GiB)
Nov 30 18:01:13 host-126 kernel: [ 1726.626789] sd 0:0:6:0: [sdb]
4096-byte physical blocks
Nov 30 18:01:13 host-126 kernel: [ 1726.631161] sd 0:0:6:0: [sdb]
Write Protect is off
Nov 30 18:01:13 host-126 kernel: [ 1726.631165] sd 0:0:6:0: [sdb] Mode
Sense: 9b 00 10 08
Nov 30 18:01:13 host-126 kernel: [ 1726.632513] sd 0:0:6:0: [sdb]
Write cache: enabled, read cache: enabled, supports DPO and FUA
Nov 30 18:01:13 host-126 kernel: [ 1726.648068] sdb: sdb1 sdb2
Nov 30 18:01:13 host-126 kernel: [ 1726.656841] sd 0:0:6:0: [sdb]
Attached SCSI disk

>>
>> On 11/30/20, Yi Li <yili@xxxxxxxxxxx> wrote:
>>> bcache_device_detach will release the cache_set after hotunplug cache
>>> disk. update_writeback_rate should check validate of cache_set.
>>>
>>> IP: [<ffffffffa03730c9>] update_writeback_rate+0x59/0x3a0 [bcache]
>>> PGD 879620067 PUD 8755d3067 PMD 0
>>> Oops: 0000 [#1] SMP
>>> CPU: 8 PID: 1005702 Comm: kworker/8:0 Tainted: G 4.4.0+10 #1
>>> Hardware name: Intel BIOS SE5C610.86B.01.01.0021.032120170601
>>> 03/21/2017
>>> Workqueue: events update_writeback_rate [bcache]
>>> task: ffff8808786f3800 ti: ffff88077082c000 task.ti: ffff88077082c000
>>> RIP: e030:[<ffffffffa03730c9>] update_writeback_rate+0x59/0x3a0
>>> [bcache]
>>> RSP: e02b:ffff88077082fde0 EFLAGS: 00010202
>>> RAX: 0000000000000018 RBX: ffff8808047f0b08 RCX: 0000000000000000
>>> RDX: 0000000000000001 RSI: ffff88088170dab8 RDI: ffff88088170dab8
>>> RBP: ffff88077082fe18 R08: 000000000000000a R09: 0000000000000000
>>> R10: 0000000000000000 R11: 0000000000017bc8 R12: 0000000000000000
>>> R13: ffff8808047f0000 R14: 0000000000000200 R15: ffff8808047f0b08
>>> FS: 00007f157b6d6700(0000) GS:ffff880881700000(0000)
>>> knlGS:0000000000000000
>>> CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000000000368 CR3: 0000000875c05000 CR4: 0000000000040660
>>> Stack:
>>> 0000000000000001 0000000000007ff0 ffff88085ff600c0 ffff880881714e80
>>> ffff880881719500 0000000000000200 ffff8808047f0b08 ffff88077082fe60
>>> ffffffff81088c0c 0000000081714e80 0000000000000000 ffff880881714e80
>>> Call Trace:
>>> [<ffffffff81088c0c>] process_one_work+0x1fc/0x3b0
>>> [<ffffffff81089575>] worker_thread+0x2a5/0x470
>>> [<ffffffff815a2f58>] ? __schedule+0x648/0x870
>>> [<ffffffff810892d0>] ? rescuer_thread+0x300/0x300
>>> [<ffffffff8108e3d5>] kthread+0xd5/0xe0
>>> [<ffffffff8108e300>] ? kthread_stop+0x110/0x110
>>> [<ffffffff815a704f>] ret_from_fork+0x3f/0x70
>>> [<ffffffff8108e300>] ? kthread_stop+0x110/0x110
>>>
>>> Reported-by: Guo Chao <guochao@xxxxxxxxxxx>
>>> Signed-off-by: Guo Chao <guochao@xxxxxxxxxxx>
>>> Signed-off-by: Yi Li <yili@xxxxxxxxxxx>
>>> ---
>>> drivers/md/bcache/writeback.c | 12 +++++++++++-
>>> 1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/bcache/writeback.c
>>> b/drivers/md/bcache/writeback.c
>>> index 3c74996978da..186c4c6e1607 100644
>>> --- a/drivers/md/bcache/writeback.c
>>> +++ b/drivers/md/bcache/writeback.c
>>> @@ -175,7 +175,15 @@ static void update_writeback_rate(struct
>>> work_struct
>>> *work)
>>> struct cached_dev *dc = container_of(to_delayed_work(work),
>>> struct cached_dev,
>>> writeback_rate_update);
>>> - struct cache_set *c = dc->disk.c;
>>> + struct cache_set *c = NULL;
>>> +
>>> + mutex_lock(&bch_register_lock);
>>> + c = dc->disk.c;
>>> +
>>> + if (c == NULL) {
>>> + mutex_unlock(&bch_register_lock);
>>> + return;
>>> + }
>>>
>>> /*
>>> * should check BCACHE_DEV_RATE_DW_RUNNING before calling
>>> @@ -194,6 +202,7 @@ static void update_writeback_rate(struct work_struct
>>> *work)
>>> clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
>>> /* paired with where BCACHE_DEV_RATE_DW_RUNNING is tested */
>>> smp_mb__after_atomic();
>>> + mutex_unlock(&bch_register_lock);
>>> return;
>>> }
>>>
>>> @@ -230,6 +239,7 @@ static void update_writeback_rate(struct work_struct
>>> *work)
>>> clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
>>> /* paired with where BCACHE_DEV_RATE_DW_RUNNING is tested */
>>> smp_mb__after_atomic();
>>> + mutex_unlock(&bch_register_lock);
>>> }
>>>
>>> static unsigned int writeback_delay(struct cached_dev *dc,
>>> --
>>> 2.25.3
>>>
>>>
>>>
>>>
>
>