Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

From: Christian Borntraeger
Date: Tue Nov 21 2017 - 05:14:55 EST




On 11/21/2017 10:50 AM, Christian Borntraeger wrote:
>
>
> On 11/21/2017 09:35 AM, Christian Borntraeger wrote:
>>
>>
>> On 11/20/2017 09:52 PM, Jens Axboe wrote:
>>> On 11/20/2017 01:49 PM, Christian Borntraeger wrote:
>>>>
>>>>
>>>> On 11/20/2017 08:42 PM, Jens Axboe wrote:
>>>>> On 11/20/2017 12:29 PM, Christian Borntraeger wrote:
>>>>>>
>>>>>>
>>>>>> On 11/20/2017 08:20 PM, Bart Van Assche wrote:
>>>>>>> On Fri, 2017-11-17 at 15:42 +0100, Christian Borntraeger wrote:
>>>>>>>> This is
>>>>>>>>
>>>>>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1141) * are mapped to it.
>>>>>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1142) */
>>>>>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1143) WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
>>>>>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1144) cpu_online(hctx->next_cpu));
>>>>>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1145)
>>>>>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1146) /*
>>>>>>>
>>>>>>> Did you really try to figure out when the code that reported the warning
>>>>>>> was introduced? I think that warning was introduced through the following
>>>>>>> commit:
>>>>>>
>>>>>> This was more a cut'n'paste to show which warning triggered since line numbers are somewhat volatile.
>>>>>>
>>>>>>>
>>>>>>> commit fd1270d5df6a005e1248e87042159a799cc4b2c9
>>>>>>> Date: Wed Apr 16 09:23:48 2014 -0600
>>>>>>>
>>>>>>> blk-mq: don't use preempt_count() to check for right CPU
>>>>>>>
>>>>>>> UP or CONFIG_PREEMPT_NONE will return 0, and what we really
>>>>>>> want to check is whether or not we are on the right CPU.
>>>>>>> So don't make PREEMPT part of this, just test the CPU in
>>>>>>> the mask directly.
>>>>>>>
>>>>>>> Anyway, I think that warning is appropriate and useful. So the next step
>>>>>>> is to figure out what work item was involved and why that work item got
>>>>>>> executed on the wrong CPU.
>>>>>>
>>>>>> It seems to be related to virtio-blk (is triggered by fio on such disks). Your comment basically
>>>>>> says: "no this is not a known issue" then :-)
>>>>>> I will try to take a dump to find out the work item
>>>>>
>>>>> blk-mq does not attempt to freeze/sync existing work if a CPU goes away,
>>>>> and we reconfigure the mappings. So I don't think the above is unexpected,
>>>>> if you are doing CPU hot unplug while running a fio job.
>>>>
>>>> I did a cpu hot plug (adding a CPU) and I started fio AFTER that.
>>>
>>> OK, that's different, we should not be triggering a warning for that.
>>> What does your machine/virtblk topology look like in terms of CPUS,
>>> nr of queues for virtblk, etc?
>>
>> FWIW, 4.11 does work, 4.12 and later is broken.
>
> In fact: 4.12 is fine, 4.12.14 is broken.


Bisect points to

1b5a7455d345b223d3a4658a9e5fce985b7998c1 is the first bad commit
commit 1b5a7455d345b223d3a4658a9e5fce985b7998c1
Author: Christoph Hellwig <hch@xxxxxx>
Date: Mon Jun 26 12:20:57 2017 +0200

blk-mq: Create hctx for each present CPU

commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.

Currently we only create hctx for online CPUs, which can lead to a lot
of churn due to frequent soft offline / online operations. Instead
allocate one for each present CPU to avoid this and dramatically simplify
the code.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Jens Axboe <axboe@xxxxxxxxx>
Cc: Keith Busch <keith.busch@xxxxxxxxx>
Cc: linux-block@xxxxxxxxxxxxxxx
Cc: linux-nvme@xxxxxxxxxxxxxxxxxxx
Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@xxxxxx
Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

:040000 040000 a61cb023014a7b7a6b9f24ea04fe8ab22299e706 059ba6dc3290c74e0468937348e580cd53f963e7 M block
:040000 040000 432e719d7e738ffcddfb8fc964544d3b3e0a68f7 f4572aa21b249a851a1b604c148eea109e93b30d M include





adding Christoph FWIW, your patch triggers the following on 4.14 when doing a cpu hotplug (adding a
CPU) and then accessing a virtio-blk device.


747.652408] ------------[ cut here ]------------
[ 747.652410] WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 __blk_mq_run_hw_queue+0xd4/0x100
[ 747.652410] Modules linked in: dm_multipath
[ 747.652412] CPU: 4 PID: 2895 Comm: kworker/4:1H Tainted: G W 4.14.0+ #191
[ 747.652412] Hardware name: IBM 2964 NC9 704 (KVM/Linux)
[ 747.652414] Workqueue: kblockd blk_mq_run_work_fn
[ 747.652414] task: 0000000060680000 task.stack: 000000005ea30000
[ 747.652415] Krnl PSW : 0704f00180000000 0000000000505864 (__blk_mq_run_hw_queue+0xd4/0x100)
[ 747.652417] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[ 747.652417] Krnl GPRS: 0000000000000010 00000000000000ff 000000005cbec400 0000000000000000
[ 747.652418] 0000000063709120 0000000000000000 0000000063709500 0000000059fa44b0
[ 747.652418] 0000000059fa4480 0000000000000000 000000006370f700 0000000063709100
[ 747.652419] 000000005cbec500 0000000000970948 000000005ea33d80 000000005ea33d48
[ 747.652423] Krnl Code: 0000000000505854: ebaff0a00004 lmg %r10,%r15,160(%r15)
000000000050585a: c0f4ffe690d3 brcl 15,1d7a00
#0000000000505860: a7f40001 brc 15,505862
>0000000000505864: 581003b0 l %r1,944
0000000000505868: c01b001fff00 nilf %r1,2096896
000000000050586e: a784ffdb brc 8,505824
0000000000505872: a7f40001 brc 15,505874
0000000000505876: 9120218f tm 399(%r2),32
[ 747.652435] Call Trace:
[ 747.652435] ([<0000000063709600>] 0x63709600)
[ 747.652436] [<0000000000187bcc>] process_one_work+0x264/0x4b8
[ 747.652438] [<0000000000187e78>] worker_thread+0x58/0x4f8
[ 747.652439] [<000000000018ee94>] kthread+0x144/0x168
[ 747.652439] [<00000000008f8a62>] kernel_thread_starter+0x6/0xc
[ 747.652440] [<00000000008f8a5c>] kernel_thread_starter+0x0/0xc
[ 747.652440] Last Breaking-Event-Address:
[ 747.652441] [<0000000000505860>] __blk_mq_run_hw_queue+0xd0/0x100
[ 747.652442] ---[ end trace 4a001a80379b18ba ]---
[ 747.652450] ------------[ cut here ]------------