Re: bio linked list corruption.

From: Chris Mason
Date: Wed Oct 26 2016 - 19:15:29 EST


On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <clm@xxxxxx> wrote:

> > Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> > ran dbench 2048:
> >
> > [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> > [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
> > [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
> > [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
> > [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
> > [ 2759.125077] ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
> > [ 2759.125077] ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
> > [ 2759.127444] ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
> > [ 2759.127444] Call Trace:
> > [ 2759.127444] [<ffffffff814fe4ff>] dump_stack+0x53/0x74
> > [ 2759.127444] [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
> > [ 2759.127444] [<ffffffff81064bbf>] __warn+0xff/0x120
> > [ 2759.127444] [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
> > [ 2759.127444] [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
> > [ 2759.127444] [<ffffffff814df338>] blk_sq_make_request+0x388/0x580

> Ok, that's definitely the same one that Dave started out seeing.
>
> The fact that it is that reliable - two different machines, two very
> different loads (dbench looks nothing like trinity) really makes me
> think that maybe the problem really is in the block plugging after
> all.
>
> It very much does not smell like random stack corruption. It's simply
> not random enough.

Agreed. I'd feel better if I could trigger this outside of btrfs, even though Dave Chinner hit something very similar on xfs. I'll peel off another test machine for a long XFS run.
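For anyone following the warning text itself: lib/list_debug.c:33 is the prev->next sanity check in the debug version of __list_add(). Here's a minimal userspace sketch of that invariant (simplified names, not the kernel source verbatim):

```c
#include <stdio.h>
#include <stdbool.h>

struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Insert "new" between two supposedly adjacent entries, after checking
 * the same invariant CONFIG_DEBUG_LIST checks: prev->next must still
 * point at next, or something has scribbled on the list. */
static bool checked_list_add(struct list_head *new,
			     struct list_head *prev,
			     struct list_head *next)
{
	if (prev->next != next) {
		printf("list_add corruption. prev->next should be next (%p), "
		       "but was %p. (prev=%p).\n",
		       (void *)next, (void *)prev->next, (void *)prev);
		return false;
	}
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
	return true;
}
```

For a tail add, "next" is the list head itself, which is why the (ffffe8ffffc80308) value in the warning is interesting, and "but was" is whatever prev->next actually held when the check ran.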


> And I just noticed something: I originally thought that this is the
> "list_add_tail()" to the plug - which is the only "list_add()" variant
> in that function.
>
> But that never made sense, because the whole "but was" isn't a stack
> address, and "next" in "list_add_tail()" is basically fixed, and would
> have to be the stack.
>
> But I now notice that there's actually another "list_add()" variant
> there, and it's the one from __blk_mq_insert_request() that gets
> inlined into blk_mq_insert_request(), which then gets inlined into
> blk_mq_make_request().
>
> And that actually makes some sense just looking at the offsets too:
>
> blk_sq_make_request+0x388/0x580
>
> so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.
>
> And there it makes perfect sense that the "next should be" value is
> *not* on the stack.
>
> Chris, if you built with debug info, you can try
>
> ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388
>
> to get what line that blk_sq_make_request+0x388 address actually is. I
> think it's the
>
> list_add_tail(&rq->queuelist, &ctx->rq_list);
>
> in __blk_mq_insert_req_list() (when it's inlined from
> blk_sq_make_request(), "at_head" will be false).
>
> So it smells like "&ctx->rq_list" might be corrupt.

I'm running your current git here, so these line numbers should line up for you:

blk_sq_make_request+0x388/0x578:
__blk_mq_insert_request at block/blk-mq.c:1049
(inlined by) blk_mq_merge_queue_io at block/blk-mq.c:1175
(inlined by) blk_sq_make_request at block/blk-mq.c:1419
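So blk-mq.c:1049 is the plain tail insert onto the per-CPU software queue. A rough sketch of what's involved (heavily trimmed - the real struct blk_mq_ctx and struct request carry far more fields):

```c
/* Sketch of the insert at block/blk-mq.c:1049, trimmed for illustration. */
struct list_head {
	struct list_head *next, *prev;
};

struct blk_mq_ctx {		/* per-CPU software queue, percpu memory */
	struct list_head rq_list;
};

struct request {		/* preallocated per tag set, off-stack */
	struct list_head queuelist;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	/* new is spliced between head->prev and head, so the debug check
	 * compares head->prev->next against head: every pointer involved
	 * lives in percpu or tag-set memory, never on the stack, which
	 * matches the non-stack "but was" value in the warning. */
	struct list_head *prev = head->prev;

	new->next = head;
	new->prev = prev;
	prev->next = new;
	head->prev = new;
}
```

Since both the ctx and the requests live off-stack, a corrupt rq_list head would produce exactly the kind of addresses the warning printed.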

The fsync path in the WARN doesn't have any plugs that I can see, so it's not surprising that we're not in the plugging path. I'm here:

	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
		/*
		 * For a SYNC request, send it to the hardware immediately. For
		 * an ASYNC request, just ensure that we run it later on. The
		 * latter allows for merging opportunities and more efficient
		 * dispatching.
		 */

I'll try the debugging patch you sent in the other email.

-chris