Re: [MD] Crash with 4.12+ kernel and high disk load -- bisected to 4ad23a976413: MD: use per-cpu counter for writes_pending

From: Dominik Brodowski
Date: Tue Aug 08 2017 - 03:11:10 EST

Next message: Elaine Zhang: "[PATCH v3 0/9] clk: rockchip: support clk for RV1108 Soc"
Previous message: Icenowy Zheng: "[PATCH v2] mmc: sunxi: fix support for new timings mode only SoCs"
In reply to: NeilBrown: "Re: [MD] Crash with 4.12+ kernel and high disk load -- bisected to 4ad23a976413: MD: use per-cpu counter for writes_pending"
Next in thread: NeilBrown: "Re: [MD] Crash with 4.12+ kernel and high disk load -- bisected to 4ad23a976413: MD: use per-cpu counter for writes_pending"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Shaouhua,

( really CC'ing Tejun now )

On Mon, Aug 07, 2017 at 09:51:03PM -0700, Shaohua Li wrote:
> On Mon, Aug 07, 2017 at 01:20:25PM +0200, Dominik Brodowski wrote:
> > Neil, Shaohua,
> >
> > following up on David R's bug message: I have observed something similar
> > on v4.12.[345] and v4.13-rc4, but not on v4.11. This is a RAID1 (on bare
> > metal partitions, /dev/sdaX and /dev/sdbY linked together). In case it
> > matters: Further upwards are cryptsetup, a DM volume group, then logical
> > volumes, and then filesystems (ext4, but also happened with xfs).
> >
> > In a tedious bisect (the bug wasn't as quickly reproducible as I would like,
> > but happened when I repeatedly created large lvs and filled them with some
> > content, while compiling kernels in parallel), I was able to track this
> > down to:
> >
> >
> > commit 4ad23a976413aa57fe5ba7a25953dc35ccca5b71
> > Author: NeilBrown <neilb@xxxxxxxx>
> > Date: Wed Mar 15 14:05:14 2017 +1100
> >
> > MD: use per-cpu counter for writes_pending
> >
> > The 'writes_pending' counter is used to determine when the
> > array is stable so that it can be marked in the superblock
> > as "Clean". Consequently it needs to be updated frequently
> > but only checked for zero occasionally. Recent changes to
> > raid5 cause the count to be updated even more often - once
> > per 4K rather than once per bio. This provided
> > justification for making the updates more efficient.
> >
> > ...
> >
> >
> > CC'ing tj@xxxxxxxxxx, as 4ad23a976413 is the first (and only?) user
> > of percpu_ref_switch_to_atomic_sync() introduced in 210f7cdcf088.
> >
> > Applying a415c0f10627 on top of 4ad23a976413 does *not* fix the issue, but
> > reverting all of a2bfc6753065, a415c0f10627 and 4ad23a976413 seems to fix
> > the issue for v4.12.5.
>
> Spent some time to check this one,

Thanks for taking a look at this issue!

> unfortunately I can't find how that patch makes rcu stall. the percpu part
> looks good to me too. Can you double check if reverting 4ad23a976413aa57
> makes the issue go away?

v4.12.5 with the three patches reverted survived many hours of testing
yesterday. Rebooted into 4ad23a976413aa57 (rebuilt with lockup detection),
and got back traces for mdX_raid1 after a few minutes. The following test
was done with plain v4.12.5, with the first stall after around ~1 minute:

> When the rcu stall happens, what the /sys/block/md/md0/array_state?

First, a "ps axf | grep 'md'" snippet:

498 ? S< 0:00 \_ [md]
1163 ? D 0:01 \_ [md0_raid1]
1172 ? D 0:07 \_ [md1_raid1]
1182 ? D 0:00 \_ [md2_raid1]
1192 ? R 1:35 \_ [md3_raid1]

(why are md[012]_raid1 continuously in "D" state?)

array_state for md0: active
array_state for md1: active-idle
array_state for md2: active-idle
array_state for md3: clean

> please also attach /proc/mdstat.

Personalities : [raid1]
md3 : active raid1 sda8[0] sdb5[1]
174981120 blocks super 1.2 [2/2] [UU]
bitmap: 0/2 pages [0KB], 65536KB chunk

md2 : active raid1 sda7[0] sdb6[1]
174981120 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda6[0] sdb7[1]
174981120 blocks super 1.2 [2/2] [UU]
bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda5[0] sdb8[1]
174981120 blocks super 1.2 [2/2] [UU]
bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>

> When you say the mdx_raid1 threads are in 'R' state, can you double check
> if the /proc/pid/stack always 0xffffffffff?

root@i3test ~ # cat /proc/1192/stack # md3_raid1
[<ffffffffffffffff>] 0xffffffffffffffff

root@i3test ~ # cat /proc/1182/stack
[<ffffffff8143b9ff>] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[<ffffffff815d03a8>] set_in_sync+0x68/0xe0
[<ffffffff815db314>] md_check_recovery+0x284/0x4c0
[<ffffffff815cc45c>] raid1d+0x4c/0x910
[<ffffffff815d030d>] md_thread+0x12d/0x160
[<ffffffff81141bb1>] kthread+0x131/0x170
[<ffffffff81793dd7>] ret_from_fork+0x27/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

root@i3test ~ # cat /proc/1172/stack
[<ffffffff8143b9ff>] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[<ffffffff815d03a8>] set_in_sync+0x68/0xe0
[<ffffffff815db314>] md_check_recovery+0x284/0x4c0
[<ffffffff815cc45c>] raid1d+0x4c/0x910
[<ffffffff815d030d>] md_thread+0x12d/0x160
[<ffffffff81141bb1>] kthread+0x131/0x170
[<ffffffff81793dd7>] ret_from_fork+0x27/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

root@i3test ~ # cat /proc/1163/stack
[<ffffffff8143b9ff>] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[<ffffffff815d03a8>] set_in_sync+0x68/0xe0
[<ffffffff815db314>] md_check_recovery+0x284/0x4c0
[<ffffffff815cc45c>] raid1d+0x4c/0x910
[<ffffffff815d030d>] md_thread+0x12d/0x160
[<ffffffff81141bb1>] kthread+0x131/0x170
[<ffffffff81793dd7>] ret_from_fork+0x27/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

So the other md[012]_raid1 threads are waiting in
percpu_ref_switch_to_atomic_sync(). And md3_raid1 is,
(according to NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s!
[md3_raid1:1192], stuck somewhere in raid1d...

[ 92.564717] task: ffff880234442e80 task.stack: ffffc9000102c000
[ 92.564781] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
[ 92.564843] RSP: 0018:ffffc9000102fdb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[ 92.564913] RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff8802321a1e00
[ 92.564995] RDX: ffffffff815cca08 RSI: 0000000000000001 RDI: ffffffff81793291
[ 92.565077] RBP: ffffc9000102fdc0 R08: 0000000000000000 R09: 0000000000000000
[ 92.565159] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802321a1e18
[ 92.565241] R13: ffff8802321a1e70 R14: ffff880234442e80 R15: 0000000000000000
[ 92.565323] FS: 0000000000000000(0000) GS:ffff880236e00000(0000) knlGS:0000000000000000
[ 92.565425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 92.565502] CR2: 00000083b78e4e30 CR3: 0000000001c23000 CR4: 00000000000006e0
[ 92.565585] Call Trace:
[ 92.565657] raid1d+0x5f8/0x910
[ 92.565729] ? retint_kernel+0x10/0x10
[ 92.565802] md_thread+0x12d/0x160
[ 92.565874] ? md_thread+0x12d/0x160
[ 92.565947] ? __wake_up_common+0x80/0x80
[ 92.566022] kthread+0x131/0x170
[ 92.566093] ? find_pers+0x70/0x70
[ 92.566165] ? __kthread_create_on_node+0x220/0x220
[ 92.566241] ret_from_fork+0x27/0x40
[ 92.566312] Code: fc 48 8b 55 08 53 48 8d 7f 18 48 89 f3 be 01 00 00 00 e8 b1 ce 9d ff 4c 89 e7 e8 19 0c 9e ff f6 c7 02 74 13 e8 ff 90 9d ff 53 9d <65> ff 0d 3e 94 87 7e 5b 41 5c 5d c3 53 9d e8 4a 93 9d ff eb eb

Thanks,
Dominik

Attachment: signature.asc
Description: PGP signature

Next message: Elaine Zhang: "[PATCH v3 0/9] clk: rockchip: support clk for RV1108 Soc"
Previous message: Icenowy Zheng: "[PATCH v2] mmc: sunxi: fix support for new timings mode only SoCs"
In reply to: NeilBrown: "Re: [MD] Crash with 4.12+ kernel and high disk load -- bisected to 4ad23a976413: MD: use per-cpu counter for writes_pending"
Next in thread: NeilBrown: "Re: [MD] Crash with 4.12+ kernel and high disk load -- bisected to 4ad23a976413: MD: use per-cpu counter for writes_pending"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]