Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

From: Dave Chinner
Date: Fri Sep 18 2015 - 01:40:55 EST


On Thu, Sep 17, 2015 at 06:50:29PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >> >
> >> > I'm not seeing why that should be an issue. Sure, there's some CPU
> >> > overhead to context switching, but I don't see that it should be that
> >> > big of a deal.
> >
> > It may well change the dispatch order of enough IOs for it to be
> > significant on an IO bound device.
>
> Hmm. Maybe. We obviously try to order the IO's a bit by inode, and I
> could see the use of a workqueue maybe changing that sufficiently. But
> it still sounds a bit unlikely.
>
> And in fact, I think I have a better explanation.
>
> > In outright performance on my test machine, the difference in
> > files/s is noise. However, the consistency looks to be substantially
> > improved and the context switch rate is now running at under
> > 3,000/sec.
>
> Hmm. I don't recall seeing you mention how many context switches per
> second you had before. What is it down from?

Around 4000-4500/sec, so there's not a huge amount of context
switches going on. It's not like btrfs, where this workload
generates over 250,000 context switches/sec....

> However, I think I may have found something more interesting here.
>
> The fact is that a *normal* schedule will trigger that whole
> blk_schedule_flush_plug(), but a cond_sched() or a cond_sched_lock()
> doesn't actually do a normal schedule at all. Those trigger a
> *preemption* schedule.

Ok, makes sense - the plug is not being flushed as we switch away,
but Chris' patch makes it do that.

> So it turns out that as far as I can see, the whole "cond_resched()"
> will not start any IO at all, and it will just be left on the thread
> plug until we schedule back to the thread.
>
> So I don't think this has anything to do with kblockd_workqueue. I
> don't think it even gets to that point.
>
> I may be missing something, but just to humor me, can you test the
> attached patch *without* Chris's patch to do explicit plugging? This
> should make cond_resched() and cond_resched_lock() run the unplugging.

Context switches go back to the 4-4500/sec range. Otherwise
behaviour and performance is indistinguishable from Chris' patch.

Cheers,

Dave.


PS: just hit another "did this just get broken in 4.3-rc1" issue - I
can't run blktrace while there's a IO load because:

$ sudo blktrace -d /dev/vdc
BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
....

[ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
[ 641.425952] CPU: 13 PID: 11163 Comm: blktrace Not tainted 4.3.0-rc1-dgc+ #430
[ 641.427321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 641.429066] 0000000000000005 ffff8800b731faa0 ffffffff81785609 00000000002040d0
[ 641.430585] ffff8800b731fb28 ffffffff81183688 0000000000000000 ffffffff824926f8
[ 641.432143] ffff8800b731faf0 0000000000000042 0000000000000010 0000000000000100
[ 641.433695] Call Trace:
[ 641.434196] [<ffffffff81785609>] dump_stack+0x4b/0x72
[ 641.435219] [<ffffffff81183688>] warn_alloc_failed+0xd8/0x130
[ 641.436402] [<ffffffff811863bb>] __alloc_pages_nodemask+0x66b/0x7d0
[ 641.437684] [<ffffffff811c0e19>] cache_grow.constprop.61+0xc9/0x2d0
[ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
[ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
[ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
[ 641.442492] [<ffffffff8115e041>] blk_trace_setup+0x51/0x90
[ 641.443586] [<ffffffff8115e229>] blk_trace_ioctl+0xb9/0x110
[ 641.444702] [<ffffffff811e586c>] ? mntput_no_expire+0x2c/0x1a0
[ 641.446419] [<ffffffff8176fa58>] blkdev_ioctl+0x528/0x690
[ 641.447489] [<ffffffff811fd37d>] block_ioctl+0x3d/0x50
[ 641.449194] [<ffffffff811d93da>] do_vfs_ioctl+0x2ba/0x490
[ 641.450491] [<ffffffff811d9629>] SyS_ioctl+0x79/0x90
[ 641.451575] [<ffffffff81d800ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 641.453231] Mem-Info:
[ 641.453714] active_anon:11651 inactive_anon:2411 isolated_anon:0
active_file:8456 inactive_file:2070698 isolated_file:0
unevictable:1011 dirty:374605 writeback:1176 unstable:0
slab_reclaimable:1514989 slab_unreclaimable:233692
mapped:9332 shmem:2250 pagetables:1891 bounce:0
free:24305 free_pcp:1448 free_cma:0
[ 641.461950] DMA free:15908kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 641.470017] lowmem_reserve[]: 0 2960 16005 16005
[ 641.471041] DMA32 free:57076kB min:2992kB low:3740kB high:4488kB active_anon:8692kB inactive_anon:2572kB active_file:6224kB inactive_file:1507352kB unevictable:1004kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:3032864kB mlocked:1004kB dirty:275336kB writeback:896kB mapped:6800kB shmem:1836kB slab_reclaimable:1086292kB slab_unreclaimable:183168kB kernel_stack:23600kB pagetables:1372kB unstable:0kB bounce:0kB free_pcp:3440kB local_pcp:16kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 641.479677] lowmem_reserve[]: 0 0 13044 13044
[ 641.480888] Normal free:18040kB min:13188kB low:16484kB high:19780kB active_anon:38264kB inactive_anon:7072kB active_file:27600kB inactive_file:6775516kB unevictable:3040kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13357908kB mlocked:3040kB dirty:1223624kB writeback:5144kB mapped:30528kB shmem:7164kB slab_reclaimable:4975584kB slab_unreclaimable:755076kB kernel_stack:44912kB pagetables:6192kB unstable:0kB bounce:0kB free_pcp:3432kB local_pcp:88kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 641.492392] lowmem_reserve[]: 0 0 0 0
[ 641.493170] DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[ 641.496035] DMA32: 6584*4kB (UEM) 3783*8kB (UM) 17*16kB (M) 9*32kB (M) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 57352kB
[ 641.498705] Normal: 3372*4kB (UE) 393*8kB (UEM) 37*16kB (UEM) 33*32kB (UEM) 6*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18664kB
[ 641.501479] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 641.503019] 2082056 total pagecache pages
[ 641.503759] 0 pages in swap cache
[ 641.504403] Swap cache stats: add 0, delete 0, find 0/0
[ 641.505374] Free swap = 497976kB
[ 641.506002] Total swap = 497976kB
[ 641.506622] 4194173 pages RAM
[ 641.507171] 0 pages HighMem/MovableOnly
[ 641.507878] 92503 pages reserved

This is from:

gdb) l *(relay_open+0x68)
0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
577 return NULL;
578 if (subbuf_size > UINT_MAX / n_subbufs)
579 return NULL;
580
581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
582 if (!chan)
583 return NULL;
584
585 chan->version = RELAYFS_CHANNEL_VERSION;
586 chan->n_subbufs = n_subbufs;

and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
that fails here....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/