Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

From: Dave Chinner
Date: Fri Sep 18 2015 - 19:16:54 EST


On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
>
> Yup.
>
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.
>
> The fact that Chris' patch ends up lowering the context switches
> (because it does the unplugging directly) is also an argument for his
> approach.
>
> I just wanted to understand the oddity with kblockd_workqueue. And I
> think that's solved.
>
> > Context switches go back to the 4-4500/sec range. Otherwise
> > behaviour and performance is indistinguishable from Chris' patch.
>
> .. this was exactly what I wanted to hear. So it sounds like we have
> no odd unexplained behavior left in this area.
>
> Which is not to say that there wouldn't be room for improvement, but
> it just makes me much happier about the state of these patches to feel
> like we understand what was going on.
>
> > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > can't run blktrace while there's an IO load because:
> >
> > $ sudo blktrace -d /dev/vdc
> > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > ....
> >
> > [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> >
> > (gdb) l *(relay_open+0x68)
> > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > 577 return NULL;
> > 578 if (subbuf_size > UINT_MAX / n_subbufs)
> > 579 return NULL;
> > 580
> > 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > 582 if (!chan)
> > 583 return NULL;
> > 584
> > 585 chan->version = RELAYFS_CHANNEL_VERSION;
> > 586 chan->n_subbufs = n_subbufs;
> >
> > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > that fails here....
>
> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> From a quick check, none of this code seems to be new.

Yes, I always build MAX_SMP kernels for testing, because XFS is
often used on such machines and so I want to find issues exactly
like this in my testing rather than on customer machines... :/

> That said, having that
>
> struct rchan_buf *buf[NR_CPUS];
>
> in "struct rchan" really is something we should fix. We really should
> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> real CPU count.

*nod*. But it doesn't fix the problem of the memory allocation
failing when there are still gigabytes of immediately reclaimable
memory available in the page cache. If this is failing under page
cache memory pressure, then we're going to be doing an awful lot
more falling back to vmalloc in the filesystem code where large
allocations like this are done, e.g. extended attribute buffers are
order-5 and are used a lot when doing things like backups, which
also tend to produce significant page cache memory pressure.

Hence I'm leaning towards this being a memory reclaim behaviour
regression, rather than worrying about whether this specific
allocation is optimal or not.
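
For anyone following along, the arithmetic behind the order-5
allocation can be sketched in userspace C roughly as below. This is
only an illustration, not kernel code: the 4KiB page size, the 8-byte
pointers, the helper names alloc_order() and rchan_like_order(), and
the other_bytes stand-in for the remaining members of struct rchan are
all assumptions made for the example.

```c
#include <stddef.h>

#define PAGE_BYTES 4096UL /* assumption: 4KiB pages */
#define PTR_BYTES  8UL    /* assumption: 64-bit pointers */

/* Smallest order such that (PAGE_BYTES << order) >= size. */
unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;

	while ((PAGE_BYTES << order) < size)
		order++;
	return order;
}

/*
 * Order needed for a structure holding one pointer per CPU slot plus
 * other_bytes of unrelated fields. other_bytes is a hypothetical
 * stand-in for the rest of struct rchan, not a measured value.
 */
unsigned int rchan_like_order(size_t nr_slots, size_t other_bytes)
{
	return alloc_order(nr_slots * PTR_BYTES + other_bytes);
}
```

Under those assumptions, rchan_like_order(8192, 64) comes out at order
5: the pointer array alone fills an order-4 (64KiB) allocation exactly,
so anything else in the structure pushes it over. Sizing by the CPUs
actually present, e.g. rchan_like_order(8, 64), fits in a single page.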

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx