Re: [RFC 2/2] x86_64: expand kernel stack to 16K

From: Minchan Kim
Date: Fri May 30 2014 - 02:11:44 EST

Next message: Yoshihiro YUNOMAE: "[PATCH V8 1/2] serial/uart: Introduce device specific attribute group to uart_port structure"
Previous message: Stephen Rothwell: "linux-next: build failure after merge of the pinctrl tree"
In reply to: Linus Torvalds: "Re: [RFC 2/2] x86_64: expand kernel stack to 16K"
Next in thread: Linus Torvalds: "Re: [RFC 2/2] x86_64: expand kernel stack to 16K"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Final result,

I tested the machine with the patch below (Dave's suggestion plus some
parts I modified) and I couldn't see the problem any more. I tested for
4 hours, and I will queue it on the machine over the weekend for a
long-running test if I don't get a more enhanced version before leaving
the office today. But as I reported in the interim result, the VM's
stack usage is still high.

Anyway, that is another issue: we should really put the VM functions on
a diet (e.g., uninlining the slow path part of __alloc_pages_nodemask,
the alloc_info idea from Linus, and more).

Looking forward to seeing the blk_plug_start_async way.
Thanks, Dave!

---
 block/blk-core.c    | 2 +-
 block/blk-mq.c      | 2 +-
 kernel/sched/core.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
 			trace_block_plug(q);
 		else {
 			if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 		}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 883f72089015..6e72e700d11e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -897,7 +897,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 			if (list_empty(&plug->mq_list))
 				trace_block_plug(q);
 			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 			list_add_tail(&rq->queuelist, &plug->mq_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)

 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)

 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
--
1.9.2

On Fri, May 30, 2014 at 11:12:47AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> > On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > > Hello Dave,
> > >
> > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > > Author: Jens Axboe <jaxboe@xxxxxxxxxxxx>
> > > > > Date: Sat Apr 16 13:27:55 2011 +0200
> > > > >
> > > > > block: let io_schedule() flush the plug inline
> > > > >
> > > > > Linus correctly observes that the most important dispatch cases
> > > > > are now done from kblockd, this isn't ideal for latency reasons.
> > > > > The original reason for switching dispatches out-of-line was to
> > > > > avoid too deep a stack, so by _only_ letting the "accidental"
> > > > > flush directly in schedule() be guarded by offload to kblockd,
> > > > > we should be able to get the best of both worlds.
> > > > >
> > > > > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > > and only use that from the schedule() path.
> > > > >
> > > > > Signed-off-by: Jens Axboe <jaxboe@xxxxxxxxxxxx>
> > > > >
> > > > > And now we have too deep a stack due to unplugging from io_schedule()...
> > > >
> > > > So, if we make io_schedule() push the plug list off to the kblockd
> > > > like is done for schedule()....
> > ....
> > > I ran the hacky test below to apply your idea, and the result is an overflow again.
> > > So, again it would second stack expansion. Otherwise, we should prevent
> > > swapout in direct reclaim.
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f5c6635b806c..95f169e85dbe 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> > > void __sched io_schedule(void)
> > > {
> > > struct rq *rq = raw_rq();
> > > + struct blk_plug *plug = current->plug;
> > >
> > > delayacct_blkio_start();
> > > atomic_inc(&rq->nr_iowait);
> > > - blk_flush_plug(current);
> > > + if (plug)
> > > + blk_flush_plug_list(plug, true);
> > > +
> > > current->in_iowait = 1;
> > > schedule();
> > > current->in_iowait = 0;
> >
> > .....
> >
> > > Depth Size Location (46 entries)
> > >
> > > 0) 7200 8 _raw_spin_lock_irqsave+0x51/0x60
> > > 1) 7192 296 get_page_from_freelist+0x886/0x920
> > > 2) 6896 352 __alloc_pages_nodemask+0x5e1/0xb20
> > > 3) 6544 8 alloc_pages_current+0x10f/0x1f0
> > > 4) 6536 168 new_slab+0x2c5/0x370
> > > 5) 6368 8 __slab_alloc+0x3a9/0x501
> > > 6) 6360 80 __kmalloc+0x1cb/0x200
> > > 7) 6280 376 vring_add_indirect+0x36/0x200
> > > 8) 5904 144 virtqueue_add_sgs+0x2e2/0x320
> > > 9) 5760 288 __virtblk_add_req+0xda/0x1b0
> > > 10) 5472 96 virtio_queue_rq+0xd3/0x1d0
> > > 11) 5376 128 __blk_mq_run_hw_queue+0x1ef/0x440
> > > 12) 5248 16 blk_mq_run_hw_queue+0x35/0x40
> > > 13) 5232 96 blk_mq_insert_requests+0xdb/0x160
> > > 14) 5136 112 blk_mq_flush_plug_list+0x12b/0x140
> > > 15) 5024 112 blk_flush_plug_list+0xc7/0x220
> > > 16) 4912 128 blk_mq_make_request+0x42a/0x600
> > > 17) 4784 48 generic_make_request+0xc0/0x100
> > > 18) 4736 112 submit_bio+0x86/0x160
> > > 19) 4624 160 __swap_writepage+0x198/0x230
> > > 20) 4464 32 swap_writepage+0x42/0x90
> > > 21) 4432 320 shrink_page_list+0x676/0xa80
> > > 22) 4112 208 shrink_inactive_list+0x262/0x4e0
> > > 23) 3904 304 shrink_lruvec+0x3e1/0x6a0
> >
> > The device is supposed to be plugged here in shrink_lruvec().
> >
> > Oh, a plug can only hold 16 individual bios, and then it does a
> > synchronous flush. Hmmm - perhaps that should also defer the flush
> > to the kblockd, because if we are overrunning a plug then we've
> > already surrendered IO dispatch latency....
> >
> > So, in blk_mq_make_request(), can you do:
> >
> > if (list_empty(&plug->mq_list))
> > trace_block_plug(q);
> > else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> > - blk_flush_plug_list(plug, false);
> > + blk_flush_plug_list(plug, true);
> > trace_block_plug(q);
> > }
> > list_add_tail(&rq->queuelist, &plug->mq_list);
> >
> > To see if that defers all the swap IO to kblockd?
> >
>
> Interim report,
>
> I applied the patch below (we also need to fix io_schedule_timeout because of mempool_alloc)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bfe16d5af9f9..0c81aacec75b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1585,7 +1585,7 @@ get_rq:
> trace_block_plug(q);
> else {
> if (request_count >= BLK_MAX_REQUEST_COUNT) {
> - blk_flush_plug_list(plug, false);
> + blk_flush_plug_list(plug, true);
> trace_block_plug(q);
> }
> }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5c6635b806c..ebca9e1f200f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
>
> delayacct_blkio_start();
> atomic_inc(&rq->nr_iowait);
> - blk_flush_plug(current);
> + blk_schedule_flush_plug(current);
> current->in_iowait = 1;
> schedule();
> current->in_iowait = 0;
> @@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
>
> delayacct_blkio_start();
> atomic_inc(&rq->nr_iowait);
> - blk_flush_plug(current);
> + blk_schedule_flush_plug(current);
> current->in_iowait = 1;
> ret = schedule_timeout(timeout);
> current->in_iowait = 0;
>
> And the result is as follows. It reduces stack usage by about 800 bytes
> compared to my first report, but stack usage still seems to be high.
> The VM functions really need a diet.
>
> Depth Size Location
> ----- ---- --------
> 0) 6896 16 lookup_address+0x28/0x30
> 1) 6880 16 _lookup_address_cpa.isra.3+0x3b/0x40
> 2) 6864 304 __change_page_attr_set_clr+0xe0/0xb50
> 3) 6560 112 kernel_map_pages+0x6c/0x120
> 4) 6448 256 get_page_from_freelist+0x489/0x920
> 5) 6192 352 __alloc_pages_nodemask+0x5e1/0xb20
> 6) 5840 8 alloc_pages_current+0x10f/0x1f0
> 7) 5832 168 new_slab+0x35d/0x370
> 8) 5664 8 __slab_alloc+0x3a9/0x501
> 9) 5656 80 kmem_cache_alloc+0x1ac/0x1c0
> 10) 5576 296 mempool_alloc_slab+0x15/0x20
> 11) 5280 128 mempool_alloc+0x5e/0x170
> 12) 5152 96 bio_alloc_bioset+0x10b/0x1d0
> 13) 5056 48 get_swap_bio+0x30/0x90
> 14) 5008 160 __swap_writepage+0x150/0x230
> 15) 4848 32 swap_writepage+0x42/0x90
> 16) 4816 320 shrink_page_list+0x676/0xa80
> 17) 4496 208 shrink_inactive_list+0x262/0x4e0
> 18) 4288 304 shrink_lruvec+0x3e1/0x6a0
> 19) 3984 80 shrink_zone+0x3f/0x110
> 20) 3904 128 do_try_to_free_pages+0x156/0x4c0
> 21) 3776 208 try_to_free_pages+0xf7/0x1e0
> 22) 3568 352 __alloc_pages_nodemask+0x783/0xb20
> 23) 3216 8 alloc_pages_current+0x10f/0x1f0
> 24) 3208 168 new_slab+0x2c5/0x370
> 25) 3040 8 __slab_alloc+0x3a9/0x501
> 26) 3032 80 kmem_cache_alloc+0x1ac/0x1c0
> 27) 2952 296 mempool_alloc_slab+0x15/0x20
> 28) 2656 128 mempool_alloc+0x5e/0x170
> 29) 2528 96 bio_alloc_bioset+0x10b/0x1d0
> 30) 2432 48 mpage_alloc+0x38/0xa0
> 31) 2384 208 do_mpage_readpage+0x49b/0x5d0
> 32) 2176 224 mpage_readpages+0xcf/0x120
> 33) 1952 48 ext4_readpages+0x45/0x60
> 34) 1904 224 __do_page_cache_readahead+0x222/0x2d0
> 35) 1680 16 ra_submit+0x21/0x30
> 36) 1664 112 filemap_fault+0x2d7/0x4f0
> 37) 1552 144 __do_fault+0x6d/0x4c0
> 38) 1408 160 handle_mm_fault+0x1a6/0xaf0
> 39) 1248 272 __do_page_fault+0x18a/0x590
> 40) 976 16 do_page_fault+0xc/0x10
> 41) 960 208 page_fault+0x22/0x30
> 42) 752 16 clear_user+0x2e/0x40
> 43) 736 16 padzero+0x2d/0x40
> 44) 720 304 load_elf_binary+0xa47/0x1a40
> 45) 416 48 search_binary_handler+0x9c/0x1a0
> 46) 368 144 do_execve_common.isra.25+0x58d/0x700
> 47) 224 16 do_execve+0x18/0x20
> 48) 208 32 SyS_execve+0x2e/0x40
> 49) 176 176 stub_execve+0x69/0xa0
>
>
>
> --
> Kind regards,
> Minchan Kim
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .

--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
