Re: [RFC 2/2] x86_64: expand kernel stack to 16K

From: Linus Torvalds
Date: Wed May 28 2014 - 12:09:28 EST


On Tue, May 27, 2014 at 11:53 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> So, my stupid idea is just let's expand stack size and keep an eye
> toward stack consumption on each kernel functions via stacktrace of ftrace.

We probably have to do this at some point, but that point is not -rc7.

And quite frankly, from the backtrace, I can only say: there is some
bad shit there. The current VM stands out as a bloated pig:

> [ 1065.604404] kworker/-5766 0d..2 1071625991us : stack_trace_call: 0) 7696 16 lookup_address+0x28/0x30
> [ 1065.604404] kworker/-5766 0d..2 1071625991us : stack_trace_call: 1) 7680 16 _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-5766 0d..2 1071625991us : stack_trace_call: 2) 7664 24 __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-5766 0d..2 1071625991us : stack_trace_call: 3) 7640 392 kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-5766 0d..2 1071625992us : stack_trace_call: 4) 7248 256 get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-5766 0d..2 1071625992us : stack_trace_call: 5) 6992 352 __alloc_pages_nodemask+0x5e1/0xb20

> [ 1065.604404] kworker/-5766 0d..2 1071625995us : stack_trace_call: 23) 4672 160 __swap_writepage+0x150/0x230
> [ 1065.604404] kworker/-5766 0d..2 1071625996us : stack_trace_call: 24) 4512 32 swap_writepage+0x42/0x90
> [ 1065.604404] kworker/-5766 0d..2 1071625996us : stack_trace_call: 25) 4480 320 shrink_page_list+0x676/0xa80
> [ 1065.604404] kworker/-5766 0d..2 1071625996us : stack_trace_call: 26) 4160 208 shrink_inactive_list+0x262/0x4e0
> [ 1065.604404] kworker/-5766 0d..2 1071625996us : stack_trace_call: 27) 3952 304 shrink_lruvec+0x3e1/0x6a0
> [ 1065.604404] kworker/-5766 0d..2 1071625996us : stack_trace_call: 28) 3648 80 shrink_zone+0x3f/0x110
> [ 1065.604404] kworker/-5766 0d..2 1071625997us : stack_trace_call: 29) 3568 128 do_try_to_free_pages+0x156/0x4c0
> [ 1065.604404] kworker/-5766 0d..2 1071625997us : stack_trace_call: 30) 3440 208 try_to_free_pages+0xf7/0x1e0
> [ 1065.604404] kworker/-5766 0d..2 1071625997us : stack_trace_call: 31) 3232 352 __alloc_pages_nodemask+0x783/0xb20
> [ 1065.604404] kworker/-5766 0d..2 1071625997us : stack_trace_call: 32) 2880 8 alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-5766 0d..2 1071625997us : stack_trace_call: 33) 2872 200 __page_cache_alloc+0x13f/0x160

That __alloc_pages_nodemask() thing in particular looks bad. It
actually seems not to be the usual "let's just allocate some
structures on the stack" disease, it looks more like "lots of
inlining, horrible calling conventions, and lots of random stupid
variables".
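
For reference when reading the quoted trace: in the ftrace stack tracer
output the first number after the index is the depth (bytes of stack in
use at that frame) and the second is the size of that frame, so each
depth minus its size should equal the next depth. A minimal C sketch
that checks this for the first six quoted entries and totals their
frame sizes (the struct and function names are made up for
illustration, they are not kernel interfaces):

```c
#include <stddef.h>

/* One line of stack tracer output: depth and frame size, in bytes. */
struct frame {
	unsigned int depth;
	unsigned int size;
};

/* Entries 0..5 copied from the quoted trace above. */
static const struct frame top_frames[] = {
	{ 7696,  16 },	/* lookup_address */
	{ 7680,  16 },	/* _lookup_address_cpa.isra.3 */
	{ 7664,  24 },	/* __change_page_attr_set_clr */
	{ 7640, 392 },	/* kernel_map_pages */
	{ 7248, 256 },	/* get_page_from_freelist */
	{ 6992, 352 },	/* __alloc_pages_nodemask */
};

/* Sum of the frame sizes of the first n entries. */
unsigned int frames_total(const struct frame *f, size_t n)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += f[i].size;
	return sum;
}

/* Returns 1 if every depth minus its size equals the next depth. */
int frames_consistent(const struct frame *f, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++)
		if (f[i].depth - f[i].size != f[i + 1].depth)
			return 0;
	return 1;
}
```

For those six entries the sizes add up to 1056 bytes, and
kernel_map_pages, get_page_from_freelist and __alloc_pages_nodemask
account for 1000 of them.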

From a quick glance at the frame usage, some of it seems to be gcc
being rather bad at stack allocation, but lots of it is just nasty
spilling around the disgusting call-sites with tons of arguments. A
_lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
"spill frame pointer", afaik).

Avoiding some inlining, and using a single flag value rather than the
collection of "bool"s would probably help. But nothing really
trivially obvious stands out.
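
To make the "single flag value" idea concrete, here is a rough sketch,
not the actual allocator code. All names below (the AF_* bits,
slowpath_bools, slowpath_flags) are hypothetical. The point is only
that several bools passed down a deep, partly inlined call chain each
need their own argument register or spill slot, while one flags word
needs exactly one:

```c
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
enum alloc_flag {
	AF_MAY_WAIT    = 1u << 0,
	AF_HIGH_PRIO   = 1u << 1,
	AF_DID_RECLAIM = 1u << 2,
};

/* Before: every bool is a separate argument at each call-site. */
int slowpath_bools(bool may_wait, bool high_prio, bool did_reclaim)
{
	/* Counts how many of the conditions are set. */
	return (int)may_wait + (int)high_prio + (int)did_reclaim;
}

/* After: the same state travels as one word that is tested bit by bit. */
int slowpath_flags(unsigned int flags)
{
	int n = 0;

	if (flags & AF_MAY_WAIT)
		n++;
	if (flags & AF_HIGH_PRIO)
		n++;
	if (flags & AF_DID_RECLAIM)
		n++;
	return n;
}
```

A caller would then pass something like AF_MAY_WAIT | AF_DID_RECLAIM
instead of (true, false, true). Whether that actually shrinks the
frames would have to be checked against the generated code.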

But what *does* stand out (once again) is that we probably shouldn't
do swap-out in direct reclaim. This came up the last time we had stack
issues (XFS) too. I really do suspect that direct reclaim should only
do the kind of reclaim that does not need any IO at all.

I think we _do_ generally avoid IO in direct reclaim, but swap is
special. And not for a good reason, afaik. DaveC, remind me, I think
you said something about the swap case the last time this came up..

Linus