Re: [RFC] can we use vmalloc to alloc thread stack if compaction failed

From: Joonsoo Kim
Date: Tue Aug 16 2016 - 00:13:11 EST


On Wed, Aug 10, 2016 at 04:59:39AM -0700, Andy Lutomirski wrote:
> On Sun, Jul 31, 2016 at 10:30 PM, Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> wrote:
> > On Fri, Jul 29, 2016 at 12:47:38PM -0700, Andy Lutomirski wrote:
> >> ---------- Forwarded message ----------
> >> From: "Joonsoo Kim" <iamjoonsoo.kim@xxxxxxx>
> >> Date: Jul 28, 2016 7:57 PM
> >> Subject: Re: [RFC] can we use vmalloc to alloc thread stack if compaction failed
> >> To: "Andy Lutomirski" <luto@xxxxxxxxxx>
> >> Cc: "Xishi Qiu" <qiuxishi@xxxxxxxxxx>, "Michal Hocko"
> >> <mhocko@xxxxxxxxxx>, "Tejun Heo" <tj@xxxxxxxxxx>, "Ingo Molnar"
> >> <mingo@xxxxxxxxxx>, "Peter Zijlstra" <peterz@xxxxxxxxxxxxx>, "LKML"
> >> <linux-kernel@xxxxxxxxxxxxxxx>, "Linux MM" <linux-mm@xxxxxxxxx>,
> >> "Yisheng Xie" <xieyisheng1@xxxxxxxxxx>
> >>
> >> > On Thu, Jul 28, 2016 at 08:07:51AM -0700, Andy Lutomirski wrote:
> >> > > On Thu, Jul 28, 2016 at 3:51 AM, Xishi Qiu <qiuxishi@xxxxxxxxxx> wrote:
> >> > > > On 2016/7/28 17:43, Michal Hocko wrote:
> >> > > >
> >> > > >> On Thu 28-07-16 16:45:06, Xishi Qiu wrote:
> >> > > >>> On 2016/7/28 15:58, Michal Hocko wrote:
> >> > > >>>
> >> > > >>>> On Thu 28-07-16 15:41:53, Xishi Qiu wrote:
> >> > > >>>>> On 2016/7/28 15:20, Michal Hocko wrote:
> >> > > >>>>>
> >> > > >>>>>> On Thu 28-07-16 15:08:26, Xishi Qiu wrote:
> >> > > >>>>>>> Usually THREAD_SIZE_ORDER is 2, which means we need to allocate 16KB of
> >> > > >>>>>>> physically contiguous memory when forking a new process.
> >> > > >>>>>>>
> >> > > >>>>>>> If the system has little memory, as on a smartphone with only 1GB,
> >> > > >>>>>>> free memory is scarce and compaction does not always succeed in the
> >> > > >>>>>>> slowpath (__alloc_pages_slowpath), so the thread stack allocation may
> >> > > >>>>>>> fail because of memory fragmentation.
> >> > > >>>>>>
> >> > > >>>>>> Well, with the current implementation of the page allocator those
> >> > > >>>>>> requests will not fail in most cases. The oom killer would be invoked in
> >> > > >>>>>> order to free up some memory.
> >> > > >>>>>>
> >> > > >>>>>
> >> > > >>>>> Hi Michal,
> >> > > >>>>>
> >> > > >>>>> Yes, it succeeds in most cases, but I have seen this problem in some
> >> > > >>>>> stress tests.
> >> > > >>>>>
> >> > > >>>>> DMA free:470628kB, but an order-2 allocation failed while forking a new
> >> > > >>>>> process. Memory is heavily fragmented, and under the stress test any large
> >> > > >>>>> block produced by compaction may soon be taken by others.
> >> > > >>>>>
> >> > > >>>>> --- dmesg messages ---
> >> > > >>>>> 07-13 08:41:51.341 <4>[309805.658142s][pid:1361,cpu5,sManagerService]sManagerService: page allocation failure: order:2, mode:0x2000d1
> >> > > >>>>
> >> > > >>>> Yes but this is __GFP_DMA allocation. I guess you have already reported
> >> > > >>>> this failure and you've been told that this is quite unexpected for the
> >> > > >>>> kernel stack allocation. It is your out-of-tree patch which just makes
> >> > > >>>> things worse because DMA restricted allocations are considered "lowmem"
> >> > > >>>> and so they do not invoke OOM killer and do not retry like regular
> >> > > >>>> GFP_KERNEL allocations.
> >> > > >>>
> >> > > >>> Hi Michal,
> >> > > >>>
> >> > > >>> Yes, we add GFP_DMA, but I don't think this is the key for the problem.
> >> > > >>
> >> > > >> You are restricting the allocation request to a single zone which is
> >> > > >> definitely not good. Look at how many larger order pages are available
> >> > > >> in the Normal zone.
> >> > > >>
> >> > > >>> If we invoke the OOM killer, maybe we will get a large block later, but
> >> > > >>> there is enough free memory before OOM (although most of it is fragmented).
> >> > > >>
> >> > > >> Killing a task is of course the last resort action. It would give you
> >> > > >> the larger order blocks used for the victim's thread.
> >> > > >>
> >> > > >>> I wonder if the allocation can succeed without killing any process in this situation.
> >> > > >>
> >> > > >> Sure it would be preferable to compact that memory but that might be
> >> > > >> hard with your restriction in place. Consider that DMA zone would tend
> >> > > >> to be less movable than normal zones as users would have to pin it for
> >> > > >> DMA. Your DMA is really large so this might turn out to just happen to
> >> > > >> work but note that the primary problem here is that you put a zone
> >> > > >> restriction for your allocations.
> >> > > >>
> >> > > >>> Maybe using vmalloc is a good way, but I don't know the impact.
> >> > > >>
> >> > > >> You can have a look at vmalloc patches posted by Andy. They are not that
> >> > > >> trivial.
> >> > > >>
> >> > > >
> >> > > > Hi Michal,
> >> > > >
> >> > > > Thank you for your comment, could you give me the link?
> >> > > >
> >> > >
> >> > > I've been keeping it mostly up to date in this branch:
> >> > >
> >> > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/vmap_stack
> >> > >
> >> > > It's currently out of sync due to a bunch of the patches being queued
> >> > > elsewhere for the merge window.
> >> >
> >> > Hello, Andy.
> >> >
> >> > I have some questions about it.
> >> >
> >> > IIUC, to turn on HAVE_ARCH_VMAP_STACK on a different architecture,
> >> > nothing needs to be done on the architecture side if the architecture
> >> > doesn't lazily fault in top-level paging entries for the vmalloc
> >> > area. Is my understanding correct?
> >> >
> >>
> >> There should be nothing fundamental that needs to be done. On the
> >> other hand, it might be good to make sure the arch code can print a
> >> clean stack trace on stack overflow.
> >>
> >> If it's helpful, I just pushed out anew
> >
> > You mean that you can turn on HAVE_ARCH_VMAP_STACK on the other arch? It
> > would be helpful. :)
> >
> >>
> >> > And, I'd like to know how you searched for problematic places that
> >> > use the kernel stack for DMA.
> >> >
> >>
> >> I did some searching for problematic sg_init_one calls using
> >> Coccinelle. I'm not very good at Coccinelle, so I may have missed
> >> something.
> >
> > I'm also not familiar with Coccinelle. Could you share your .cocci
> > script? I can think of the following one, but there may be a better way.
> >
> > virtual report
> >
> > @stack_var depends on report@
> > type T1;
> > expression E1, E2;
> > identifier I1;
> > @@
> > (
> > * T1 I1;
> > )
> > ...
> > (
> > * sg_init_one(E1, &I1, E2)
> > |
> > * sg_set_buf(E1, &I1, E2)
> > )
> >
> > @stack_arr depends on report@
> > type T1;
> > expression E1, E2, E3;
> > identifier I1;
> > @@
> > (
> > * T1 I1[E1];
> > )
> > ...
> > (
> > * sg_init_one(E2, I1, E3)
> > |
> > * sg_set_buf(E2, I1, E3)
> > )
> >
> >
>
> $ cat sgstack.cocci
> @@
> local idexpression S;
> expression A, B;
> @@
>
> (
> * sg_init_one(A, &S, B)
> |
> * virt_to_phys(&S)
> )
>
> not very inspiring. I barely understand Coccinelle syntax, and sadly
> I find the manual nearly incomprehensible. I can read the grammar,
> but that doesn't mean I know what the various declarations do.

Thanks for sharing it.
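
For a quick first pass before running the semantic patches above, a
text-only check for the same call shapes could look roughly like the
sketch below. This is only a heuristic and not a replacement for
Coccinelle: it matches on "&name" textually, so it cannot tell a stack
variable from a global, and it misses arrays passed without '&'. The
helper name find_suspects and the sample C snippet are made up for
illustration.

```python
import re

# Crude textual heuristic, NOT a semantic match like Coccinelle performs:
# it flags sg_init_one()/sg_set_buf() calls whose buffer argument is "&name",
# and virt_to_phys(&name), without checking that "name" is really a local.
SG_CALL = re.compile(r"\b(sg_init_one|sg_set_buf)\s*\(\s*[^,()]+,\s*&\s*(\w+)\s*,")
VIRT_CALL = re.compile(r"\b(virt_to_phys)\s*\(\s*&\s*(\w+)\s*\)")


def find_suspects(source):
    """Return (line_number, function, variable) for each suspicious call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern in (SG_CALL, VIRT_CALL):
            for m in pattern.finditer(line):
                hits.append((lineno, m.group(1), m.group(2)))
    return hits


# Hypothetical input, only to exercise the patterns.
example = """\
static int demo(struct scatterlist *sg)
{
        u8 key[16];
        struct foo req;

        sg_init_one(sg, &req, sizeof(req));
        sg_set_buf(sg, key, sizeof(key));
        return virt_to_phys(&req);
}
"""

for hit in find_suspects(example):
    print(hit)
```

The sg_set_buf() line in the sample is deliberately not reported, which
shows the gap that the array rule in the semantic patch is meant to cover.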

Thanks.