Re: [GIT PULL] scheduler fixes

From: Ingo Molnar
Date: Mon May 25 2009 - 01:16:51 EST



* Yinghai Lu <yinghai@xxxxxxxxxx> wrote:

> Ingo Molnar wrote:
> > * Yinghai Lu <yinghai@xxxxxxxxxx> wrote:
> >
> >> Pekka J Enberg wrote:
> >>> On Mon, 18 May 2009, Linus Torvalds wrote:
> >>>>>> I hate that stupid bootmem allocator. I suspect we seriously
> >>>>>> over-use it, and that we _should_ be able to do the SL*B init
> >>>>>> earlier.
> >>>>> Hm, tempting thought - not sure how to pull it off though.
> >>>> As far as I can recall, one of the things that historically made us want
> >>>> to use the bootmem allocator even relatively late was that the real SLAB
> >>>> allocator had to wait until all the node information etc was initialized.
> >>>>
> >>>> That's pretty damn late. And I wonder if SLUB (and SLOB) might not need a
> >>>> lot less initialization, and work much earlier. Something like that might
> >>>> be the final nail in the coffin for SLAB, and convince me to just say
> >>>> "we don't support it any more".
> >>> Ingo, here's a patch that boots UMA+SMP+SLUB x86-64 kernel on qemu all
> >>> the way to userspace. It probably breaks bunch of things for now but
> >>> something for you to play with if you want.
> >>>
> >> updated with tip/master. also add change to cpupri_init
> >> otherwise will get
> >> [ 0.000000] Memory: 523096612k/537526272k available (10461k kernel code, 656156k absent, 13773504k reserved, 7186k data, 2548k init)
> >> [ 0.000000] SLUB: Genslabs=14, HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=8
> >> [ 0.000000] ------------[ cut here ]------------
> >> [ 0.000000] WARNING: at kernel/lockdep.c:2282 lockdep_trace_alloc+0xaf/0xee()
> >> [ 0.000000] Hardware name: Sun Fire X4600 M2
> >> [ 0.000000] Modules linked in:
> >> [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-rc6-tip-01778-g0afdd0f-dirty #259
> >> [ 0.000000] Call Trace:
> >> [ 0.000000] [<ffffffff810a0274>] ? lockdep_trace_alloc+0xaf/0xee
> >> [ 0.000000] [<ffffffff81075ab0>] warn_slowpath_common+0x88/0xcb
> >> [ 0.000000] [<ffffffff81075b15>] warn_slowpath_null+0x22/0x38
> >> [ 0.000000] [<ffffffff810a0274>] lockdep_trace_alloc+0xaf/0xee
> >> [ 0.000000] [<ffffffff8110301b>] kmem_cache_alloc_node+0x38/0x14d
> >> [ 0.000000] [<ffffffff813ec548>] ? alloc_cpumask_var_node+0x4a/0x10a
> >> [ 0.000000] [<ffffffff8109eb61>] ? lockdep_init_map+0xb9/0x564
> >> [ 0.000000] [<ffffffff813ec548>] alloc_cpumask_var_node+0x4a/0x10a
> >> [ 0.000000] [<ffffffff813ec62c>] alloc_cpumask_var+0x24/0x3a
> >> [ 0.000000] [<ffffffff819e6306>] cpupri_init+0x7f/0x112
> >> [ 0.000000] [<ffffffff819e5a30>] init_rootdomain+0x72/0xb7
> >> [ 0.000000] [<ffffffff821facce>] sched_init+0x109/0x660
> >> [ 0.000000] [<ffffffff82203082>] ? kmem_cache_init+0x193/0x1b2
> >> [ 0.000000] [<ffffffff821dfd7a>] start_kernel+0x218/0x3f3
> >> [ 0.000000] [<ffffffff821df2a9>] x86_64_start_reservations+0xb9/0xd4
> >> [ 0.000000] [<ffffffff821df3b2>] x86_64_start_kernel+0xee/0x109
> >> [ 0.000000] ---[ end trace a7919e7f17c0a725 ]---
> >>
> >> works with 8 sockets numa amd64 box.
> >>
> >> YH
> >>
> >> ---
> >> init/main.c | 28 ++++++++++++++++------------
> >> kernel/irq/handle.c | 23 ++++++++---------------
> >> kernel/sched.c | 34 +++++++++++++---------------------
> >> kernel/sched_cpupri.c | 9 ++++++---
> >> mm/slub.c | 17 ++++++++++-------
> >> 5 files changed, 53 insertions(+), 58 deletions(-)
> >
> > Very nice!
> >
> > Would it be possible to restructure things to move kmalloc init to
> > before IRQ init as well? We have a couple of uglinesses there too.
> >
> > Conceptually, memory should be the first thing set up in general, in
> > a kernel. It does not need IRQs, timers, the scheduler or any of the
> > IO facilities and abstractions. All of them need memory though - and
> > as Linux scales to more and more hardware via the same single image,
> > so will we get more and more dynamic concepts like cpumask_var_t and
> > sparse-irqs, which want to allocate very early.
>
> Pekka's patch already makes kmalloc available before early_irq_init()/init_IRQ...
>
> we can clean up alloc_desc_masks and
> alloc_cpumask_var_node could be much simplified too.

That's nice!

Ok, i think this all looks pretty realistic - but there's quite a
bit of layering on top of pending changes in the x86 and irq trees.
We could do this on top of those topic branches in -tip, and rebase
in the merge window. Or delay it to .32.

... plus i think we are _very_ close to being able to remove all of
bootmem on x86 (with some compatibility/migration mechanism in
place). Which bootmem calls do we have before kmalloc init with
Pekka's patch applied? I think it's mostly the page table init code.

( beyond the page allocator internal use - where we could use
straight e820 based APIs that clip memory off from the beginning
of existing e820 RAM ranges - enriched with NUMA/SRAT locality
info. )
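
To make the "clip memory off from the beginning" idea concrete, here
is a rough, untested-in-kernel sketch in plain C. Everything in it is
hypothetical: struct early_range, early_clip_alloc() and
EARLY_ANY_NODE are made-up names for illustration, not existing e820
interfaces, and the table is assumed to already hold only usable RAM
ranges tagged with a node id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an e820 RAM entry tagged with a NUMA node. */
struct early_range {
	uint64_t start;	/* first free byte of the range */
	uint64_t size;	/* bytes still free in the range */
	int node;	/* node id (e.g. derived from SRAT) */
};

#define EARLY_ANY_NODE (-1)

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t early_align_up(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Clip 'size' bytes off the front of the first range that is on 'node'
 * (or on any node, for EARLY_ANY_NODE) and is large enough.
 *
 * Returns the start address of the clipped block, or 0 on failure.
 * Assumption: address 0 is never part of a usable range. Any bytes
 * skipped for alignment are simply given up.
 */
static uint64_t early_clip_alloc(struct early_range *map, size_t nr,
				 uint64_t size, uint64_t align, int node)
{
	for (size_t i = 0; i < nr; i++) {
		struct early_range *r = &map[i];
		uint64_t end = r->start + r->size;
		uint64_t addr;

		if (node != EARLY_ANY_NODE && r->node != node)
			continue;

		addr = early_align_up(r->start, align);
		if (addr > end || end - addr < size)
			continue;

		/* Shrink the range so this block is never handed out again. */
		r->start = addr + size;
		r->size = end - r->start;
		return addr;
	}
	return 0;
}
```

The only point of the sketch is that such an allocator needs no state
beyond the range table itself, which is why it could run before
anything else is set up. There is no free operation; early users would
be expected to keep what they clip.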

Ingo