Re: [announce] [patch] ultra-scalable O(1) SMP and UP scheduler

From: Ingo Molnar (mingo@elte.hu)
Date: Fri Jan 04 2002 - 07:53:14 EST


On Fri, 4 Jan 2002, Anton Blanchard wrote:

> Good timing :) We were just looking at an application that hit
> sched_yield heavily on a large SMP machine (dont have the source so
> fixing the symptoms not the cause). Performance was pretty bad with
> the standard scheduler.

hey, great. I mean, what a pity :-| But in any case, sched_yield() is just
a tiny portion of the scheduler spectrum, and it's certainly not the most
important one.

> We managed to get a 4 way ppc64 machine to boot, but unfortunately the
> 18 way hung somewhere after smp initialisation and before execing
> init. More testing to come :)

could this be the child-runs-first problem perhaps? You can disable it by
exchanging wake_up_forked_process() for wake_up_process() in fork.c, and
removing the current->need_resched = 1 line.

> Is the idle loop optimisation (atomic exchange -1 into need_resched to
> avoid IPI) gone forever?

it's just broken temporarily, i will fix it. The two places the need to
set need_resched is the timer interrupt (that triggers periodic
load_balance()) and the wakeup code, i'll fix this in the next patch.

> Is my understanding of this right?:
>
> #define BITMAP_SIZE (MAX_RT_PRIO/8+1)
>
> ...
>
> char bitmap[BITMAP_SIZE+1];
> list_t queue[MAX_RT_PRIO];
>
> You have an bitmap of size MAX_RT_PRIO+1 (actually I think you have
> one too many +1 here :) [...]

[ yes :-) paranoia. will fix it. ]

> [...] and you zero the last bit to terminate it. You
> use find_first_zero_bit to search the entire priority list and
> sched_find_first_zero_bit to search only the first MAX_PRIO (64) bits.

correct.

the reason for this logic inversion is temporary as well: we'll have to
add find_next_set_bit before doing the more intuitive thing of setting the
bit when the runlist is active. Right now sched_find_first_zero_bit() has
to invert the value (on x86) again to get it right for BSFL.

> > - the SMP idle-task startup code was still racy and the new scheduler
> > triggered this. So i streamlined the idle-setup code a bit. We do not call
> > into schedule() before all processors have started up fully and all idle
> > threads are in place.
>
> I like this cleanup, it pushes more stuff out the arch specific code
> into init_idle().

the new rules are this: no schedule() must be called before all bits in
wait_init_idle are clear. I'd suggest for you to add this to the top of
schedule():

        if (wait_init_idle)
                BUG();

to debug the SMP startup code.

the other new property is that the init thread wakes itself up and then
later on becomes an idle thread just like the other idle threads. This
makes the idle startup code more symmetric, and the scheduler more
debuggable.

        Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon Jan 07 2002 - 21:00:24 EST