Re: [patch] CFS scheduler, v3

From: Ingo Molnar
Date: Sat Apr 21 2007 - 04:58:49 EST



* William Lee Irwin III <wli@xxxxxxxxxxxxxx> wrote:

> I suppose this is a special case of the dreaded priority inversion.
> What of, say, nice 19 tasks holding fs semaphores and/or mutexes that
> nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the
> default mutex implementation.

While I agree that it could be an issue, lock inversion is nothing
really new, so I wouldn't go _that_ drastic and convert all mutexes to
rtmutexes. (I've taken my -rt/PREEMPT_RT hat off.)

For example, reiser3-based systems get pretty laggy under significant
reniced load (even with the vanilla scheduler) if CONFIG_PREEMPT_BKL is
enabled: reiser3 holds the BKL for extended periods of time, so a "make
-j50" workload can starve it significantly, and the tty layer's BKL use
makes any sort of keyboard input (even over ssh) laggy.

Other locks, though, are not held nearly as frequently, and the mutex
implementation is pretty fair to waiters anyway. (The semaphore
implementation is not nearly as fair, and the Big Kernel Semaphore is
still struct semaphore based.) So I'd really wait for specific
workloads to trigger problems, and _maybe_ convert certain mutexes to
rtmutexes, on an as-needed basis.
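
Just to illustrate what such an as-needed conversion would look like,
here is a minimal sketch. (foo_lock and foo_do_work() are made-up names,
not existing kernel symbols; the point is only that the change is mostly
mechanical.)

  #include <linux/rtmutex.h>

  /* before: static DEFINE_MUTEX(foo_lock); */
  static DEFINE_RT_MUTEX(foo_lock);

  static void foo_do_work(void)
  {
          /*
           * rt_mutex_lock() priority-boosts the current lock owner to
           * the priority of its highest-priority waiter, so a reniced
           * holder does not keep a more important waiter blocked for
           * an extended period of time.
           */
          rt_mutex_lock(&foo_lock);

          /* ... critical section ... */

          rt_mutex_unlock(&foo_lock);
  }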

> > In any case, it is clear that rq->raw_cpu_load should be used instead
> > of rq->nr_running when calculating the fair clock, but I begin to like
> > the nice_offset solution too, in addition to this: it's effective in
> > practice and starvation-free in theory, and most importantly, it's
> > very simple. We could even make the nice offset granularity tunable,
> > just in case anyone wants to weaken (or strengthen) the effectiveness
> > of nice levels. What do you think, can you see any obvious (or less
> > obvious) showstoppers with this approach?
>
> ->nice_offset's semantics are not meaningful to the end user,
> regardless of whether it's effective. [...]

Yeah, agreed. That's one reason why I didn't make it tunable: it's
pretty meaningless to the user.

> [...] If there is something to be tuned, it should be relative shares
> of CPU bandwidth (load_weight) corresponding to each nice level or
> something else directly observable. The implementation could be
> ->nice_offset, if it suffices.
>
> Suppose a table of nice weights like the following is tuned via
> /proc/:
>
> nice  weight   nice  weight
>  -20      21      0  1
>   -1       2     19  0.0476
>
> Essentially 1/(n+1) when n >= 0 and 1 - n when n < 0.

Ok, thanks for thinking about it. I have changed the nice weight in
CFSv5-to-be so that it defaults to something pretty close to your
suggestion: the ratio between a nice 0 loop and a nice 19 loop is now
set to about 2%. (This is something users have requested for some time;
the default ~5% is a tad high when running reniced SETI jobs, etc.)
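
To put numbers on that: with two CPU hogs on a single CPU, one at nice 0
and one at nice 19, a ~2% weight ratio works out roughly like this
(quick user-space sketch; the 0.02 is the approximate new default, not
an exact constant from the code):

  #include <stdio.h>

  int main(void)
  {
          double w_nice0  = 1.00;         /* nice 0 reference weight */
          double w_nice19 = 0.02;         /* ~2% of a nice 0 task    */
          double total    = w_nice0 + w_nice19;

          /* two busy loops competing on one CPU */
          printf("nice  0: ~%.1f%% of the CPU\n", 100.0 * w_nice0  / total);
          printf("nice 19: ~%.1f%% of the CPU\n", 100.0 * w_nice19 / total);

          return 0;
  }

i.e. roughly 98%/2%, versus roughly 95%/5% with the old default.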

The actual percentage scales almost directly with the nice offset
granularity value, but if this should be exposed to users at all, I
agree that it would be better to expose it directly as some sort of
'ratio between nice 0 and nice 19 tasks', right? Or some other, more
fine-grained metric. Whole-percent units are too coarse, I think, and
0.1% units aren't intuitive enough either. The sysctl handler would then
transform that 'human readable' sysctl value into the appropriate
internal nice-offset-granularity value (or whatever mechanism the
implementation ends up using).
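
Roughly along these lines, say. (Every name below is made up, the
tunable's 0.1% unit is used purely for the sake of the example since the
unit question is exactly what is open above, and the linear mapping is
just a placeholder for whatever relation the implementation really ends
up having between the ratio and the granularity.)

  /*
   * Hypothetical tunable: desired nice 19 : nice 0 CPU ratio, in 0.1%
   * units.  E.g. 20 means a nice 19 hog gets ~2% of what a nice 0 hog
   * gets.
   */
  static unsigned int sysctl_sched_nice19_ratio = 20;

  /* made-up scale factor standing in for the real internal units */
  #define NICE_OFFSET_SCALE       1024

  static unsigned long nice_offset_granularity;

  /*
   * Called from the sysctl proc handler after the 'human readable'
   * value has been written: map the ratio onto the internal
   * nice-offset granularity.  The linear mapping only mirrors the
   * "scales almost directly" observation above; the real conversion
   * would be whatever the final implementation needs.
   */
  static void update_nice_offset_granularity(void)
  {
          nice_offset_granularity =
                  (unsigned long)sysctl_sched_nice19_ratio * NICE_OFFSET_SCALE;
  }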

I'd not do this as a per-nice-level thing but as a single value that
rescales the whole nice level range at once. That's a lot harder to
misconfigure, and we've got enough nice levels for users to pick from
almost arbitrarily, as long as they have the ability to influence the
maximum.

Does this sound mostly OK to you?

Ingo