Re: linux kernel scheduler

From: Felipe Alfaro Solana
Date: Wed Dec 24 2003 - 20:19:28 EST


I've found the following document from Con Kolivas.
Maybe it can be of some help to you...
--- Begin Message ---
This is a description of the interactivity patches for 2.5/6 that started with
O1int.

Background:
A long discussion has centred for some time on just what interactivity is. I
did not set out to define interactivity and then modify the scheduler policy
to match that definition. Instead what I have concentrated on is improving
the overall feel of using 2.6 on a desktop (gui or non). The area I
concentrated on was tasks for which sufficient delay or jitter between the
time they wake up and the time they are scheduled makes a noticeable
difference. This shows up palpably in the form of audio skips during playback
or jerky mouse movement. The effect of increasing loads on the system, and
maintaining fairness under them, must also be addressed when tackling these.

Introduction:
How the O(1) scheduler deals with "interactive" tasks.

The O(1) cpu scheduler designed by Ingo Molnar and merged into early 2.5
development was adopted for its substantial improvement in scalability. The
O(1) nature of the new scheduler meant the overhead of scheduling an
increasing number of tasks on an increasing number of cpus was negligible.

Briefly, the scheduler maintains a runqueue for each cpu, split into an
active and an expired array. How it works is thus: when a task is first
woken up, it is added to the tail end of the active array, to be scheduled
after all tasks of higher priority on that same array and any tasks of the
same priority already on the runqueue. If a task goes to sleep it is removed
from the runqueue. If, however, it uses up a full timeslice, the scheduler
has to decide whether to put it on the end of the active array to be
rescheduled again, or to put it onto the expired array. Tasks on the active
array are run round robin, first come first served within each priority,
from highest priority to lowest. Tasks on the expired array are not run
unless the active array is emptied (all tasks go to sleep or are expired) or
some predetermined starvation limit is used up.
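
As a rough illustration of the two-array mechanism, here is a small
self-contained C model (the structure names, queue sizes and linear scan are
made up for the example; the real scheduler uses a priority bitmap to stay
O(1)):

/*
 * Toy model of the O(1) active/expired array mechanism.  Not kernel code.
 */
#define NPRIO   140            /* priority levels, 0 = highest priority */
#define MAXTASK 8              /* per-priority FIFO depth in this toy   */

struct prio_array {
    int count[NPRIO];          /* tasks queued at each priority level   */
    int task[NPRIO][MAXTASK];  /* FIFO of task ids per priority level   */
};

struct runqueue {
    struct prio_array arrays[2];
    struct prio_array *active, *expired;
};

/* A woken task goes to the tail of its priority level on the active array. */
static void enqueue(struct prio_array *a, int prio, int id)
{
    if (a->count[prio] < MAXTASK)
        a->task[prio][a->count[prio]++] = id;
}

/* Pick the first task of the highest occupied priority.  If the active
 * array is empty, swap it with the expired array and look again. */
static int pick_next(struct runqueue *rq)
{
    for (int pass = 0; pass < 2; pass++) {
        for (int p = 0; p < NPRIO; p++)
            if (rq->active->count[p])
                return rq->active->task[p][0];
        struct prio_array *tmp = rq->active;   /* active empty: swap arrays */
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    return -1;                                 /* nothing runnable          */
}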

The interactivity estimator in the 2.5 scheduler is designed to find which
tasks are interactive versus those that are batch (pure cpu hogs). It works
on the premise that batch tasks never sleep but use up all the cpu time
offered to them, whereas interactive tasks occasionally sleep. A sleep_avg is
stored for each task: on every jiffy tick (1 millisecond in 2.5) the task
earns a sleep_avg point if it is sleeping, or loses a sleep_avg point while
it is running. Tasks with a high sleep_avg are considered interactive and
those with a low sleep_avg are considered cpu hogs.
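
In outline, with illustrative constants rather than the real kernel values,
the per-tick bookkeeping amounts to:

/* Simplified per-tick sleep_avg accounting (illustrative constants). */
#define MAX_SLEEP_AVG 10000    /* e.g. 10 seconds' worth of 1ms ticks */

struct task {
    unsigned int sleep_avg;    /* 0 .. MAX_SLEEP_AVG */
    int sleeping;              /* non-zero while blocked */
};

/* Called once per jiffy (1ms in 2.5) for the task in question. */
static void sleep_avg_tick(struct task *t)
{
    if (t->sleeping) {
        if (t->sleep_avg < MAX_SLEEP_AVG)
            t->sleep_avg++;            /* earn a point while sleeping */
    } else {
        if (t->sleep_avg > 0)
            t->sleep_avg--;            /* burn a point while running  */
    }
}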

The scheduler takes the information from the interactivity estimator and then
assigns a dynamic priority to the task. The variation from the static
priority depends on the value of PRIORITY_BONUS, set to 25%, meaning a task's
priority can change +/- 5 from its static priority. Interactive tasks get a 5
bonus and cpu hogs get a 5 penalty.

If a task uses up its full timeslice, the scheduler can choose not to expire
it if it is still TASK_INTERACTIVE, which works out to a priority bonus of
about 3 or more for nice 0 tasks.
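
Roughly, the mapping from sleep_avg to dynamic priority and the interactive
test look like the following sketch (an approximation of the idea, not the
exact kernel macros; lower priority numbers are better):

/* Approximate model of the dynamic priority bonus.  sleep_avg is scaled
 * into a bonus of 0..10, centred so that the dynamic priority moves at
 * most +/-5 from the static priority. */
#define MAX_SLEEP_AVG 10000
#define MAX_BONUS     10

static int current_bonus(unsigned int sleep_avg)
{
    return sleep_avg * MAX_BONUS / MAX_SLEEP_AVG;      /* 0 .. 10 */
}

static int effective_prio(int static_prio, unsigned int sleep_avg)
{
    /* bonus 10 -> -5 (better), bonus 0 -> +5 (worse) */
    return static_prio - (current_bonus(sleep_avg) - MAX_BONUS / 2);
}

/* A task that used its whole timeslice goes back on the active array only
 * if it still looks interactive: roughly a bonus of 3 or more above the
 * midpoint for a nice 0 task. */
static int task_interactive(unsigned int sleep_avg)
{
    return current_bonus(sleep_avg) - MAX_BONUS / 2 >= 3;
}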

While this simple and intuitive method works quite well, it does have some
intrinsic limitations.

Problems:

1. How much sleep avg needs to be accumulated before a task is interactive
(MAX_SLEEP_AVG [MSA from here on]).
Decrease the value and it takes only a short time for a task to accumulate
enough sleep avg to be considered interactive, and also only a short time to
burn off enough sleep avg to be seen as a cpu hog. The problem is that most
real world interactive applications, while they do sleep frequently, also
use bursts of cpu at times. These bursts will rapidly drop the interactive
status of a task if the MSA is low, and these tasks will suddenly get low
priority and be scheduled with batch tasks or, worse, expire. The best
example of this is X, which is interactive but not infrequently uses large
bursts of cpu. A low MSA will make dragging a window smooth under no load,
but as load increases, dragging a window across the screen will stop and
jerk, and the mouse will become unresponsive because X expires. The other
problem with low MSAs is that most tasks, however interactive they are,
start out using a lot of cpu. This means they will be seen as cpu hogs
during startup, which translates into very slow application startup under
load.
Increasing the value of MSA is a way to tackle the rapidly decaying
interactive status and slow startups, but brings a different set of problems
with it. In 2.6.0-test4, the default for MSA is 10 seconds. This means a task
needs to have slept for ten seconds before it is considered maximally
interactive. Because tasks are constantly waking up, burning cpu and then
going back to sleep, an MSA of 10 seconds means it can take up to a minute
before the sleep_avg accurately represents the task's interactive status (an
exponential function). This is seen as audio applications skipping like mad
under any load until the music has been playing for over a minute, and each
time a new song loads it starts all over again.

2. The idle task that turns into a cpu hog.
Tasks that idle a lot (shells etc) can suddenly turn into cpu hogs very
easily (fire off "while true ; do a=1 ; done"). Idle tasks are basically
sleeping all the time so they get maximum interactivity, and with a high MSA,
if an idle task becomes a cpu hog it can preempt almost everything until it
gets flagged as non-interactive (which takes 3 seconds at an MSA of 10
seconds) and gets expired.

3. The cpu hog that sleeps while waiting on disk i/o.
Batch tasks (like compilers) never sleep of their own accord. However they
are often forced to wait on disk i/o, and this registers as sleep. This
artificially raises their interactive status, and the more of these cpu hogs
there are waiting on i/o, the worse the scenario gets (witness make
-j $largenumber). These tasks then swamp the cpu time.

4. Interactive tasks forking off cpu hogs.
A forked process inherits the sleep_avg of its parent, penalised by the
tunable CHILD_PENALTY. The default is 95, meaning it inherits 95% of the
sleep avg of its parent. This is meant to stifle the ability of maximally
interactive tasks to fork other maximally interactive tasks. The problem
arises when an interactive task forks off a cpu hog: the cpu hog should not
really start at a high interactive level, even if it drops later.
Furthermore, the parent usually sleeps after it forks off a cpu hog,
elevating its own priority again. This makes for a parent that can repeatedly
fire off almost fully interactive children (make -j $largenumber). Decreasing
the child penalty helps, but incurs a fork startup slowdown. Applying a
PARENT_PENALTY helps, but this hurts the case where the parent is firing off
lots of interactive tasks (eg a web server).

5. Intra jiffy wakeups.
The sleep avg of a task is incremented by the amount of time, in jiffies,
spent sleeping since it last went to sleep, and is decremented on each timer
interrupt tick (1ms on i386). As hardware gets faster, many tasks may wake
up, use cpu time and go back to sleep before a single jiffy has passed. If
they wake often enough, or sleep for short enough periods, the resolution of
the sleep avg will completely misjudge them.

6. Scheduling latency.
Two problems exist in this area.
First, the time spent on the runqueue is not credited as sleep time, and if
the runqueue is busy this can be a not-insignificant duration. This makes for
a worsening elevation of priority under load.
Second, nice 0 tasks get 102ms of task_timeslice. If a task wakes up while
this first task is running and is not higher priority than it, it will wait
for the previous task to finish its timeslice. Most audio apps wake up around
every 50ms to do their work, which is often over in less than 2ms, but if
they have to wait long enough for other same-priority tasks to finish, it
will eventually lead to audio dropouts. Also, some tasks need to run very
frequently to minimise perceptible jitter (moving the mouse in X).

7. Priority inversion.
An interactive task that wakes up a cpu hog to get it to do work normally
sleeps while waiting for the cpu hog to respond. A poorly written application
can instead continually poll for a response from the cpu hog, and if it is
higher priority than the cpu hog, it will preempt it. A spiral can ensue of
the interactive task waking up looking for info, preempting the cpu hog and
not allowing the cpu hog to get any cpu time. Eventually the interactive task
is seen to be wasting cpu, loses its priority bonus to the point where it is
of equal priority with the cpu hog, and finally the cpu hog gets scheduled.
Wine/wineserver cpu intensive games seem to exhibit this, as does v2.28 of
blender, and kmail reading a pgp signed email.

While this is by no means an exhaustive list, it does describe some of the
more common issues.


Current work:

Mike Galbraith did a lot of the excellent earlier work on tackling the issue
of sleep avg resolution by using high resolution timers and nanosecond
resolution instead of milliseconds. Initially my efforts at interactivity
improvement did not use this work, but Ingo Molnar wrote some nice
infrastructure he considered essential, which I then worked with.

Ingo's patch tackled the sleep_avg resolution problem, and added the
on-runqueue bonus calculation. He went further to discriminate between
on-runqueue tasks woken up by interrupts and those woken by other tasks,
giving more bonus to the interrupt-woken tasks and weighting down the other
on-runqueue tasks. Tasks woken from interrupts tend to be interactive ones.
The other major add-on was the introduction of TIMESLICE_GRANULARITY [TG from
here on]. This would prevent a task from using up its full timeslice in one
go if there were other tasks of the same priority on the active array.
Basically, after TG passed (initially set to 50ms and later changed to 25ms)
the task would be rescheduled with its remaining timeslice at the end of the
queue of tasks at the same priority. Thus much of problems 5 and 6 above was
already accounted for.
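
The TG behaviour can be modelled with a small sketch of a tick-time check
(hypothetical names, not Ingo's code):

/* Toy model of TIMESLICE_GRANULARITY requeuing (hypothetical names). */
#define TIMESLICE_GRANULARITY 25   /* ms; was initially 50ms */

struct task {
    int time_slice;                /* ms left in the current timeslice */
    int ran_this_round;            /* ms run since last (re)queue      */
};

/* Called from the timer tick for the currently running task.  Returns
 * non-zero when the task should be rotated to the back of its priority
 * queue with its remaining timeslice intact. */
static int should_requeue(struct task *t, int others_at_same_prio)
{
    if (t->time_slice <= 0)
        return 0;                          /* timeslice expiry handles this */
    if (!others_at_same_prio)
        return 0;                          /* nobody to share with          */
    if (t->ran_this_round < TIMESLICE_GRANULARITY)
        return 0;
    t->ran_this_round = 0;
    return 1;                              /* round robin within the level  */
}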


Approach:

What I set out to do was use the current infrastructure and not change any of
the basic workings, so as not to change the overall performance of the
scheduler. Instead I focused on using the information presented to the
interactivity estimator to choose dynamic priority more carefully. Originally
the patches were meant to be the O(1) interactivity patches, abbreviated to
O1int, which were jokingly referred to as the orthogonal patches, and the
name stuck. I'll concentrate on the latest evolution (O18.1) and bypass
discussion of bugs and failed approaches.


The MSA in my patches is much smaller (10 average timeslices, or ~1 second),
allowing tasks to rapidly change interactive status and allowing cpu hogs to
be identified sooner. Other strategies were put into place to blunt the
disadvantages of this smaller MSA.

Instead of the sleep avg being a simple counter going up and down, I use the
current status of a task to determine how much sleep avg to add or subtract,
in a sort of soft feedback loop. This introduces some autoregulation into the
interactivity estimator. What it does is determine the current bonus based on
the sleep_avg to date (CURRENT_BONUS) and use that to decide how much to
weight the sleep_avg credited to a task. As tasks drop priority they get
weighted proportionately more, so as to recover rapidly from a low
interactive state, and to allow new tasks to gain sleep_avg quickly after
their initial burn.
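
As a sketch of this feedback (illustrative arithmetic only, not the exact
code in the patch), the sleep credited could be weighted by how far the task
has fallen from maximum bonus:

/* Illustrative model of CURRENT_BONUS-weighted sleep_avg accrual.  The
 * lower a task's current bonus, the more heavily each unit of sleep is
 * weighted, so tasks recover quickly from a cpu burst. */
#define MAX_SLEEP_AVG 1000         /* ~1 second in these patches */
#define MAX_BONUS     10

static int current_bonus(unsigned int sleep_avg)
{
    return sleep_avg * MAX_BONUS / MAX_SLEEP_AVG;
}

static void credit_sleep(unsigned int *sleep_avg, unsigned int slept)
{
    /* weight 1x for a fully interactive task, up to ~10x near zero bonus */
    unsigned int weight = MAX_BONUS - current_bonus(*sleep_avg) + 1;

    *sleep_avg += slept * weight;
    if (*sleep_avg > MAX_SLEEP_AVG)
        *sleep_avg = MAX_SLEEP_AVG;
}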

A flag was introduced (interactive_credit) to detect truly interactive tasks
so that they wouldn't be penalised too much during their bursts of cpu
activity. If they reached the MSA and were still accumulating sleep_avg, they
would start earning interactive credits. Once they were beyond a cutoff (also
the MSA) they were flagged as HIGH_CREDIT, which then affected their
sleep_avg burn. This burn was weighted in the reverse direction to the
gaining of sleep time, so they slid down an ever steepening slope as they
behaved more like cpu hogs.

More subtle but helpful was minimising the amount of sleep_avg a LOW_CREDIT
task could obtain with each wakeup to just one priority bonus, to prevent
rapid swings back towards interactive once a task was a proven cpu hog.
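
Put together, the credit mechanism behaves something like the following
sketch (made-up helper names and thresholds, modelling the behaviour
described above rather than the patch itself):

/* Sketch of the interactive_credit mechanism (hypothetical names/values). */
#define MAX_SLEEP_AVG 1000     /* ~1 second in these patches */
#define CREDIT_LIMIT  100      /* illustrative cutoff        */

struct task {
    unsigned int sleep_avg;
    int interactive_credit;
};

#define HIGH_CREDIT(t) ((t)->interactive_credit >  CREDIT_LIMIT)
#define LOW_CREDIT(t)  ((t)->interactive_credit < -CREDIT_LIMIT)

static void on_wakeup(struct task *t, unsigned int slept)
{
    if (LOW_CREDIT(t)) {
        /* A proven cpu hog gains at most one priority bonus worth of
         * sleep_avg per wakeup, so it cannot swing back to interactive. */
        unsigned int cap = MAX_SLEEP_AVG / 10;
        if (slept > cap)
            slept = cap;
    }
    t->sleep_avg += slept;
    if (t->sleep_avg >= MAX_SLEEP_AVG) {
        t->sleep_avg = MAX_SLEEP_AVG;
        /* Still accumulating sleep at the ceiling: earn interactive credit. */
        t->interactive_credit++;
    }
}

static void on_run(struct task *t, unsigned int ran)
{
    /* HIGH_CREDIT tasks burn sleep_avg more gently during cpu bursts; in
     * the patch the burn is also weighted so it accelerates as the task's
     * bonus drops. */
    unsigned int burn = HIGH_CREDIT(t) ? ran / 2 : ran;

    t->sleep_avg = burn >= t->sleep_avg ? 0 : t->sleep_avg - burn;
    if (t->sleep_avg == 0)
        t->interactive_credit--;   /* behaving like a cpu hog: lose credit */
}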


Idle detection was implemented, whereby if a task had slept for a
predetermined time it would have a ceiling imposed on the amount of sleep avg
it got, set to a value that ensured it would just be considered interactive,
so it wouldn't expire easily. Kernel threads were excluded, as they are often
idle but need cpu time when they wake up.

I/O wait by cpu hogs was difficult to detect directly (there is not enough
information in the kernel), but it can be detected indirectly: most local
disk i/o is associated with a special kind of sleep (uninterruptible sleep,
seen as a process in state D). Tasks flagged as waking from this kind of
sleep that weren't HIGH_CREDIT were limited in their sleep_avg rise to just
interactive. This prevents cpu hogs from gaining maximum interactive status
during disk i/o and prevents them from being needlessly expired after disk
i/o.
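
Both the idle ceiling and the uninterruptible-sleep limit amount to capping
the sleep_avg a wakeup may credit, roughly along these lines (hypothetical
names and thresholds):

/* Sketch of capping sleep_avg on wakeup (hypothetical names/thresholds). */
#define MAX_SLEEP_AVG        1000
#define INTERACTIVE_LIMIT    700   /* just enough to count as interactive */
#define IDLE_SLEEP_THRESHOLD 2000  /* "slept a long time", in ms          */

struct task {
    unsigned int sleep_avg;
    int was_uninterruptible;       /* woke from D state (disk wait)       */
    int is_kernel_thread;
    int high_credit;
};

static void wakeup_credit(struct task *t, unsigned int slept)
{
    unsigned int ceiling = MAX_SLEEP_AVG;

    /* Idle detection: user tasks that slept a very long time are only
     * lifted to "just interactive", so they cannot preempt everything if
     * they suddenly turn into cpu hogs.  Kernel threads are exempt. */
    if (slept > IDLE_SLEEP_THRESHOLD && !t->is_kernel_thread)
        ceiling = INTERACTIVE_LIMIT;

    /* Disk wait: sleep in D state by a task that has not proven itself
     * interactive is likewise limited to "just interactive". */
    if (t->was_uninterruptible && !t->high_credit)
        ceiling = INTERACTIVE_LIMIT;

    t->sleep_avg += slept;
    if (t->sleep_avg > ceiling)
        t->sleep_avg = ceiling;
}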


Forking was handled in my patch by keeping the sleep_avg at a level that
would not make forked children or their parents lose any more priority bonus
than CHILD_PENALTY and PARENT_PENALTY would impose. However, it would round
their sleep_avg down to just the amount necessary for the same bonus. This
meant they remained at the same priority but could easily drop priority if
they then proved to be cpu hogs.
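
The fork handling reduces to rounding sleep_avg down to the bottom of its
current bonus band, something like this illustrative arithmetic:

/* Sketch: round a forked task's sleep_avg down to the bottom of its
 * current bonus band, so the priority bonus is preserved but easily lost
 * if the child (or parent) turns out to be a cpu hog. */
#define MAX_SLEEP_AVG 1000
#define MAX_BONUS     10

static unsigned int fork_sleep_avg(unsigned int sleep_avg)
{
    unsigned int band = MAX_SLEEP_AVG / MAX_BONUS;     /* one bonus step */

    return (sleep_avg / band) * band;                  /* keep same bonus */
}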


Priority inversion is something that proved difficult to tackle. While a bug
in earlier incarnations of the orthogonal patches intensified the problem,
the generic solution in my patches now is that tasks that turn into cpu hogs
become less interactive at an accelerated rate the lower their interactivity
gets. This significantly reduces the time the inversion exists but doesn't
cure the problem. Correcting the bugs in the application's design is needed
for that. However, it is only the dependent task that gets starved during
priority inversion, as the longest a single nice 0 fully interactive task can
hold the cpu over another similarly interactive task with my patches is 27ms.
Preventing inversion from occurring at all would require more infrastructure.


Other generic changes in my patches were added for throughput purposes.
The requeuing of tasks at TIMESLICE_GRANULARITY was limited to user tasks
that were interactive and had at least MIN_TIMESLICE (10ms) remaining. This
meant that batch tasks (non-interactive) were allowed to run out their full
timeslice before another task of the same priority got scheduled. It also
meant that a task was not rescheduled with such a small timeslice remaining
that it induced cache thrashing. The default timeslice of nice 0 tasks is
102ms, which meant that if they were requeued every 25ms, they would
otherwise be requeued once with only 2ms remaining. Preemption was originally
allowed if a task was of the same priority but had more than twice the
remaining timeslice of the process currently being scheduled. This is
unnecessary with TG in action and simply added to the number of context
switches, so it was removed.
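
The restricted requeue test described above then amounts to something like
this (hypothetical names, building on the earlier TG sketch):

/* Sketch of the restricted requeue test: only user tasks that are
 * currently interactive and still have at least MIN_TIMESLICE left are
 * rotated at TIMESLICE_GRANULARITY. */
#define TIMESLICE_GRANULARITY 25   /* ms */
#define MIN_TIMESLICE         10   /* ms */

struct task {
    int time_slice;                /* ms remaining             */
    int ran_this_round;            /* ms since last (re)queue  */
    int interactive;               /* current dynamic state    */
    int is_kernel_thread;
};

static int should_requeue(const struct task *t, int others_at_same_prio)
{
    return others_at_same_prio &&
           !t->is_kernel_thread &&
           t->interactive &&
           t->time_slice >= MIN_TIMESLICE &&
           t->ran_this_round >= TIMESLICE_GRANULARITY;
}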

TG set to 25ms puts us close to the mouse movement behaviour of desktop
operating systems, if X gets rescheduled with that as the largest wait period
(it usually is less). Most versions of Windows update their mouse cursor at
50Hz, and earlier versions of MacOS at 200Hz. I'm not sure about recent
versions.



From the testing I've done to date, it appears this approach works well in
most settings and is an improvement in almost every palpable way over the
default scheduler. The remaining issue is that application startup will be
slower than with the vanilla scheduler (your mileage may vary) under load,
but I feel not unacceptably so as a tradeoff. No further development is
planned on these patches (apart from perhaps cosmetic cleanups) unless other
issues arise.


Thanks to the many contributors of ideas and previous patches, and to the
numerous testers out there. I hope that even if an alternative approach is
used in the final 2.6 kernel, my contributions and explanations have been
helpful.


Con Kolivas
25 August 2003


--- End Message ---