Re: [REPORT] cfs-v4 vs sd-0.44

From: Ingo Molnar
Date: Mon Apr 23 2007 - 16:34:19 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> > The "give scheduler money" transaction can be both an "implicit
> > transaction" (for example when writing to UNIX domain sockets or
> > blocking on a pipe, etc.), or it could be an "explicit transaction":
> > sched_yield_to(). This latter i've already implemented for CFS, but
> > it's much less useful than the really significant implicit ones, the
> > ones which will help X.
>
> Yes. It would be wonderful to get it working automatically, so please
> say something about the implementation..

i agree that the devil will be in the details, but so far it's really
simple. I'll put all this into separate helper functions so that places
can just use it in a natural way. The existing yield-to bit is this:

static void
yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
{
struct rb_node *curr, *next, *first;
struct task_struct *p_next;

/*
* yield-to support: if we are on the same runqueue then
* give half of our wait_runtime (if it's positive) to the other task:
*/
if (p_to && p->wait_runtime > 0) {
p->wait_runtime >>= 1;
p_to->wait_runtime += p->wait_runtime;
}

the above is the basic expression of: "charge a positive bank balance".

(we obviously dont want to allow people to 'share' their loans with
others ;), nor do we want to allow a net negative balance. CFS is really
brutally cold-hearted, it has a strict 'no loans' policy - the easiest
economic way to manage 'inflation', besides the basic act of not
printing new money, ever.)

[note, due to the nanoseconds unit there's no rounding loss to worry
about.]

that's all. No runqueue locking, no wakeup decisions even! [Note: see
detail #1 below for cases where we need to touch the tree]. Really
low-overhead. Accumulated 'new money' will be acted upon in the next
schedule() call or in the next scheduler tick, whichever comes sooner.
Note that in most cases when tasks communicate there will be a natural
schedule() anyway, which drives this.

p->wait_runtime is also very finegrained: it is in nanoseconds, so a
task can 'pay' at arbitrary granularity in essence, and there is in
essence zero 'small coin overhead' and 'conversion loss' in this money
system. (as you might remember, sharing p->timeslice had inherent
rounding and sharing problems due to its low jiffy resolution)

detail #1: for decoupled workloads where there is no direct sleep/wake
coupling between worker and producer, there should also be a way to
update a task's position in the fairness tree, if it accumulates
significant amount of new p->wait_runtime. I think this can be done by
making this an extra field: p->new_wait_runtime, which gets picked up by
the task if it runs, or which gets propagated into the task's tree
position if the p->new_wait_runtime value goes above the
sched_granularity_ns value. But it would work pretty well even without
this, the server will take advantage of the p->new_wait_runtime
immediately when it runs, so as long as enough clients 'feed' it with
money, it will always have enough to keep going.

detail #2: changes to p->wait_runtime are totally lockless, as long as
they are 64-bit atomic. So the above code is a bit naive on 32-bit
systems, but no locking is needed otherwise, other than having a stable
reference to a task structure. (i designed CFS for 64-bit systems)

detail #3: i suspect i should rename p->wait_runtime to a more intuitive
name - perhaps p->right_to_run? I want to avoid calling it p->timeslice
because it's not really a timeslice, it's the thing you earned, the
'timeslice' is a totally locally decided property that has no direct
connection to this physical resource. I also dont want to call it
p->cpu_credit, because it is _not_ a credit system: every positive value
there has been earned the hard way: by 'working' for the system via
waiting on the runqueue - scaled down to the 'expected fair runtime' -
i.e. roughly scaled down by 1/rq->nr_running.

detail #3: the scheduler is also a charity: when it has no other work
left it will let tasks execute "for free" ;-) But otherwise, in any sort
of saturated market situation CFS is very much a cold hearted
capitalist.

about the 50% rule: it was a totally arbitrary case for yield_to(), and
in other cases it should rather be: "give me _all_ the money you have,
i'll make it work for you as much as i can". And the receiver should
also perhaps record the amount of 'money' it got from the client, and
_give back_ any unused proportion of it. (only where easily doable, in
1:1 task relationships) I.e.:

p_to->wait_runtime = p->wait_runtime;
p->wait_runtime = 0;

schedule();

the former two lines put into a sched_pay(p) API perhaps?

> The "perfect" situation would be that when somebody goes to sleep, any
> extra points it had could be given to whoever it woke up last. Note
> that for something like X, it means that the points are 100%
> ephemeral: it gets points when a client sends it a request, but it
> would *lose* the points again when it sends the reply!

yeah, exactly. X could even attempt to explicitly manage some of those
payments it received: it already has an internal automatic notion of
'expensive clients' (which do large X requests).

> There are subtle semantics with these kinds of things: especially if
> the scheduling points are only awarded when a process goes to sleep,
> if X is busy and continues to use the CPU (for another client), it
> wouldn't give any scheduling points back to clients and they really do
> accumulate with the server. Which again sounds like it would be
> exactly the right thing (both in the sense that the server that runs
> more gets more points, but also in the sense that we *only* give
> points at actual scheduling events).

yeah. Not only would it cause accumulation of 'money', that money would
buy it lower latencies as well: with more p->wait_runtime X will get on
the CPU faster. So it's a basic "work batching" mechanism.

> But how do you actually *give/track* points? [...]

tracking: these points are all hard-earned and their sum in the system
has a maximum. I.e. any point you 'give' to X was something you already
earned and it's something that the scheduler would have given CPU time
for in the near future. So as long as the transaction is balanced (the
total sum does not change), this does not create any unfairness
anywhere.

giving: it can be done lockless (64-bit atomic, they are nanoseconds).
'Collecting' p->new_wait_runtime up to sched_granularity_ns will also
make sure this field is the only typical impact that clients have on the
server - the tree position will only be recalculated at a frequency
determined by sched_granularity_ns.

> [...] A simple "last woken up by this process" thing that triggers
> when it goes to sleep? It might work, but on the other hand,
> especially with more complex things (and networking tends to be pretty
> complex) the actual wakeup may be done by a software irq. Do we just
> say "it ran within the context of X, so we assume X was the one that
> caused it?" It probably would work, but we've generally tried very
> hard to avoid accessing "current" from interrupt context, including
> bh's..

i'd concentrate on specific synchronous instances first, where we know
the producer and the consumer.

Later on we could attach "money" to localhost network packets for
example, to 'spread' fairness into more complex parts of the system.
(Note that such 'money' could even be passed along with a _physical_
packet, for example in a cluster - so it makes lots of sense on both the
small and the large scale.) I'd not directly attach 'money' to softirqs,
i'd attach it to the _object of work_: the packet, the workqueue entry,
the IO request, etc., etc.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/