Re: CFS review

From: Ingo Molnar
Date: Thu Aug 02 2007 - 12:09:35 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> It would be better, I suspect, to make the scheduler clock totally
> distinct from the other clock sources (many architectures have per-cpu
> cycle counters), and *not* try to even necessarily force it to be a
> "time-based" one.

yeah. Note that i largely detached sched_clock() from the GTOD
clocksources already in CFS, so part of this is already implemented and
the intention is clear. For example, when the following happens:

Marking TSC unstable due to: possible TSC halt in C2.
Clocksource tsc unstable (delta = -71630388 ns)

sched_clock() does _not_ stop using the TSC. It is very careful with the
TSC value, it checks against wraps, jumping, etc. (the whole rq_clock()
wrapper around sched_clock()), but still tries to use the highest
resolution time source possible, even if that time source is not good
enough for GTOD's purposes anymore. So the scheduler clock is already
largely detached from other clocksources in the system.

> It would still be true that something that is purely based on timer
> ticks will always be liable to have rounding errors that will
> inevitably mean that you don't get good fairness - tuning threads to
> run less than a timer tick at a time would effectively "hide" them
> from the scheduler accounting. However, in the end, I think that's
> pretty much unavoidable.

Note that there is a relatively easy way of reducing the effects of such
intentional coupling: turn on CONFIG_HIGH_RES_TIMERS. That decouples the
scheduler tick from the jiffy tick and works against such 'exploits' -
_even_ if the scheduler clock is otherwise low resolution. Also enable
CONFIG_NO_HZ and the whole thing (of when the scheduler tick kicks in)
becomes very hard to predict.

[ So while in a low-res clock situation scheduling will always be less
precise, with hres-timers and dynticks we have a natural 'random
sampler' mechanism so that no task can couple to the scheduler tick -
accidentally or even intentionally.

The only 'unavoidable coupling' scenario is when the hardware has only
a single, low-resolution time sampling method. (that is pretty rare
though, even in the ultra-embedded space. If a box has two independent
hw clocks, even if they are low resolution, the timer tick can be
decoupled from the scheduler tick.) ]

> I have to say, it would be interesting to try to use 32-bit
> arithmetic.

yeah. We tried to do as much of that as possible, please read on below
for (many) more details. There's no short summary i'm afraid :-/

Most importantly, CFS _already_ includes a number of measures that act
against too frequent math. So even though you can see 64-bit math code
in it, it's only rarely called if your clock has a low resolution - and
that happens all automatically! (see below the details of this buffered
delta math)

I have not seen Roman notice and mention any of these important details
(perhaps because he was concentrating on finding faults in CFS - which a
reviewer should do), but those measures are still very important for a
complete, balanced picture, especially if one focuses on overhead on
small boxes where the clock is low-resolution.

As Peter has said it in his detailed review of Roman's suggested
algorithm, our main focus is on keeping total complexity down - and we
are (of course) fundamentally open to changing the math behind CFS, we
ourselves tweaked it numerous times, it's not cast into stone in any
way, shape or form.

> I also think it's likely a mistake to do a nanosecond resolution.
> That's part of what forces us to 64 bits, and it's just not even an
> *interesting* resolution.

yes, very much not interesting, and we did not pick nanoseconds because
we find anything "interesting" in that timescale. Firstly, before i go
into the thinking behind nanoseconds, microseconds indeed have
advantages too, so the choice was not easy, see the arguments in favor
of microseconds below at: [*].

There are two fundamental reasons for nanoseconds:

1) they _automatically_ act as a 'carry-over' for fractional math and
thus reduce rounding errors. As Roman has noticed we dont care much
about math rounding errors in the algorithm: _exactly_ because we
*dont have to* care about rounding errors: we've got extra 10
"uninteresting" bits in the time variables to offset the effects of
rounding errors and to carry over fractionals.

( Sidenote: in fact we had some simple extra anti-rounding-error
code in CFS and removed it because it made no measurable
difference. So even in the current structure there's additional
design reserves that we could tap before having to go to another
math model. All we need is a testcase that demonstrates rounding
errors. Roman's testcase was _not_ a demonstration of math
rounding errors, it was about clock granularity! )

2) i tried microseconds resolution once (it's easy) but on fast hw it
already showed visible accounting/rounding artifacts in
high-frequency scheduling workloads, which, if hw gets just a little
bit faster, will become pain.

( Sidenote: if a workload is rescheduling once every few
microseconds, then it very much matters to the balance of things
whether there's a fundamental +- 1 microsecond skew on who gets
accounted what. In fact the placement of sched_clock() call within
schedule() is already visible in practice in some testcases,
whether the runqueue-spinlock acquire spinning time is accounted
to the 'previous' or the 'next' task - despite that time being
sub-microsecond on average. Going to microseconds makes this too
coarse. )

I.e. microseconds are on the limit today on fast hardware, and
nanoseconds give us an automatic buffer against rounding errors.

On _slow_ hardware, with a low-resolution clock, i very much agree that
we should not do too frequent math, and there are already four
independent measures that we did in CFS to keep the math overhead down:

Firstly, CFS fundamentally "buffers the math" via deltas, _everywhere_
in the fastpath:

if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) {
__update_curr(cfs_rq, curr, now);
curr->delta_exec = 0;
}

I.e. we only call the math routines if there was any delta. The beauty
is that this "math buffering" works _automatically_ if there is a
low-resolution sched_clock() present, because with a low-resolution
clock a delta is only observed if a tick happens. I.e. in a
high-frequency scheduling workload (the only place where scheduler
overhead would be noticeable) all the CFS math is in a rare _slowpath_,
that gets triggered only every 10 msecs or so! (if HZ=1000)

I.e. we didnt have to go down the (very painful) path of ktime_t
split-model math and we didnt have to introduce a variable precision
model for "scheduler time" either, because the delta math itself
automatically buffers on slow boxes.

Secondly, note that all the key fastpath variables are already 32-bit
(on 32-bit platforms):

long wait_runtime;
unsigned long delta_fair_run;
unsigned long delta_fair_sleep;
unsigned long delta_exec;

The _timestamps_ are still 64-bit, but most of the actual math goes on
in 32-bit delta variables. That's one of the advantages of doing deltas
instead of absolute values.

Thirdly, if even this amount of buffering is not enough for an
architecture, CFS also includes the sched_stat_granularity_ns tunable
that allows the further reduction of the sampling frequency (and the
frequency of us having to do the math) - so if the math overhead is a
problem an architecture can set it.

Fourthly, in CFS there is a _single_ 64-bit division, and even for that
division, the actual values passed to it are typically in a 32-bit
range. Hence we've introduced the following optimization:

static u64 div64_likely32(u64 divident, unsigned long divisor)
{
#if BITS_PER_LONG == 32
if (likely(divident <= 0xffffffffULL))
return (u32)divident / divisor;
do_div(divident, divisor);

return divident;
#else
return divident / divisor;
#endif
}

Which, if the divident is in 32-bit range, does a 32-bit division.

About math related rounding errors mentioned by Roman (not to be
confused with clock granularity rounding), in our analysis and in our
experience rounding errors of the math were never an issue in CFS, due
to the extra buffering that nanosecs gives - i tweaked it a bit around
CFSv10 but it was unprovable to have any effect. That's one of the
advantages of working with nanoseconds: the fundamental time unit
includes about 10 "low bits" that can carry over much of the error and
reduce rounding artifacts. And even that math rounding errors we believe
centers around 0.

In Roman's variant of CFS's algorithm the variables are 32-bit, but the
error is rolled forward in separate fract_* (fractional) 32-bit
variables, so we still have 32+32==64 bit of stuff to handle. So we
think that in the end such a 32+32 scheme would be more complex (and
anyone who took a look at fs2.c would i think agree - it took Peter a
day to decypher the math!) - but we'll be happy to be surprised with
patches of course :-)

Ingo

[*] the main advantage of microseconds would be that we could use "u32"
throughout to carry around the "now" variable (timestamp). That
property of microseconds _is_ tempting and would reduce the
task_struct footprint as well. But if we did that it would have
ripple effects: we'd have to resort to (pretty complex and
non-obvious) math to act against rounding errors. We'd either have
to carry the rounding error with us in separate 32-bit variables (in
essence creating 32+32 bit 64-bit variable), or we'd have to shift
up the microseconds by say ... 10 binary positions ... in essence
bringing us back into the same nanoseconds range. And then the
wraparound problem of microseconds - 72 hours is not _that_
unrealistic to trigger intentionally, so we have to do _something_
about it to inhibit infinite starvation. We experimented
around with all this and the overwhelming conclusion (so far) was
that trying to reduce timestamps back to 32 bits is just not worth
it. _Deltas_ should, can and _are_ 32-bit values already, even in
the nanosecond model. So all the buffering and delta logic gives us
most of the 32-bit advantages already, without the disadvantages of
microseconds.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/