Re: Stolen and degraded time and schedulers

From: Jeremy Fitzhardinge
Date: Thu Mar 15 2007 - 15:34:18 EST


Paul Mackerras wrote:
> A cycle on one thread of a machine with SMT/hyperthreading when the
> other thread is idle *isn't* equivalent to a cycle when the other
> thread is busy. We run into this on POWER5, where we have hardware
> that counts cycles when each of the two threads in each core gets to
> dispatch instructions (on each cycle, one thread or the other gets to
> dispatch). That helps but still doesn't give a totally accurate
> estimate of how much computation a given process has managed to do.
>

Yes, but it doesn't need to be 100% accurate to be useful; it just needs
to better characterize the amount of work done. You could get a better
approximation by using two scaling factors: one for work done while the
other thread is idle, and one for work done while the other thread is busy.

>> I often nice my kernel builds
>> (which cpufreq takes as a hint to not ramp up the cpu speed) on my
>> laptop so to save power.
>>
>
> Just as a side note - that's probably actually a bad strategy; you
> almost certainly consume less total energy by running the cpu at full
> speed until the build is done and then going to the deepest sleep mode
> you can achieve.
>

It seems to me that a 5 min build at 1/4 power uses less energy than
running for 2.5 min at full power; voltage scaling gives you n^2 power use
scaling, remember. Not that I've measured it or anything.
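
The arithmetic behind that, as a back-of-envelope sketch. The function
name and the units ("full-power minutes") are made up, the n^2 power model
is my assumption from above, and the idle power after the fast build
finishes is a free parameter, since that is exactly the part your
argument depends on.

```c
#include <assert.h>

/*
 * Energy used over a window, in "full-power minutes": the CPU runs for
 * run_min at run_power (a fraction of full power), then sits idle for
 * idle_min at idle_power. All values are illustrative, not measured.
 */
static double window_energy(double run_min, double run_power,
			    double idle_min, double idle_power)
{
	assert(run_min >= 0.0 && idle_min >= 0.0);
	return run_min * run_power + idle_min * idle_power;
}
```

Under these assumptions, window_energy(5.0, 0.25, 0.0, 0.0) is 1.25, while
window_energy(2.5, 1.0, 2.5, 0.0) is 2.5 even with a perfectly free sleep
state. A model that includes static leakage or platform power would shift
the comparison towards racing to idle.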

> What was the original proposal? I came into this discussion late...
>

My core proposal is basically that sched_clock() should try to return a
time which scales with the amount of work done by a CPU rather than
measure real time. This helps solve two problems:

* it accounts for time stolen by a hypervisor, since a stolen CPU
does no work
* it accounts for cpus running at lower operating points, since they
do less work per unit time

You could also use it to take into account time stolen by interrupts,
SMM, thermal limiting and so on. The idea is that processes shouldn't
get penalized for CPU time that they had no opportunity to use.
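
Roughly, the clock I have in mind behaves like the sketch below. This is
not the actual sched_clock() code; the struct, the function and its
parameters are hypothetical, and it assumes something else supplies a
cumulative stolen-time counter and the current and maximum CPU frequency.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU state for a work-scaled clock. */
struct work_clock {
	uint64_t last_real_ns;		/* real time at the last update */
	uint64_t last_stolen_ns;	/* cumulative stolen time at the last update */
	uint64_t work_ns;		/* accumulated work-scaled time */
};

/*
 * Advance the clock: drop the stolen part of the interval, then scale
 * what is left by cur_khz / max_khz. Assumes updates are frequent enough
 * that ran * cur_khz does not overflow 64 bits.
 */
static uint64_t work_clock_update(struct work_clock *wc,
				  uint64_t real_ns, uint64_t stolen_ns,
				  uint32_t cur_khz, uint32_t max_khz)
{
	uint64_t elapsed = real_ns - wc->last_real_ns;
	uint64_t stolen = stolen_ns - wc->last_stolen_ns;
	uint64_t ran = elapsed > stolen ? elapsed - stolen : 0;

	assert(max_khz != 0);
	wc->work_ns += ran * cur_khz / max_khz;
	wc->last_real_ns = real_ns;
	wc->last_stolen_ns = stolen_ns;
	return wc->work_ns;
}
```

So a 1000ns interval with 200ns stolen, on a CPU running at half its
maximum frequency, advances the clock by 400ns. Interrupt, SMM or thermal
throttling time could be folded in as more stolen time or as another
scale factor.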

This is almost completely compatible with how sched_clock() is currently
used, except that the scheduler also uses it to measure sleeping time.
This doesn't make much sense in my proposal because sched_clock() is an
inherently per-CPU time measure, and sleeping doesn't involve any CPU by
definition. It also doesn't make much sense to say that a process slept
for less time simply because it wasn't using the CPU while it was being
stolen/running slowly.

Despite this, it works better than expected because the current
scheduler adjusts the sched_clock-derived process timestamps as they
move between runqueues, so they never get too far out of whack.

I've implemented sched_clock to only count non-stolen CPU nanoseconds in
the Xen-paravirt_ops implementation; we'll see how it works out.

J