Re: Stolen and degraded time and schedulers

From: Jeremy Fitzhardinge
Date: Wed Mar 14 2007 - 15:34:44 EST


Dan Hecht wrote:
> Yes, I was just trying to use some consistent terminology, so I picked
> linux (hrtimer.c) terms: CLOCK_REALTIME == wallclock, CLOCK_MONOTONIC
> == "real" time counter.

OK. I had used "monotonic" in its more general sense earlier in the
thread, and I wanted to be sure.

> Even cpu clock cycles doesn't really tell you how much "work" a cpu
> was able to get done. Different cpus have different throughputs per
> cycle per instruction sequence.

Sure. But on a given machine, the CPUs are likely to be closely enough
matched that a cycle on one CPU is more or less equivalent to a cycle on
another CPU. The fact that a cycle represents a different amount of
work on an i486 compared to Core2 doesn't matter much. The important
part is that when the scheduler is doling out CPU time, it is comparing
everyone's usage with a common unit.

> Right, and that's why I'm not sure I'm convinced the two should be
> confused. In the case of cpufreq you are talking about the cpu not
> doing as much work due to a choice the kernel made. In the case of
> stolen time, the choice wasn't made by the kernel, but instead the
> hypervisor. I understand they are somewhat similar from the
> perspective of the scheduler, but a bit different.

Yes, but the question is whether it matters all that much. Does it
matter enough to make them two separate concepts, when one seems to
cover all the important points?

> Also, I'm not sure this is the right thing for cpufreq because:
> 1) when the load is high, which is when this all matters, presumably
> the kernel will ramp up the cpu to full speed.

Not at all. You might have an unimportant but cpu-bound process which
doesn't merit increasing the cpu speed, but should also be scheduled
properly compared to other processes. I often nice my kernel builds
(which cpufreq takes as a hint to not ramp up the cpu speed) on my
laptop so as to save power.

> 2) in the case where there are two different machines with two
> different process speeds (well, really processor throughputs), today
> the scheduler doesn't care about trying to adjust the timing due to
> one machine being faster than the other.

It doesn't matter. The scheduler is only important when there's
contention for the cpu, and if there is, what matters is that it compares
process CPU usage with the same unit. What that unit is isn't inherently
very important, so long as it's consistent.

> I'm not sure this makes sense because it was the kernel that decided
> to slow the cpus and cause less work to be done.

That's true. But this is a case of the left brain not talking to the
right brain: cpufreq might decide to slow a cpu down, but the scheduler
doesn't take that into account. Making the timebase of sched_clock
reflect the current cpu speed (or more specifically, the integral of the
cpu speed over a time interval) is a good way of communicating between
the two subsystems.
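
As a rough illustration of "the integral of the cpu speed over a time
interval", here is a user-space sketch (not kernel code). The struct and
function names are made up for the example, and it assumes something
notifies the clock whenever the cpu frequency changes:

```c
#include <stdint.h>

/* Hypothetical per-cpu state; these names are illustrative, not kernel API. */
struct scaled_clock {
    uint64_t last_ns;    /* real ns at the last update */
    uint64_t scaled_ns;  /* accumulated "full-speed equivalent" ns */
    uint32_t cur_khz;    /* speed the cpu has been running at since last_ns */
    uint32_t max_khz;    /* nominal full speed */
};

/* Fold the interval since the last update into the total, weighted by the
 * speed the cpu ran at during that interval. The multiplication can
 * overflow for very long intervals; a real implementation would need to
 * guard against that. */
static void scaled_clock_update(struct scaled_clock *c, uint64_t now_ns)
{
    uint64_t delta = now_ns - c->last_ns;

    c->scaled_ns += delta * c->cur_khz / c->max_khz;
    c->last_ns = now_ns;
}

/* On a frequency change, close out the old interval at the old speed
 * before switching to the new one. */
static void scaled_clock_set_freq(struct scaled_clock *c, uint64_t now_ns,
                                  uint32_t new_khz)
{
    scaled_clock_update(c, now_ns);
    c->cur_khz = new_khz;
}

/* What a speed-aware sched_clock might return in this model. */
static uint64_t scaled_clock_read(struct scaled_clock *c, uint64_t now_ns)
{
    scaled_clock_update(c, now_ns);
    return c->scaled_ns;
}
```

With this model, a process that runs for 2000ns of real time on a cpu
clocked at half speed is charged 1000ns, the same as a process that ran
1000ns at full speed.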

> In the case of stolen time, however, the kernel wasn't even running at
> all on any pcpu, and it wasn't even up to the kernel to decide that.

As things stand now, there's not much difference from the scheduler's
perspective, since the scheduler takes no action in either case.

> I listed them below. To summarize, there are (at least) three:
>
> 1) sched_clock, the main topic of this thread.
> 2) p->time_slice

So, this is the target process timeslice, in units of sched_clock's
timebase?

> 3) cpustat->steal
>
>>> For example, the stuff that happens in update_process_times(). I
>>> think we'd want to account the stolen time to cpustat->steal.
>>
>> I guess we could do something for that. Would we account non-full-speed
>> cpus to it? Maybe?
>>
>> How is cpustat->steal used? How does it get out to usermode?
>>
>
> Via /proc/stat, used by modern 'top', maybe other utilities. It is
> useful to users who want to see where the time is really going from
> inside a guest when running on a (para)virtual machine.
>
> I believe the previous set of xen paravirt-ops patches already handled
> cases #2 and #3 (but no longer do since switching to clockevents), and
> the old vmitime code did also. Obviously, we need to revamp this stuff
> to make it fit in with the new clockevents/hrtimer way of doing things.
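
For reference, the steal value mentioned above can be picked out of a
/proc/stat "cpu" line from user space. This is only a sketch: the helper
name parse_steal is made up, and it assumes the usual field order (user,
nice, system, idle, iowait, irq, softirq, steal), with values in USER_HZ
ticks:

```c
#include <stdio.h>
#include <string.h>

/* Extract the steal field from one "cpu ..." line of /proc/stat.
 * Returns 0 on success, -1 if the line is not a cpu line or is too
 * short to contain a steal field. */
static int parse_steal(const char *line, unsigned long long *steal)
{
    char label[16];
    unsigned long long v[8];
    int n;

    n = sscanf(line, "%15s %llu %llu %llu %llu %llu %llu %llu %llu",
               label, &v[0], &v[1], &v[2], &v[3],
               &v[4], &v[5], &v[6], &v[7]);
    if (n < 9 || strncmp(label, "cpu", 3) != 0)
        return -1;

    *steal = v[7];  /* eighth numeric field, assumed to be steal */
    return 0;
}
```

A caller would fopen("/proc/stat", "r"), read lines with fgets(), and pass
each one to parse_steal(); the first line is the aggregate over all cpus
and the "cpuN" lines are per-cpu.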

I added stolen time accounting to xen-pv_ops last night. For Xen, at
least, it wasn't hard to fit into the clockevent infrastructure. I just
update the stolen time accounting for each cpu when it gets a timer
tick; they seem to get a tick every couple of seconds even when idle.

Similarly, implementing sched_clock as "number of ns the vcpu spent in
running state" is simple and direct (though this makes it an explicitly
per-cpu clock; comparing raw sched_clock values between cpus will be
meaningless; but that's likely true when using the tsc as a timebase too).
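
To make the two pieces above concrete, here is a user-space model, not the
actual xen-pv_ops code. It assumes the hypervisor reports cumulative
per-vcpu time in each of four states, and it assumes "stolen" means time
spent runnable-but-not-running plus time offline; all names are
hypothetical:

```c
#include <stdint.h>

/* Hypothetical vcpu states for this model. */
enum vcpu_state {
    VCPU_RUNNING,   /* actually executing on a physical cpu */
    VCPU_RUNNABLE,  /* wants to run, but the hypervisor ran something else */
    VCPU_BLOCKED,   /* idle by the guest's own choice */
    VCPU_OFFLINE,   /* not available to run */
    VCPU_NSTATES
};

/* Cumulative ns spent in each state, as reported for one vcpu. */
struct vcpu_times {
    uint64_t ns[VCPU_NSTATES];
};

/* Per-cpu bookkeeping for stolen-time accounting. */
struct steal_account {
    uint64_t last_stolen_ns;
};

/* sched_clock as "ns this vcpu spent in the running state". Values are
 * per-vcpu and not comparable between vcpus. */
static uint64_t model_sched_clock(const struct vcpu_times *t)
{
    return t->ns[VCPU_RUNNING];
}

/* Called from the timer tick: returns the stolen ns accumulated since
 * the previous tick on this cpu, to be added to the steal statistics. */
static uint64_t account_stolen_on_tick(struct steal_account *a,
                                       const struct vcpu_times *t)
{
    uint64_t total = t->ns[VCPU_RUNNABLE] + t->ns[VCPU_OFFLINE];
    uint64_t delta = total - a->last_stolen_ns;

    a->last_stolen_ns = total;
    return delta;
}
```

Note that blocked time is deliberately not counted as stolen in this
model, since the vcpu gave up the cpu voluntarily.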

>> I think the important part is that sched_clock() be used to actually
>> compute how much time each process gets. The fact that a time quantum
>> gets stolen is less important. Or do you mean something else?
>>
>
> How is time quantum getting stolen less important? Time quantum
> getting stolen results directly in more unnecessary context switches
> since we might steal the entire timeslice before the process even ran.

It doesn't matter why you didn't get the time; the important part is
that you know that time went missing. It's true that you may end up with
some spurious rescheds, but that seems like the kind of thing you'd want
to measure as being a problem before getting clever in fixing.

> I actually think #2 and #3 might be more important than #1 (at least
> they are as important). And, the earlier Xen patches seem to agree
> with this, since they addressed 2 & 3 only.

I'd call that an oversight. Xen has everything needed to implement
sched_clock in terms of non-stolen time.

>> Yes. In the same way that clock modulation gates the cpu clock, the
>> hypervisor effectively gates the clock by giving time to other vcpus.
>>
>
> Yes, except in one case it was a choice of the kernel, and in the
> other it was up to the hypervisor.

Not necessarily. The cpu might drop into thermal protection clock
modulation.

J