Re: Stolen and degraded time and schedulers

From: Dan Hecht
Date: Wed Mar 14 2007 - 15:02:30 EST


On 03/13/2007 09:37 PM, Jeremy Fitzhardinge wrote:
Dan Hecht wrote:
With your previous definition of work time, would it be that:

monotonic_time == work_time + stolen_time ??

(By monotonic time, I presume you mean monotonic real time.)

Yes, I was just trying to use some consistent terminology, so I picked linux (hrtimer.c) terms: CLOCK_REALTIME == wallclock, CLOCK_MONOTONIC == "real" time counter.

Yes, I
suppose you could, but I don't think that's terribly useful. I think
work_time is probably most naturally measured in cpu clock cycles rather
than an actual time unit. You could convert it to ns, but I don't see
the point.


Even cpu clock cycles don't really tell you how much "work" a cpu was able to get done; different cpus have different per-cycle throughputs for a given instruction sequence.


I know it's a term in general use, but I don't think the term "stolen
time" is all that useful, particularly when we're talking about a more
general notion of cpu work contributing to the progress of process
execution. In the cpufreq case, time isn't "stolen" per se.


Right, and that's why I'm not convinced the two should be conflated. In the case of cpufreq, you are talking about the cpu not doing as much work due to a choice the kernel made. In the case of stolen time, the choice wasn't made by the kernel, but by the hypervisor. I understand the two look somewhat similar from the perspective of the scheduler, but they are a bit different.

Also, I'm not sure this is the right thing for cpufreq because:
1) when the load is high, which is when this all matters, presumably the kernel will ramp up the cpu to full speed.
2) in the case where there are two different machines with two different processor speeds (well, really processor throughputs), today the scheduler doesn't try to adjust its timing because one machine is faster than the other. I think this might be by design; i.e. the scheduler was intended to work in terms of real time units rather than work units. I guess you are arguing that this is incorrect, and that the scheduler should be scheduling based on how much work it was able to get done. I'm not sure this makes sense, because it was the kernel that decided to slow the cpus and cause less work to be done.

In the case of stolen time, however, the kernel wasn't even running at all on any pcpu, and it wasn't up to the kernel to decide that.


(I guess I don't like the term stolen time because you don't refer to
time spent on other processes as being stolen from your process: it's
just processor time being distributed.)


Okay. I think the term comes from the fact that most modern processes expect to time-share the cpu, whereas most kernels do not expect this. In any case, the kernel has already adopted it (cpustat->steal). But I don't really care what we call it.

i.e. would you be defining stolen_time to include the time lost to
processes due to the cpu running at a lower frequency? How does this
play into the other potential users, besides sched_clock(), of stolen
time? We should make sure that the abstraction introduced here makes
sense in those places too.

Be specific. What other uses are there?


I listed them below. To summarize, there are (at least) three:

1) sched_clock, the main topic of this thread.
2) p->time_slice
3) cpustat->steal
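
For concreteness, here's a rough sketch (not the actual Xen or VMI code) of how a per-vcpu stolen-time counter exported by the hypervisor could be folded into all three on each tick; hv_read_stolen_ns() is a made-up stand-in for whatever interface the hypervisor actually provides:

#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/kernel_stat.h>

/* Sketch only; hv_read_stolen_ns() is hypothetical. */
static DEFINE_PER_CPU(u64, prev_stolen_ns);

static void account_stolen_tick(int cpu)
{
	u64 now = hv_read_stolen_ns(cpu);
	u64 delta = now - per_cpu(prev_stolen_ns, cpu);

	per_cpu(prev_stolen_ns, cpu) = now;

	/* 3) cpustat->steal: make the stolen time visible to usermode
	 *    (real code would convert ns to cputime units) */
	kstat_cpu(cpu).cpustat.steal += delta;

	/* 1) sched_clock(): subtract delta so the scheduler sees "work time" */
	/* 2) p->time_slice: don't debit the slice for ticks that were stolen */
}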

For example, the stuff that happens in update_process_times(). I
think we'd want to account the stolen time to cpustat->steal.

I guess we could do something for that. Would we account non-full-speed
cpus to it? Maybe?

How is cpustat->steal used? How does it get out to usermode?


Via /proc/stat, used by modern 'top' and maybe other utilities. It is useful to users who want to see where the time is really going from inside a guest when running on a (para)virtual machine.
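
(For reference, the steal field is the eighth value on the "cpu" lines of /proc/stat -- user, nice, system, idle, iowait, irq, softirq, steal -- in kernels recent enough to report it. A minimal userspace check, just to illustrate where top gets the number:)

#include <stdio.h>

int main(void)
{
	unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;
	if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &user, &nice, &sys, &idle, &iowait, &irq, &softirq,
		   &steal) == 8)
		printf("steal: %llu ticks since boot\n", steal);
	fclose(f);
	return 0;
}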

I believe the previous set of xen paravirt-ops patches already handled cases #2 and #3 (but they no longer do since switching to clockevents), and the old vmitime code did also. Obviously, we need to revamp this stuff to make it fit in with the new clockevents/hrtimer way of doing things.

Also, the s390 and powerpc arches already account steal time, as another reference point. (The old xen time code and vmi time code worked in much the same way as those.) We should bring the s390 and powerpc folks into the discussion.


Also we'd probably want account for stolen time with regards to
task_running_tick(). (Though, in the latter case, maybe we first have
to move the scheduler away from assuming HZ rate decrementing of
p->time_slice to get this right. i.e. remove the tick based assumption
from the scheduler, and then maybe stolen time falls in more naturally
when accounting time slices).

I think the important part is that sched_clock() be used to actually
compute how much time each process gets. The fact that a time quantum
gets stolen is less important. Or do you mean something else?


How is the time quantum getting stolen less important? A stolen time quantum results directly in more unnecessary context switches, since the entire timeslice might be stolen before the process has even really run.

I actually think #2 and #3 might be more important than #1 (at least they are as important). And, the earlier Xen patches seem to agree with this, since they addressed 2 & 3 only.
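
(To make #2 concrete, here's a rough sketch of what a stolen-time-aware timeslice charge might look like; charge_timeslice() and its stolen argument are made up, and the real fix probably means removing the HZ-tick assumption first, as noted above:)

#include <linux/sched.h>

/*
 * Sketch only.  Today the tick path effectively does "--p->time_slice"
 * once per HZ tick whether or not the vcpu actually ran during that
 * tick; this variant only debits the ticks that were really worked.
 */
static void charge_timeslice(struct task_struct *p, unsigned int ticks,
			     unsigned int stolen)
{
	unsigned int worked = ticks > stolen ? ticks - stolen : 0;

	if (p->time_slice > worked) {
		p->time_slice -= worked;
	} else {
		p->time_slice = 0;
		set_tsk_need_resched(p);	/* expire the slice as usual */
	}
}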

I guess taking your cpufreq as an example of work_time progressing
slower than monotonic_time (and assuming that the remaining time is
what you would call stolen), then e.g. top would report 50% of your
cpu stolen when your cpu is running at 1/2 max rate.

Yes. In the same way that clock modulation gates the cpu clock, the
hypervisor effectively gates the clock by giving time to other vcpus.


Yes, except in one case it was a choice of the kernel, and in the other it was up to the hypervisor.

And p->time_slice would decrement at 1/2 the rate it normally did when
running at 1/2 speed. Is this the right thing to do? If so, then I
agree it makes sense to model hypervisor stolen time in terms of your
"work time".

Yes, that's my thought.


Since it was the kernel that decided to slow down the processor, I don't know if this makes sense.

Dan