Re: [PATCH] sched: Don't try to catch up excess steal time.

From: Suleiman Souhlal
Date: Fri Aug 23 2024 - 17:52:27 EST


On Wed, Aug 21, 2024 at 12:51 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> On Tue, 2024-08-06 at 20:11 +0900, Suleiman Souhlal wrote:
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future clock_task
> > measurements, as they end up getting additional steal time that did not
> > actually happen, from the previous excess steal time being paid back.
> >
> > For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> > time reported while it ran. clock_task rightly doesn't advance. Then, a
> > different task runs on the same rq for 10ms without any time stolen.
> > Because of the current catch up mechanism, clock_sched inaccurately ends
> > up advancing by only 5ms instead of 10ms even though there wasn't any
> > actual time stolen. The second task is getting charged for less time
> > than it ran, even though it didn't deserve it.
> > In other words, tasks can end up getting more run time than they should
> > actually get.
> >
> > So, we instead don't make future updates pay back past excess stolen time.
>
> My understanding was that it was done this way for a reason: there is a
> lot of jitter between the "run time" (your 10ms example), and the steal
> time (15ms). What if 5ms really *did* elapse between the time that
> 'delta' is calculated, and the call to paravirt_steal_clock()?
>
> By accounting that steal time "in advance" we ensure it isn't lost in
> the case where the same process remains running for the next timeslice.

This is an interesting observation, but I'd argue that even in the scenario
where the same task stays running, the extra 5ms should not be accounted
for in steal time:
>From what I can tell, update_curr() is only updating the task's runtime by the
"delta" that was calculated, so the extra "stolen" time that happened between
the delta measurement and the steal time application does not seem to be
relevant. From what I can tell, that stolen time happened at a point where the
task is not being counted as running in its run time, so it should not be
accounted.

I am however struggling to express this in a way that is easily understandable
by others.

> However, that does cause problems when the steal time goes negative
> (due to hypervisor bugs). So in
> https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@xxxxxxxxxxxxx/
> I limited the amount of time which would be accounted to a future tick.

That would definitely be an improvement over the old behavior.

I noticed the issue because I've been trying to make sense of the numbers while
trying to include the time that host was suspended in steal time in
https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@xxxxxxxxxx/
(But I think the issue can happen even without these changes.)

Thanks,
-- Suleiman