Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

From: Xunlei Pang
Date: Wed Jun 27 2018 - 08:23:05 EST


On 6/26/18 11:49 PM, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 08:19:49PM +0800, Xunlei Pang wrote:
>> On 6/22/18 3:15 PM, Xunlei Pang wrote:
>>> We use per-cgroup cpu usage statistics similar to "cgroup rstat",
>>> and encountered a problem that user and sys usages are wrongly
>>> split sometimes.
>>>
>>> Run tasks with some random run-sleep pattern for a long time, and
>>> when tick-based time and scheduler sum_exec_runtime hugely drifts
>>> apart(scheduler sum_exec_runtime is less than tick-based time),
>>> the current implementation of cputime_adjust() will produce less
>>> sys usage than the actual use after changing to run a different
>>> workload pattern with high sys. This is because total tick-based
>>> utime and stime are used to split the total sum_exec_runtime.
>>>
>>> Same problem exists on utime and stime from "/proc/<pid>/stat".
>>>
>>> [Example]
>>> Run some random run-sleep patterns for minutes, then change to run
>>> high sys pattern, and watch.
>>> 1) standard "top"(which is the correct one):
>>> 4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>>> 2) our tool parsing utime and stime from "/proc/<pid>/stat":
>>> 20.5 usr, 78.4 sys
>>> We can see "20.5 usr" displayed in 2) was incorrect, it recovers
>>> gradually with time: 9.7 usr, 89.5 sys
>>>
>> High sys probably means there's something abnormal on the kernel
>> path, it may hide issues, so we should make it fairly reliable.
>> It can easily hit this problem with our per-cgroup statistics.
>>
>> Hi Peter, any comment on this patch?
>
> Well, no, because the Changelog is incomprehensible and the patch
> doesn't really have useful comments, so I'll have to reverse engineer
> the entire thing, and I've just not had time for that.
>

Let me try the best to describe it again.

There are two types of run time for a process:
1) task_struct::utime, task_struct::stime in ticks.
2) scheduler task_struct::se.sum_exec_runtime(rtime) in ns.

In case of no vtime accounting, the utime/stime fileds of
/proc/pid/stat are calculated by cputime_adjust(), which
splits the precise rtime in the proportion of tick-based
utime and stime.

However cputime_adjust() always does the split using the
total utime/stime of the process, this may cause wrong
splitting in some cases, e.g.

A typical statistic collector accesses "/proc/pid/stat".
1) moment t0
After accessed /proc/pid/stat in t0:
tick-based whole utime is utime_0, tick-based whole stime
is stime_0, scheduler time is rtime_0. The ajusted utime
caculated by cputime_adjust() is autime_0, ajusted stime
is astime_0, so astime_0=rtime_0*stime_0/(utime_0+stime_0).

For a long time, the process runs mainly in userspace with
run-sleep patterns, and because two different clocks, it
is possible to have the following condition:
rtime_0 < utime_0 (as with little stime_0)

2) moment t1(after dt, i.e. t0+dt)
Then the process suddenly runs 100% sys workload afterwards
lasting "dt", when accessing /proc/pid/stat at t1="t0+dt",
both rtime_0 and stime_0 increase "dt", thus cputime_ajust()
does the calculation for new adjusted astime_1 as follows:
(rtime_0+dt)*(stime_0+dt)/(utime_0+stime_0+dt)
= (rtime_0*stime_0+rtime_0*dt+stime_0*dt+dt*dt)/(utime_0+stime_0+dt)
= (rtime_0*stime_0+rtime_0*dt-utime_0*dt)/(utime_0+stime_0+dt) + dt
< rtime_0*stime_0/(utime_0+stime_0+dt) + dt (for rtime_0 < utime_0)
< rtime_0*stime_0/(utime_0+stime_0) + dt
< astime_0+dt

The actual astime_1 should be "astime_0+dt"(as it runs 100%
sys during dt), but the caculated figure by cputime_adjust()
becomes much smaller, as a result the statistics collector
shows less cpu sys usage than the actual one.

That's why we occasionally observed the incorrect cpu usage
described in the changelog:
[Example]
Run some random run-sleep patterns for minutes, then change
to run high sys pattern, and watch.
1) standard "top"(which is the correct one):
4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
2) Parsing utime and stime from "/proc/<pid>/stat":
20.5 usr, 78.4 sys
We can see "20.5 usr" displayed in 2) was incorrect, it recovers
gradually with time: 9.7 usr, 89.5 sys


Thanks,
Xunlei