Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

From: Peter Zijlstra
Date: Mon Jul 23 2018 - 05:21:25 EST


On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
> The trace data corresponds to the last sample period:
> trace entry 1:
> cat-20755 [022] d... 1370.106496: cputime_adjust: task
> tick-based utime 362560000000 stime 2551000000, scheduler rtime 333060702626
> cat-20755 [022] d... 1370.106497: cputime_adjust: result:
> old utime 330729718142 stime 2306983867, new utime 330733635372 stime
> 2327067254
>
> trace entry 2:
> cat-20773 [005] d... 1371.109825: cputime_adjust: task
> tick-based utime 362567000000 stime 3547000000, scheduler rtime 334063718912
> cat-20773 [005] d... 1371.109826: cputime_adjust: result:
> old utime 330733635372 stime 2327067254, new utime 330827229702 stime
> 3236489210
>
> 1) expected behaviour
> Let's compare the last two trace entries (all data below is in ns):
> task tick-based utime: 362560000000->362567000000 increased 7000000
> task tick-based stime: 2551000000 ->3547000000 increased 996000000
> scheduler rtime: 333060702626->334063718912 increased 1003016286
>
> The application actually runs at almost 100% sys at the moment; we can
> use the increase in the task's tick-based utime and stime to double-check:
> 996000000 / (7000000 + 996000000) > 99% sys
>
> 2) the current cputime_adjust() inaccurate result
> But with the current cputime_adjust() we get the following adjusted
> utime and stime increases over the same sample period:
> adjusted utime: 330733635372->330827229702 increased 93594330
> adjusted stime: 2327067254 ->3236489210 increased 909421956
>
> so 909421956 / (93594330 + 909421956) = 91% sys, as the shell script
> above showed (both percentages are reproduced in the snippet below).
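>
> (To reproduce the two percentages: a throwaway userspace check, with
> the delta values hardcoded from the trace entries above.)
>
> #include <stdio.h>
>
> int main(void)
> {
> 	/* tick-based deltas between the two trace entries, in ns */
> 	double tick_du = 362567000000.0 - 362560000000.0; /*   7000000 */
> 	double tick_ds = 3547000000.0 - 2551000000.0;     /* 996000000 */
>
> 	/* adjusted (cputime_adjust) deltas over the same period */
> 	double adj_du = 330827229702.0 - 330733635372.0;  /*  93594330 */
> 	double adj_ds = 3236489210.0 - 2327067254.0;      /* 909421956 */
>
> 	/* prints ~99.3% vs ~90.7%: the adjusted split lags the tick ratio */
> 	printf("tick-based sys: %.1f%%\n",
> 	       100.0 * tick_ds / (tick_du + tick_ds));
> 	printf("adjusted   sys: %.1f%%\n",
> 	       100.0 * adj_ds / (adj_du + adj_ds));
> 	return 0;
> }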
>
> 3) root cause
> The root cause of the issue is that the current cputime_adjust() always
> passes the whole-of-life totals to scale_stime() when splitting utime
> and stime. In this patch we instead pass only the deltas accrued within
> the user's sample period, as computed in 1), to scale_stime(), and
> accumulate the results onto the previously saved adjusted utime and
> stime. This guarantees accurate usr and sys increases within the user's
> sample period.
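>
> A minimal kernel-style sketch of the delta idea (the prev->rtime and
> prev->tick_{u,s}time fields are hypothetical here, and the real
> scale_stime() carefully avoids 64-bit multiply overflow, which this
> simplified version does not):
>
> /* split stime out of rtime in the ratio stime/(stime+utime) */
> static u64 scale_stime(u64 stime, u64 rtime, u64 total)
> {
> 	return total ? stime * rtime / total : rtime;
> }
>
> static void cputime_adjust_delta(struct task_cputime *curr,
> 				 struct prev_cputime *prev,
> 				 u64 *ut, u64 *st)
> {
> 	/* deltas accrued since the previous call, not whole-life totals */
> 	u64 drtime = curr->sum_exec_runtime - prev->rtime;
> 	u64 dstime = curr->stime - prev->tick_stime;
> 	u64 dutime = curr->utime - prev->tick_utime;
>
> 	/* split only this period's rtime by this period's tick ratio */
> 	u64 dst = scale_stime(dstime, drtime, dstime + dutime);
>
> 	/* accumulate onto the previously saved adjusted values */
> 	prev->stime += dst;
> 	prev->utime += drtime - dst;
>
> 	prev->rtime      = curr->sum_exec_runtime;
> 	prev->tick_stime = curr->stime;
> 	prev->tick_utime = curr->utime;
>
> 	*ut = prev->utime;
> 	*st = prev->stime;
> }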

But why is this a problem?

Since it's sample-based, there's really nothing much you can guarantee.
What if your test program were to run in userspace for 50% of the time,
but is constructed so that it is always in kernel space when the tick
happens?

Then you would 'expect' it to be 50% user and 50% sys, but you're not
getting that either.
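
As a toy illustration (a simulation, not a real workload): phase-lock a
50/50 user/kernel cycle to a 1ms tick and the sampler accounts 100% sys
even though the true split is 50/50:

#include <stdio.h>

int main(void)
{
	long utime_ticks = 0, stime_ticks = 0;

	/* 10000 ticks, one per ms; the task spends the first 500us of
	 * every 1ms period in kernel space, so each tick (at phase 0)
	 * always samples kernel context */
	for (long t = 0; t < 10000; t++) {
		long phase_us = (t * 1000) % 1000; /* always 0: phase-locked */
		if (phase_us < 500)
			stime_ticks++;
		else
			utime_ticks++;
	}

	/* prints: utime 0 stime 10000 -> accounted as 100% sys */
	printf("utime %ld stime %ld\n", utime_ticks, stime_ticks);
	return 0;
}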

This stuff cannot be perfect, and the current code provides 'sensible'
numbers over the long run for most programs. Why muck with it?