Re: [BUG nohz]: wrong user and system time accounting

From: Frederic Weisbecker
Date: Thu Apr 13 2017 - 09:34:44 EST


On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote:
> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner <tglx@xxxxxxxxxxxxx>:
> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
> >> > It's not different from the current jiffies based stuff at all. Same
> >> > failure mode.
> >>
> >> Yes you're right, I got confused again. So to fix this we could do our snapshots
> >> at a frequency lower than HZ but still high enough to avoid overhead.
> >>
> >> Something like TICK_NSEC / 2 ?
> >
> > If you are using TSC anyway then you can do proper accumulation for both
> > system and user and only account the data when the accumulation is more
> > than a jiffie.
>
> So I implement it as below:
>
> - HZ=1000.
> 1) two cpu hogs on cpu in nohz_full mode, 100% user time
> 2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected)
> - HZ=250
> 1) two cpu hogs on cpu in nohz_full mode, 100% user time
> 2) Luzi's testcase, 100% idle
>
> So the codes below still not work correctly for HZ=250, any suggestions?

Right, so first lets reorder that code a bit so we can see clear inside :-)

>
> -------------------------------------->8-----------------------------------------------------
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..6a11771 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -668,6 +668,8 @@ struct task_struct {
> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> seqcount_t vtime_seqcount;
> unsigned long long vtime_snap;
> + u64 vtime_acct_utime;
> + u64 vtime_acct_stime;

You need to accumulate guest and steal time as well.

> enum {
> /* Task is sleeping or running in a CPU with VTIME inactive: */
> VTIME_INACTIVE = 0,
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f8e54ba 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
> #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>
> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> -static u64 vtime_delta(struct task_struct *tsk)
> +static u64 vtime_delta(struct task_struct *tsk, bool user)
> {
> - unsigned long now = READ_ONCE(jiffies);
> + u64 delta, ret = 0;
>
> - if (time_before(now, (unsigned long)tsk->vtime_snap))
> - return 0;
> + delta = sched_clock() - tsk->vtime_snap;
>
> - return jiffies_to_nsecs(now - tsk->vtime_snap);
> + if (is_idle_task(tsk)) {
> + if (delta >= TICK_NSEC)
> + ret = delta;
> + } else {
> + if (user) {
> + tsk->vtime_acct_utime += delta;
> + if (tsk->vtime_acct_utime >= TICK_NSEC)
> + ret = tsk->vtime_acct_utime;
> + } else {
> + tsk->vtime_acct_stime += delta;
> + if (tsk->vtime_acct_utime >= TICK_NSEC)
> + ret = tsk->vtime_acct_stime;
> + }

We already have vtime_account_idle, vtime_account_user, etc...
The accumulation should be done by these functions that know what and where
to account. vtime_delta() should really just return the difference against vtime_snap,
it's too low level to care about these details.

> + }
> +
> + return ret;
> }
>
> -static u64 get_vtime_delta(struct task_struct *tsk)
> +static u64 get_vtime_delta(struct task_struct *tsk, bool user)
> {
> - unsigned long now = READ_ONCE(jiffies);
> - u64 delta, other;
> + u64 delta = vtime_delta(tsk, user);
> + u64 other;
> +
> + if (!is_idle_task(tsk)) {
> + if (user)
> + tsk->vtime_acct_utime = 0;
> + else
> + tsk->vtime_acct_stime = 0;
> + }

Like vtime_delta(), get_vtime_delta() shouldn't touch these accumulators.
Reset and accounting really should be done by the upper level functions
vtime_account_*()

Thanks.