[PATCH] sched/cputime: Resync time when guest & host lose sync

From: Wanpeng Li
Date: Mon Aug 15 2016 - 08:06:59 EST


From: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>

Commit:

57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

... triggered a regression:

On an i5 laptop with 4 pCPUs, run one full dynticks guest with 4 vCPUs and
four CPU hog processes (busy for loops) inside it. Hot-unplug the pCPUs on
the host one by one until only one is left, then watch top in the guest:
it reports 100% st for cpu0 (housekeeping) and 75% st for the other CPUs
(nohz full mode). Without the commit above, st is 75% on all four CPUs.
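
For reference, the 75% figure is what simple arithmetic predicts. The
sketch below is only an illustration and assumes that every vCPU is always
runnable and that the host shares the remaining pCPUs evenly among them;
expected_steal_pct() is a hypothetical helper, not a kernel function.

```c
/*
 * Hypothetical helper (not kernel code): expected steal percentage when
 * `vcpus` always-runnable vCPUs share `pcpus` physical CPUs evenly.
 */
static unsigned int expected_steal_pct(unsigned int vcpus, unsigned int pcpus)
{
	/* No vCPUs, or at least one pCPU per vCPU: nothing is stolen. */
	if (vcpus == 0 || pcpus >= vcpus)
		return 0;

	/* Each vCPU runs pcpus/vcpus of the time; the rest is steal. */
	return 100 - (100 * pcpus) / vcpus;
}
```

With 4 vCPUs on 1 pCPU this gives 75, matching the st value seen on all
four CPUs before the regression.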

As Rik and Paolo pointed out:

| It turns out that if a guest misses several timer ticks in a row, they
| will simply get lost.
|
| That means the functions calling steal_account_process_time may not know
| how much CPU time has passed since the last time it was called, but
| steal_account_process_time will get a good idea on how much time the host
| spent running something else.

Fix it by removing the max cputime limit for tick based sampling, while
keeping the limit for vtime, so that on the vtime path
steal_account_process_time() still never accounts more than the limit.
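
The effect of the limit can be sketched with a small user-space model. This
is an illustration, not the kernel code: struct steal_model and the model_*
and tick_* names are hypothetical, times are plain nanosecond counters, and
the only behaviour modelled is "account min(pending steal, maxtime)".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU steal bookkeeping (nanoseconds). */
struct steal_model {
	uint64_t steal_clock;	/* total steal time reported by the host */
	uint64_t prev_steal;	/* part of it already accounted by the guest */
	uint64_t accounted;	/* running total of accounted steal time */
};

/* Account pending steal time, but never more than maxtime. */
static uint64_t model_steal_account(struct steal_model *m, uint64_t maxtime)
{
	uint64_t pending, steal;

	assert(m->steal_clock >= m->prev_steal);
	pending = m->steal_clock - m->prev_steal;
	steal = pending < maxtime ? pending : maxtime;

	m->prev_steal += steal;
	m->accounted += steal;
	return steal;
}

/* Tick path with the limit: at most one tick of steal per tick. */
static uint64_t tick_capped(struct steal_model *m, uint64_t tick)
{
	uint64_t steal = model_steal_account(m, tick);

	/* Whatever is left of the tick goes to user/system/idle. */
	return steal >= tick ? 0 : tick - steal;
}

/* Tick path without the limit: all pending steal is accounted at once. */
static uint64_t tick_uncapped(struct steal_model *m, uint64_t tick)
{
	uint64_t steal = model_steal_account(m, UINT64_MAX);

	return steal >= tick ? 0 : tick - steal;
}
```

If the host steals three ticks' worth of time and the guest then sees a
single tick, tick_capped() accounts only one tick of steal and leaves the
rest pending, while tick_uncapped() accounts all of it on that one tick.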

Suggested-by: Rik van Riel <riel@xxxxxxxxxx>
Suggested-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
Cc: Radim Krcmar <rkrcmar@xxxxxxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Signed-off-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
---
kernel/sched/cputime.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9858266..a119304 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -263,6 +263,11 @@ void account_idle_time(cputime_t cputime)
cpustat[CPUTIME_IDLE] += (__force u64) cputime;
}

+/*
+ * After a host system is overloaded, the missed clock ticks are not
+ * redelivered to guest later. Due to that, this function may on
+ * occasion account more time than the calling functions think elapsed.
+ */
static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
{
#ifdef CONFIG_PARAVIRT
@@ -371,7 +376,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
* idle, or potentially user or system time. Due to rounding,
* other time can exceed ticks occasionally.
*/
- other = account_other_time(cputime);
+ other = account_other_time(ULONG_MAX);
if (other >= cputime)
return;
cputime -= other;
@@ -486,7 +491,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
}

cputime = cputime_one_jiffy;
- steal = steal_account_process_time(cputime);
+ steal = steal_account_process_time(ULONG_MAX);

if (steal >= cputime)
return;
@@ -516,7 +521,7 @@ void account_idle_ticks(unsigned long ticks)
}

cputime = jiffies_to_cputime(ticks);
- steal = steal_account_process_time(cputime);
+ steal = steal_account_process_time(ULONG_MAX);

if (steal >= cputime)
return;
--
1.9.1