[BUG] Linux process vruntime accounting in virtualized environment

From: suokun
Date: Fri Apr 15 2016 - 15:27:56 EST


In virtualized environments, we sometimes need to limit the CPU
resources available to a virtual machine (VM). For example, in Xen we
use "xl sched-credit -d 1 -c 50" to cap the CPU resource of dom 1 to
half of one physical CPU core. If the VM's CPU resource is capped,
processes inside the VM can suffer a vruntime accounting problem.
Here I report my findings about the Linux process scheduler under
this scenario.


------------Description------------
Linux CFS relies on delta_exec to charge vruntime to processes. The
variable delta_exec is the elapsed time between when a process starts
and stops running on a CPU. This works well on a physical machine.
However, in a virtual machine with capped resources, some processes
may be charged an inaccurate vruntime.

For example, suppose we have a VM with one vCPU that is capped at 50%
of a physical CPU. If process A inside the VM is running when the
VM's CPU resource runs out, the VM is paused. In the next round, when
the VM is allocated new CPU resource and starts running again,
process A stops running and is put back on the runqueue. The
delta_exec of process A is then accounted as its real execution time
plus the time its VM was paused. That makes the vruntime of process A
much larger than it should be, so process A will not be scheduled
again for a long time, until the vruntimes of the other processes
catch up.
---------------------------------------


------------Analysis----------------
When a process stops running and is about to be put back on the
runqueue, update_curr() is executed.

[kernel/sched/fair.c]

static void update_curr(struct cfs_rq *cfs_rq)
{
... ...
delta_exec = now - curr->exec_start;
... ...
curr->exec_start = now;
... ...
curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq, exec_clock, delta_exec);

curr->vruntime += calc_delta_fair(delta_exec, curr);
update_min_vruntime(cfs_rq);
... ...
}

"now" --> the right now time
"exec_start" --> the time when the current process is put on the CPU
"delta_exec" --> the time difference of a process between it starts
and stops running on the CPU

When a process starts running before its VM is paused and stops
running after its VM is unpaused, delta_exec will include the VM's
pause time, which can be very large compared to the real execution
time of the process.

This issue can do great performance harm to the victim process. If
the process runs an I/O-bound workload, its throughput and latency
will suffer. If the process runs a CPU-bound workload, this issue
makes its vruntime unfair compared to other processes under CFS.

Because the CPU resources of some VM types in the cloud are limited
as described above (e.g., Amazon EC2 t2.small instances), I suspect
this also harms the performance of public cloud instances.
---------------------------------------


My test environment is as follows: hypervisor Xen 4.5.0, Dom 0 Linux
3.18.21, Dom U Linux 3.18.21. I also tested the longterm kernel Linux
3.18.30 and the latest longterm kernel, Linux 4.4.7. All of these
kernels have this issue.

Please confirm this bug. Thanks.

--
Tony. S
Ph. D student of University of Colorado, Colorado Springs