Re: [PATCH v2 2/3] vmstat: skip periodic vmstat update for nohz full CPUs
From: Michal Hocko
Date: Mon Jun 05 2023 - 03:56:44 EST
On Fri 02-06-23 15:57:59, Marcelo Tosatti wrote:
> The interruption caused by vmstat_update is undesirable
> for certain applications:
>
> oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
>
> The example above shows an additional 7us for the
>
> oslat -> kworker -> oslat
>
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold.
I personally find the above problem description insufficient. I have
asked several times and only got piece by piece information each time.
Maybe there is a reason to be secretive but it would be great to get at
least some basic expectations described and what they are based on.
E.g. workloads are running on isolated CPUs in nohz_full mode to
shield them from any kernel interruption. Yet there are operations
(like mlock, but not mlock alone) that update per-CPU counters which
will eventually get flushed, and that flushing causes some interference.
Now the host/guest transition and interference: how does that happen when
the guest is running on an isolated and dedicated CPU?
> Skip periodic updates for nohz full CPUs. Any callers who
> need precise values should use a snapshot of the per-CPU
> counters, or use the global counters with measures to
> handle errors up to thresholds (see calculate_normal_threshold).
I would rephrase this paragraph.
In-kernel users of vmstat counters either require a precise value, in
which case they use the zone_page_state_snapshot interface, or they can
live with an imprecision, as the regular flushing can happen at an
arbitrary time and the cumulative error can grow (see
calculate_normal_threshold).
From that POV the regular flushing can be postponed for CPUs that have
been isolated from kernel interference without critical
infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
for all isolated CPUs to avoid interference with the isolated workload.
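To make the precise vs. imprecise distinction concrete, here is a minimal
userspace sketch of the idea. It is not the mm/vmstat.c code: every name
in it (model_mod_state, model_page_state, model_page_state_snapshot,
model_flush_cpu, NR_MODEL_CPUS, MODEL_THRESHOLD) is hypothetical, and the
fixed threshold merely stands in for whatever calculate_normal_threshold
would pick. It only shows why a reader of the global counter can be off
by up to the threshold per CPU, while a snapshot-style reader is not.

```c
#include <assert.h>

#define NR_MODEL_CPUS   4   /* hypothetical CPU count for this model */
#define MODEL_THRESHOLD 32  /* hypothetical per-CPU threshold */

static long model_global;                  /* models the global counter */
static long model_cpu_diff[NR_MODEL_CPUS]; /* models per-CPU pending deltas */

/* Accumulate a delta on one CPU; fold it into the global counter only
 * once the pending per-CPU value exceeds the threshold. */
static void model_mod_state(int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_MODEL_CPUS);

	long x = model_cpu_diff[cpu] + delta;

	if (x > MODEL_THRESHOLD || x < -MODEL_THRESHOLD) {
		model_global += x;
		x = 0;
	}
	model_cpu_diff[cpu] = x;
}

/* Cheap read: global counter only, may lag behind the per-CPU deltas. */
static long model_page_state(void)
{
	return model_global;
}

/* Precise read: global counter plus every pending per-CPU delta. */
static long model_page_state_snapshot(void)
{
	long x = model_global;

	for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++)
		x += model_cpu_diff[cpu];
	return x;
}

/* What a periodic flush does for one CPU: fold its delta unconditionally. */
static void model_flush_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_MODEL_CPUS);

	model_global += model_cpu_diff[cpu];
	model_cpu_diff[cpu] = 0;
}
```

In this model, skipping model_flush_cpu() for a CPU never changes what
model_page_state_snapshot() returns. It only leaves model_page_state()
stale by at most MODEL_THRESHOLD for that CPU.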
> Suggested by Michal Hocko.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxxx>
>
> ---
>
> v2: use cpu_is_isolated (Michal Hocko)
>
> Index: linux-vmstat-remote/mm/vmstat.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmstat.c
> +++ linux-vmstat-remote/mm/vmstat.c
> @@ -28,6 +28,7 @@
> #include <linux/mm_inline.h>
> #include <linux/page_ext.h>
> #include <linux/page_owner.h>
> +#include <linux/sched/isolation.h>
>
> #include "internal.h"
>
> @@ -2022,6 +2023,16 @@ static void vmstat_shepherd(struct work_
> for_each_online_cpu(cpu) {
> struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>
> + /*
> + * Skip periodic updates for isolated CPUs.
> + * Any callers who need precise values should use
> + * a snapshot of the per-CPU counters, or use the global
> + * counters with measures to handle errors up to
> + * thresholds (see calculate_normal_threshold).
> + */
> + if (cpu_is_isolated(cpu))
> + continue;
> +
> if (!delayed_work_pending(dw) && need_update(cpu))
> queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
>
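For illustration only, the effect of the hunk above can be modelled in
userspace roughly as follows. This is a sketch under stated assumptions,
not kernel code: struct model_cpu, model_need_update and model_shepherd
are hypothetical stand-ins for the per-CPU state, need_update() and the
vmstat_shepherd loop, and the isolated flag stands in for whatever
cpu_is_isolated() reports.

```c
#include <stdbool.h>

#define SHEPHERD_NR_CPUS 4 /* hypothetical CPU count for this model */

/* Hypothetical per-CPU state for the model. */
struct model_cpu {
	bool isolated;     /* stands in for cpu_is_isolated(cpu) */
	bool work_pending; /* stands in for delayed_work_pending(dw) */
	long diff;         /* pending, not yet flushed counter delta */
};

static struct model_cpu model_cpus[SHEPHERD_NR_CPUS];

/* Stand-in for need_update(): is there anything to flush on this CPU? */
static bool model_need_update(const struct model_cpu *c)
{
	return c->diff != 0;
}

/* One shepherd pass: queue work on CPUs that need it, but never on
 * isolated CPUs. Returns the number of CPUs that got work queued. */
static int model_shepherd(void)
{
	int queued = 0;

	for (int cpu = 0; cpu < SHEPHERD_NR_CPUS; cpu++) {
		struct model_cpu *c = &model_cpus[cpu];

		if (c->isolated)
			continue; /* leave the isolated workload alone */

		if (!c->work_pending && model_need_update(c)) {
			c->work_pending = true;
			queued++;
		}
	}
	return queued;
}
```

The pending delta on an isolated CPU simply stays where it is until
something else folds it, which is exactly the imprecision discussed above.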
--
Michal Hocko
SUSE Labs