Re: [PATCH v2 2/3] vmstat: skip periodic vmstat update for nohz full CPUs

From: Marcelo Tosatti
Date: Mon Jun 05 2023 - 11:47:45 EST

Next message: Marcelo Tosatti: "Re: [PATCH v2 3/3] mm/vmstat: do not refresh stats for nohz_full CPUs"
Previous message: Peter Xu: "Re: [PATCH v2 03/11] selftests/mm: fix "warning: expression which evaluates to zero..." in mlock2-tests.c"
In reply to: Michal Hocko: "Re: [PATCH v2 2/3] vmstat: skip periodic vmstat update for nohz full CPUs"
Next in thread: Michal Hocko: "Re: [PATCH v2 2/3] vmstat: skip periodic vmstat update for nohz full CPUs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Jun 05, 2023 at 09:55:57AM +0200, Michal Hocko wrote:
> On Fri 02-06-23 15:57:59, Marcelo Tosatti wrote:
> > The interruption caused by vmstat_update is undesirable
> > for certain applications:
> >
> > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> >
> > The example above shows an additional 7us for the
> >
> > oslat -> kworker -> oslat
> >
> > switches. In the case of a virtualized CPU, and the vmstat_update
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable
> > latency threshold.
>
> I personally find the above problem description insufficient. I have
> asked several times and only got piece by piece information each time.
> Maybe there is a reason to be secretive but it would be great to get at
> least some basic expectations described and what they are based on.

There is no reason to be secretive.

>
> E.g. workloads are running on isolated cpus with nohz full mode to
> shield off any kernel interruption. Yet there are operations that update
> counters (like mlock, but not mlock alone) that update per cpu counters
> that will eventually get flushed and that will cause some interference.
> Now the host/guest transition and interference. How does that happen when
> the guest is running on an isolated and dedicated cpu?

The updated changelog follows. Does it contain the requested
information?

----

Performance details for the kworker interruption:

Consider workloads running on isolated CPUs in nohz_full mode to shield
them from any kernel interruption, for example a VM running a
time-sensitive application with a 50us maximum acceptable interruption
(use case: soft PLC).

oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold.
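
The 7us figure can be reproduced from the trace timestamps: it is the gap
between the workqueue_queue_work event (1094.456971) and the sched_switch
back to oslat (1094.456978). A minimal sketch of that arithmetic follows;
the helper names are hypothetical and the parser assumes the six-digit
fractional timestamps shown in the trace.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Convert a "seconds.microseconds" trace timestamp to microseconds.
 * Assumption: exactly six fractional digits, as in the trace above.
 */
static int64_t trace_ts_to_us(const char *ts)
{
	long long sec = 0, usec = 0;
	int n = sscanf(ts, "%lld.%6lld", &sec, &usec);

	assert(n == 2);	/* both fields must be present */
	(void)n;
	return (int64_t)sec * 1000000 + usec;
}

/* Time between the work being queued and the workload running again. */
static int64_t interruption_us(const char *queued, const char *resumed)
{
	return trace_ts_to_us(resumed) - trace_ts_to_us(queued);
}
```

Calling interruption_us("1094.456971", "1094.456978") gives the host-side
cost of the oslat -> kworker -> oslat round trip in microseconds.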

The isolated vCPU can perform operations that modify per-CPU page counters,
for example to complete I/O operations:

CPU 11/KVM-9540 [001] dNh1. 2314.248584: mod_zone_page_state <-__folio_end_writeback
CPU 11/KVM-9540 [001] dNh1. 2314.248585: <stack trace>
=> 0xffffffffc042b083
=> mod_zone_page_state
=> __folio_end_writeback
=> folio_end_writeback
=> iomap_finish_ioend
=> blk_mq_end_request_batch
=> nvme_irq
=> __handle_irq_event_percpu
=> handle_irq_event
=> handle_edge_irq
=> __common_interrupt
=> common_interrupt
=> asm_common_interrupt
=> vmx_do_interrupt_nmi_irqoff
=> vmx_handle_exit_irqoff
=> vcpu_enter_guest
=> vcpu_run
=> kvm_arch_vcpu_ioctl_run
=> kvm_vcpu_ioctl
=> __x64_sys_ioctl
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

> > Skip periodic updates for nohz full CPUs. Any callers who
> > need precise values should use a snapshot of the per-CPU
> > counters, or use the global counters with measures to
> > handle errors up to thresholds (see calculate_normal_threshold).
>
> I would rephrase this paragraph.
> In-kernel users of vmstat counters either require the precise value and
> use the zone_page_state_snapshot interface, or they can live with
> imprecision, as the regular flushing can happen at an arbitrary time and
> the cumulative error can grow (see calculate_normal_threshold).

> From that POV the regular flushing can be postponed for CPUs that have
> been isolated from the kernel interference without critical
> infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
> for all isolated CPUs to avoid interference with the isolated workload.
>
> > Suggested by Michal Hocko.
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
>
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>

OK, updated comment, thanks.
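
For readers following the thread, the reasoning above can be illustrated
with a small userspace model. This is only a sketch, not the kernel code:
the struct, the function names, NR_CPUS and THRESHOLD are all hypothetical,
and the real threshold comes from calculate_normal_threshold. It shows
per-CPU deltas being folded into a global counter, a snapshot-style read
that stays precise, and a periodic flush that skips isolated CPUs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_CPUS   4	/* hypothetical machine size */
#define THRESHOLD 8	/* hypothetical per-CPU fold threshold */

struct model {
	long global;		/* shared counter, may lag behind */
	long diff[NR_CPUS];	/* per-CPU deltas not yet folded */
	bool isolated[NR_CPUS];	/* CPUs the periodic flush must skip */
};

/* Update from one CPU: fold only once the local delta exceeds the threshold. */
static void model_mod_state(struct model *m, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	m->diff[cpu] += delta;
	if (labs(m->diff[cpu]) > THRESHOLD) {
		m->global += m->diff[cpu];
		m->diff[cpu] = 0;
	}
}

/* Imprecise read: ignores pending per-CPU deltas. */
static long model_read_global(const struct model *m)
{
	return m->global;
}

/* Precise read: global value plus every pending per-CPU delta. */
static long model_read_snapshot(const struct model *m)
{
	long v = m->global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		v += m->diff[cpu];
	return v;
}

/*
 * Periodic flush that leaves isolated CPUs alone.
 * Returns how many CPUs had work done on them (i.e. were disturbed).
 */
static int model_periodic_flush(struct model *m)
{
	int flushed = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (m->isolated[cpu] || m->diff[cpu] == 0)
			continue;
		m->global += m->diff[cpu];
		m->diff[cpu] = 0;
		flushed++;
	}
	return flushed;
}
```

In this model the delta left on an isolated CPU is never lost: the snapshot
read still sees it, and it reaches the global counter once it crosses the
threshold. Readers of the global counter only see an error bounded by the
threshold for each CPU that was skipped.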
