Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

From: Marcelo Tosatti
Date: Fri May 19 2017 - 10:34:48 EST


Hi Christoph,

On Tue, May 16, 2017 at 08:37:11AM -0500, Christoph Lameter wrote:
> On Mon, 15 May 2017, Marcelo Tosatti wrote:
>
> > > NOHZ already does that. I wanted to know what your problem is that you
> > > see. The latency issue has already been solved as far as I can tell .
> > > Please tell me why the existing solutions are not sufficient for you.
> >
> > We don't want vmstat_worker to execute on a given CPU, even if the local
> > CPU updates vm-statistics.
>
> Instead of responding you repeat describing what you want.
>
> > Because:
> >
> > vmstat_worker increases latency of the application
> > (i can measure it if you want on a given CPU,
> > how many ns's the following takes:
>
> That still is no use case.

Use-case: realtime application on an isolated core which for some reason
updates vmstatistics.

> Just a measurement of vmstat_worker. Pointless.

Shouldnt the focus be on general scenarios rather than particular
usecases, so that the solution covers a wider range of usecases?

The situation as i see is as follows:

Your point of view is: an "isolated CPU" with a set of applications
cannot update vm statistics, otherwise they pay the vmstat_update cost:

kworker/5:1-245 [005] ....1.. 673.454295: workqueue_execute_start: work struct ffffa0cf6e493e20: function vmstat_update
kworker/5:1-245 [005] ....1.. 673.454305: workqueue_execute_end: work struct ffffa0cf6e493e20

Thats 10us for example.

So if want to customize a realtime setup whose code updates vmstatistic,
you are dead. You have to avoid any systemcall which possibly updates
vmstatistics (now and in the future kernel versions).

> If you move the latency from the vmstat worker into the code thats
> updating the counters then you will require increased use of atomics
> which will increase contention which in turn will significantly
> increase the overall latency.

The point is that these vmstat updates are rare. From
http://www.7-cpu.com/cpu/Haswell.html:

RAM Latency = 36 cycles + 57 ns (3.4 GHz i7-4770)
RAM Latency = 62 cycles + 100 ns (3.6 GHz E5-2699 dual)

Lets round to 100ns = 0.1us.

You need 100 vmstat updates (all misses to RAM, the worst possible case)
to have equivalent amount of time of the batching version.

With more than 100 vmstat updates, then the batching is more efficient
(as in total amount of time to transfer to global counters).

But thats not the point. The point is the 10us interruption
to execution of the realtime app (which can either mean
your current deadline requirements are not met, or that
another application with lowest latency requirement can't
be used).

So i'd rather spend more time updating the aggregate of vmstatistics
(with the local->global transfer taking a small amount of time,
therefore not interrupting the realtime application for a long period),
than to batch the updates (which increases overall performance beyond
a certain number of updates, but which is _ONE_ large interruption).

So lets assume i go and count the vmstat updates on the DPDK case
(or any other realtime app), batching is more efficient
for that case.

Still, the one-time interruption of batching is worse than less
efficient one bean at a time vmstatistics accounting.

No?

Also, you could reply that: "oh, there are no vmstat updates
in fact in this setup, but the logic of disabling vmstat_update
is broken". Lets assume thats the case.

Even if its fixed (vmstat_update properly shut down) the proposed patch
deals with both cases: no vmstat updates on isolated cpus, and vmstat
updates on isolated cpus.

So why are you against integrating this simple, isolated patch which
does not affect how current logic works?