Re: [PATCH] writeback: sum memcg dirty counters as needed

From: Greg Thelen
Date: Wed Mar 27 2019 - 18:30:01 EST


On Fri, Mar 22, 2019 at 11:15 AM Roman Gushchin <guro@xxxxxx> wrote:
>
> On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote:
> > Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
> > memory.stat reporting") memcg dirty and writeback counters are managed
> > as:
> > 1) per-memcg per-cpu values in range of [-32..32]
> > 2) per-memcg atomic counter
> > When a per-cpu counter cannot fit in [-32..32] it's flushed to the
> > atomic. Stat readers only check the atomic.
> > Thus readers such as balance_dirty_pages() may see a nontrivial error
> > margin: 32 pages per cpu.
> > Assuming 100 cpus:
> > 4k x86 page_size: 13 MiB error per memcg
> > 64k ppc page_size: 200 MiB error per memcg
> > Considering that dirty+writeback are used together for some decisions
> > the errors double.
> >
> > This inaccuracy can lead to undeserved oom kills. One nasty case is
> > when all per-cpu counters hold positive values offsetting an atomic
> > negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
> > balance_dirty_pages() only consults the atomic and does not consider
> > throttling the next n_cpu*32 dirty pages. If the file_lru is in the
> > 13..200 MiB range then there's absolutely no dirty throttling, which
> > burdens vmscan with only dirty+writeback pages thus resorting to oom
> > kill.
> >
> > It could be argued that tiny containers are not supported, but it's more
> > subtle. It's the amount the space available for file lru that matters.
> > If a container has memory.max-200MiB of non reclaimable memory, then it
> > will also suffer such oom kills on a 100 cpu machine.
> >
> > The following test reliably ooms without this patch. This patch avoids
> > oom kills.
> >
> > ...
> >
> > Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
> > collect exact per memcg counters when a memcg is close to the
> > throttling/writeback threshold. This avoids the aforementioned oom
> > kills.
> >
> > This does not affect the overhead of memory.stat, which still reads the
> > single atomic counter.
> >
> > Why not use percpu_counter? memcg already handles cpus going offline,
> > so no need for that overhead from percpu_counter. And the
> > percpu_counter spinlocks are more heavyweight than is required.
> >
> > It probably also makes sense to include exact dirty and writeback
> > counters in memcg oom reports. But that is saved for later.
> >
> > Signed-off-by: Greg Thelen <gthelen@xxxxxxxxxx>
> > ---
> > include/linux/memcontrol.h | 33 +++++++++++++++++++++++++--------
> > mm/memcontrol.c | 26 ++++++++++++++++++++------
> > mm/page-writeback.c | 27 +++++++++++++++++++++------
> > 3 files changed, 66 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 83ae11cbd12c..6a133c90138c 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
> > return x;
> > }
>
> Hi Greg!
>
> Thank you for the patch, definitely a good problem to be fixed!
>
> >
> > +/* idx can be of type enum memcg_stat_item or node_stat_item */
> > +static inline unsigned long
> > +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
> > +{
> > + long x = atomic_long_read(&memcg->stat[idx]);
> > +#ifdef CONFIG_SMP
>
> I doubt that this #ifdef is correct without corresponding changes
> in __mod_memcg_state(). As now, we do use per-cpu buffer which spills
> to an atomic value event if !CONFIG_SMP. It's probably something
> that we want to change, but as now, #ifdef CONFIG_SMP should protect
> only "if (x < 0)" part.

Ack. I'll fix it.

> > + int cpu;
> > +
> > + for_each_online_cpu(cpu)
> > + x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
> > + if (x < 0)
> > + x = 0;
> > +#endif
> > + return x;
> > +}
>
> Also, isn't it worth it to generalize memcg_page_state() instead?
> By adding an bool exact argument? I believe dirty balance is not
> the only place, where we need a better accuracy.

Nod. I'll provide a more general version of memcg_page_state(). I'm
testing updated (forthcoming v2) patch set now with feedback from
Andrew and Roman.