Re: [PATCH v1 3/3] virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM

From: Michael S. Tsirkin
Date: Sun Feb 16 2020 - 04:46:27 EST

On Fri, Feb 14, 2020 at 12:48:42PM -0800, Tyler Sanderson wrote:
> Regarding Wei's patch that modifies the shrinker implementation, versus this
> patch which reverts to OOM notifier:
> I am in favor of both patches. But I do want to make sure a fix gets
> backported to 4.19, where the performance regression was first introduced.
> My concern with reverting to the OOM notifier is, as mst@ put it (in the other
> thread):
> "when linux hits OOM all kind of error paths are being hit, latent bugs start
> triggering, latency goes up drastically."
> The guest could be in a lot of pain before the OOM notifier is invoked, and it
> seems like the shrinker API might allow more fine grained control of when we
> deflate.
> On the other hand, I'm not totally convinced that Wei's patch is an expected
> use of the shrinker/page-cache APIs, and maybe it is fragile. Needs more
> testing and scrutiny.
> It seems to me like the shrinker API is the right API in the long run, perhaps
> with some fixes and modifications. But maybe reverting to the OOM notifier
> is the best patch to backport?

In that case can I see some Tested-by reports pls?

> On Fri, Feb 14, 2020 at 6:19 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >> There was a report that this results in undesired side effects when
> >> inflating the balloon to shrink the page cache. [1]
> >>      "When inflating the balloon against page cache (i.e. no free memory
> >>       remains) vmscan.c will both shrink page cache, but also invoke the
> >>       shrinkers -- including the balloon's shrinker. So the balloon
> >>       driver allocates memory which requires reclaim, vmscan gets this
> >>       memory by shrinking the balloon, and then the driver adds the
> >>       memory back to the balloon. Basically a busy no-op."
> >>
> >> The name "deflate on OOM" makes it pretty clear when deflation should
> >> happen - after other approaches to reclaim memory failed, not while
> >> reclaiming. This minimizes the footprint of a guest - memory
> >> will only be taken out of the balloon when really needed.
> >>
> >> Especially, a drop_slab() will result in the whole balloon getting
> >> deflated - undesired.
> >
> > Could you explain why some more? drop_caches shouldn't be really used in
> > any production workloads and if somebody really wants all the cache to
> > be dropped then why is balloon any different?
> >
> Deflation should happen when the guest is out of memory, not when
> somebody thinks it's time to reclaim some memory. That's what the
> feature promised from the beginning: Only give the guest more memory in
> case it *really* needs more memory.
> Deflate on oom, not deflate on reclaim/memory pressure. (that's what the
> report was all about)
> A priority for shrinkers might be a step in the right direction.
> --
> Thanks,
> David / dhildenb
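
To make the "busy no-op" from the quoted report concrete, here is a minimal user-space sketch of the two policies under discussion. It is a toy model, not kernel code: struct guest, enum deflate_policy, reclaim_one() and inflate() are hypothetical names invented for illustration, and the assumption that one reclaim pass takes exactly one page from the page cache and (in the shrinker case) one page from the balloon is a simplification, not how vmscan.c actually balances its work.

```c
#include <assert.h>

/* Toy model only: none of these names exist in the kernel. */
struct guest {
	long free_pages;    /* pages currently free in the guest */
	long cache_pages;   /* reclaimable page cache */
	long balloon_pages; /* pages currently held by the balloon */
};

enum deflate_policy {
	DEFLATE_ON_RECLAIM, /* shrinker-like: balloon is shrunk during reclaim */
	DEFLATE_ON_OOM      /* OOM-notifier-like: balloon is a last resort */
};

/* One simplified reclaim pass. Returns the number of pages deflated. */
long reclaim_one(struct guest *g, enum deflate_policy policy)
{
	long deflated = 0;

	/* Both policies reclaim page cache first, if there is any. */
	if (g->cache_pages > 0) {
		g->cache_pages--;
		g->free_pages++;
	}
	/* Shrinker model: reclaim also pulls a page out of the balloon. */
	if (policy == DEFLATE_ON_RECLAIM && g->balloon_pages > 0) {
		g->balloon_pages--;
		g->free_pages++;
		deflated++;
	}
	/* OOM model: deflate only if reclaim above freed nothing. */
	if (policy == DEFLATE_ON_OOM && g->free_pages == 0 &&
	    g->balloon_pages > 0) {
		g->balloon_pages--;
		g->free_pages++;
		deflated++;
	}
	return deflated;
}

/*
 * Try to inflate the balloon by one page per step.
 * Returns how many pages were deflated while inflating (wasted work).
 */
long inflate(struct guest *g, long steps, enum deflate_policy policy)
{
	long wasted = 0;

	for (long i = 0; i < steps; i++) {
		if (g->free_pages == 0)
			wasted += reclaim_one(g, policy);
		if (g->free_pages == 0)
			break; /* nothing left to reclaim */
		g->free_pages--;
		g->balloon_pages++;
		assert(g->free_pages >= 0 && g->balloon_pages >= 0);
	}
	return wasted;
}
```

Starting from no free pages and 100 pages of page cache, 50 inflate steps in this model leave 50 pages in the balloon with no deflation under DEFLATE_ON_OOM. Under DEFLATE_ON_RECLAIM the same 50 steps leave only 25 pages in the balloon, because 25 pages were deflated along the way and then inflated again.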