Re: [LSF/MM TOPIC] VM containers

From: Johannes Weiner
Date: Wed Jan 27 2016 - 13:41:36 EST

On Wed, Jan 27, 2016 at 06:48:31PM +0300, Vladimir Davydov wrote:
> On Fri, Jan 22, 2016 at 12:11:21PM -0500, Johannes Weiner wrote:
> > Hi,
> >
> > On Fri, Jan 22, 2016 at 10:56:15AM -0500, Rik van Riel wrote:
> > > I am trying to gauge interest in discussing VM containers at the LSF/MM
> > > summit this year. Projects like ClearLinux, Qubes, and others are all
> > > trying to use virtual machines as better isolated containers.
> > >
> > > That changes some of the goals the memory management subsystem has,
> > > from "use all the resources effectively" to "use as few resources as
> > > necessary, in case the host needs the memory for something else".
> >
> > I would be very interested in discussing this topic, because I think
> > the issue is more generic than these VM applications. We are facing
> > the same issues with regular containers, where aggressive caching is
> > counteracting the desire to cut down workloads to their bare minimum
> > in order to pack them as tightly as possible.
> >
> > With per-cgroup LRUs and thrash detection, we have infrastructure in
> By thrash detection, do you mean vmpressure?

I mean mm/workingset.c; we'd have to look at actual refaults.

Reclaim efficiency is not a meaningful measure of memory pressure. You
could be reclaiming happily and successfully every single cache page
on the LRU, only to have userspace fault them in again right after.
No memory pressure would be detected, even though a ton of IO is
caused by a lack of memory. [ For this reason, I think we should phase
out reclaim efficiency as a metric throughout the VM - vmpressure, LRU
type balancing, OOM invocation etc. - and base it all on thrashing. ]
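
To illustrate the point, here is a toy userspace model, not kernel code.
It assumes a plain LRU of clean cache pages and a workload that loops
over a working set slightly larger than the cache. The function name and
its parameters are made up for this sketch.

```python
from collections import OrderedDict


def simulate_cyclic_workload(working_set_pages, cache_pages, passes):
    """Toy model: loop over a working set with a strict LRU page cache.

    Every evicted page is assumed to be clean cache, so each reclaim
    attempt succeeds (one page scanned, one page reclaimed).
    """
    lru = OrderedDict()   # resident pages, oldest first
    evicted = set()       # pages that were resident once and got reclaimed
    scanned = reclaimed = refaults = accesses = 0

    for _ in range(passes):
        for page in range(working_set_pages):
            accesses += 1
            if page in lru:
                lru.move_to_end(page)      # cache hit, refresh recency
                continue
            if page in evicted:
                refaults += 1              # page comes back after reclaim
                evicted.discard(page)
            if len(lru) >= cache_pages:
                victim = lru.popitem(last=False)[0]
                scanned += 1
                reclaimed += 1             # clean page: reclaim never fails
                evicted.add(victim)
            lru[page] = True

    return {
        "accesses": accesses,
        "refaults": refaults,
        "reclaimed": reclaimed,
        # None when reclaim never ran at all
        "reclaim_efficiency": reclaimed / scanned if scanned else None,
    }


if __name__ == "__main__":
    print(simulate_cyclic_workload(100, 80, 10))
```

With a 100-page working set and an 80-page cache, reclaim efficiency in
this model stays at its maximum while almost every access is a refault.
A pressure metric based on reclaimed/scanned sees nothing wrong here; a
refault count does.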

> > place that could allow us to accomplish this. Right now we only enter
> > reclaim once memory runs out, but we could add an allocation mode that
> > would prefer to always reclaim from the local LRU before increasing
> > the memory footprint, and only expand once we detect thrashing in the
> > page cache. That would keep the workloads neatly trimmed at all times.
> I don't get it. Do you mean a sort of special GFP flag that would force
> the caller to reclaim before actual charging/allocation? Or is it
> supposed to be automatic, basing on how memcg is behaving? If the
> latter, I suppose it could be already done by a userspace daemon by
> adjusting memory.high as needed, although it's unclear how to do it
> optimally.

Yes, essentially we'd have a target footprint that we increase only
when cache refaults (or swapins) are detected.

This could be memory.high and a userspace daemon.
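
A rough sketch of what the daemon variant might look like, under several
assumptions: a cgroup2 hierarchy, a memory.stat that exposes refault
counters under keys starting with "workingset_refault" (the exact key
names depend on the kernel version), and a made-up policy with a fixed
step size and refault tolerance. All function names and parameters here
are hypothetical. Swapins are left out for brevity.

```python
import os


def parse_memory_stat(text):
    """Parse 'key value' lines, as found in memory.stat, into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def refault_count(stats):
    """Sum all counters whose name starts with 'workingset_refault'.

    Assumption: the key naming differs between kernels, so match by prefix.
    """
    return sum(v for k, v in stats.items() if k.startswith("workingset_refault"))


def next_target(current_target, prev_refaults, cur_refaults,
                step=16 << 20, tolerance=0):
    """Grow the target footprint only when refaults were observed.

    Without refaults the target is held, so new cache has to come out of
    reclaim from the group's own LRU rather than a bigger footprint.
    """
    if cur_refaults - prev_refaults > tolerance:
        return current_target + step
    return current_target


def apply_target(cgroup_dir, target_bytes):
    """Write the target to memory.high of the given cgroup2 directory."""
    with open(os.path.join(cgroup_dir, "memory.high"), "w") as f:
        f.write(str(target_bytes))


def read_refaults(cgroup_dir):
    """Read the current refault total from the cgroup's memory.stat."""
    with open(os.path.join(cgroup_dir, "memory.stat")) as f:
        return refault_count(parse_memory_stat(f.read()))
```

A daemon would call read_refaults() periodically, feed the previous and
current totals into next_target(), and write the result back with
apply_target(). Choosing the step and the tolerance is the hard part
that this sketch does not address.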

We could also put it in the kernel so it's useful out of the box.

It could be a watermark for the page allocator to work in virtualized
environments.

> > For virtualized environments, the thrashing information would be
> > communicated slightly differently to the page allocator and/or the
> > host, but otherwise the fundamental principles should be the same.
> >
> > We'd have to figure out how to balance the aggressiveness there and
> > how to describe this to the user, as I can imagine that users would
> > want to tune this based on a tolerance for the degree of thrashing: if
> > pages are used every M ms, keep them cached; if pages are used every N
> > ms, freeing up the memory and refetching them from disk is better etc.
> Sounds reasonable. What about adding a parameter to memcg that would
> define ws access time? So that it would act just like memory.low, but in
> terms of lruvec age instead of lruvec size. I mean, we keep track of
> lruvec ages and scan those lruvecs whose age is > ws access time before
> others. That would protect those workloads that access their ws quite,
> but not very often from streaming workloads which can generate a lot of
> useless pressure.

I'm not following here. Which lruvec age?