Re: On guest free page hinting and OOM

From: Alexander Duyck
Date: Tue Apr 02 2019 - 13:30:46 EST


On Tue, Apr 2, 2019 at 8:56 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 02.04.19 17:04, Alexander Duyck wrote:
> > On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 01.04.19 22:56, Alexander Duyck wrote:
> >>> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
> >>>>>> The interesting thing is most probably: Will the hinting size usually be
> >>>>>> reasonable small? At least I guess a guest with 4TB of RAM will not
> >>>>>> suddenly get a hinting size of hundreds of GB. Most probably also only
> >>>>>> something in the range of 1GB. But this is an interesting question to
> >>>>>> look into.
> >>>>>>
> >>>>>> Also, if the admin does not care about performance implications when
> >>>>>> already close to hinting, no need to add the additional 1Gb to the ram size.
> >>>>>
> >>>>> "close to OOM" is what I meant.
> >>>>
> >>>> Problem is, host admin is the one adding memory. Guest admin is
> >>>> the one that knows about performance.
> >>>
> >>> The thing we have to keep in mind with this is that we are not dealing
> >>> with the same behavior as the balloon driver. We don't need to inflate
> >>> a massive hint and hand that off. Instead we can focus on performing
> >>> the hints on much smaller amounts and do it incrementally over time
> >>> with the idea being as the system sits idle it frees up more and more
> >>> of the inactive memory on the system.
> >>>
> >>> With that said, I still don't like the idea of us even trying to
> >>> target 1GB of RAM for hinting. I think it would be much better if we
> >>> stuck to smaller sizes and kept things down to a single digit multiple
> >>> of THP or higher order pages. Maybe something like 64MB of total
> >>> memory out for hinting.
> >>
> >> 1GB was just a number I came up with. But please note, as VCPUs hint in
> >> parallel, even though each request is only 64MB in size, things can sum up.
> >
> > Why do we need them running in parallel for a single guest? I don't
> > think we need the hints so quickly that we would need to have multiple
> > VCPUs running in parallel to provide hints. In addition as it
> > currently stands in order to get pages into and out of the buddy
> > allocator we are going to have to take the zone lock anyway so we
> > could probably just assume a single thread for pulling the memory,
> > placing it on the ring, and putting it back into the buddy allocator
> > after the hint has been completed.
>
> VCPUs hint when they think the time has come. Hinting in parallel comes
> naturally.

Actually it doesn't because if we are doing it asynchronously we are
having to pull pages out of the zone which requires the zone lock.
That has been one of the reasons why the patches from Nitesh start
dropping in performance when you start enabling more than 1 VCPU. If
we are limited by the zone lock it doesn't make sense for us to try to
do thing in parallel.

> >
> >>>
> >>> All we really would need to make it work would be to possibly look at
> >>> seeing if we can combine PageType values. Specifically what I would be
> >>> looking at is a transition that looks something like Buddy -> Offline
> >>> -> (Buddy | Offline). We would have to hold the zone lock at each
> >>> transition, but that shouldn't be too big of an issue. If we are okay
> >>> with possibly combining the Offline and Buddy types we would have a
> >>> way of tracking which pages have been hinted and which have not. Then
> >>> we would just have to have a thread running in the background on the
> >>> guest that is looking at the higher order pages and pulling 64MB at a
> >>> time offline, and when the hinting is done put them back in the "Buddy
> >>> | Offline" state.
> >>
> >> That approach may have other issues to solve (1 thread vs. many VCPUs,
> >> scanning all buddy pages over and over again) and other implications
> >> that might be undesirable (hints performed even more delayed, additional
> >> thread activity). I wouldn't call it the ultimate solution.
> >
> > So the problem with trying to provide the hint sooner is that you end
> > up creating a bottle-neck or you end up missing hints on pages
> > entirely and then have to fall back to such an approach. By just
> > letting the thread run in the background reporting the idle memory we
> > can avoid much of that.
> >
> > Also there isn't a huge priority to report idle memory in real time.
> > That would be kind of pointless as it might be pulled back out and
> > reused as soon as it is added. What we need is to give the memory a
> > bit of time to "cool" so that we aren't constantly hinting away memory
> > that is still in use.
>
> Depending on the setup, you don't want free memory lying around for too
> long in your guest.

Right, but you don't need it as soon as it is freed either. If it is
idle in the guest for a few seconds that shouldn't be an issue. The
free page hinting will hurt performance if we are doing it too often
simply because we are going to be triggering a much higher rate of
page faults.

> >
> >> Your approach sounds very interesting to play with, however
> >> at this point I would like to avoid throwing away Nitesh work once again
> >> to follow some other approach that looks promising. If we keep going
> >> like that, we'll spend another ~10 years working on free page hinting
> >> without getting anything upstream. Especially if it involves more
> >> core-MM changes. We've been there, we've done that. As long as the
> >> guest-host interface is generic enough, we can play with such approaches
> >> later in the guest. Important part is that the guest-host interface
> >> allows for that.
> >
> > I'm not throwing anything away. One of the issues in Nitesh's design
> > is that he is going to either miss memory and have to run an
> > asynchronous thread to clean it up after the fact, or he is going to
> > cause massive OOM errors and/or have to start halting VCPUs while
>
> 1. how are we going to miss memory. We are going to miss memory because
> we hint on very huge chunks, but we all agreed to live with that for now.

What I am talking about is that some application frees gigabytes of
memory. As I recall the queue length for a single cpu is only like 1G.
Are we going to be sitting on the backlog of most of system memory
while we process it 1G at a time?

> 2. What are the "massive OOM" errors you are talking about? We have the
> one scenario we described Nitesh was not even able to reproduce yet. And
> we have ways to mitigate the problem (discussed in this thread).

So I am referring to the last patch set I have seen. Last I knew all
the code was doing was assembling lists if isolated pages and placing
them on a queue. I have seen no way that this really limits the length
of the virtqueue, and the length of the isolated page lists is the
only thing that has any specific limits to it. So I see it easily
being possible for a good portion of memory being consumed by the
queue when you consider that what you have is essentially the maximum
length of the isolated page list multiplied by the number of entries
in a virtqueue.

> We have something that seems to work. Let's work from there instead of
> scrapping the general design once more, thinking "it is super easy". And
> yes, what you propose is pretty much throwing away the current design in
> the guest.

Define "work"? The last patch set required massive fixes as it was
causing kernel panics if more than 1 VCPU was enabled and list
corruption in general. I'm sure there are a ton more bugs lurking as
we have only begun to be able to stress this code in any meaningful
way.

For example what happens if someone sits on the mm write semaphore for
an extended period of time on the host? That will shut down all of the
hinting until that is released, and at that point once again any
hinting queues will be stuck on the guest until they can be processed
by the host.

> > waiting on the processing. All I am suggesting is that we can get away
> > from having to deal with both by just walking through the free pages
> > for the higher order and hinting only a few at a time without having
> > to try to provide the host with the hints on what is idle the second
> > it is freed.
> >
> >>>
> >>> I view this all as working not too dissimilar to how a standard Rx
> >>> ring in a network device works. Only we would want to allocate from
> >>> the pool of "Buddy" pages, flag the pages as "Offline", and then when
> >>> the hint has been processed we would place them back in the "Buddy"
> >>> list with the "Offline" value still set. The only real changes needed
> >>> to the buddy allocator would be to add some logic for clearing/merging
> >>> the "Offline" setting as necessary, and to provide an allocator that
> >>> only works with non-"Offline" pages.
> >>
> >> Sorry, I had to smile at the phrase "only" in combination with "provide
> >> an allocator that only works with non-Offline pages" :) . I guess you
> >> realize yourself that these are core-mm changes that might easily be
> >> rejected upstream because "the virt guys try to teach core-MM yet
> >> another special case". I agree that this is nice to play with,
> >> eventually that approach could succeed and be accepted upstream. But I
> >> consider this long term work.
> >
> > The actual patch for this would probably be pretty small and compared
> > to some of the other stuff that has gone in recently isn't too far out
> > of the realm of possibility. It isn't too different then the code that
> > has already done in to determine the unused pages for virtio-balloon
> > free page hinting.
> >
> > Basically what we would be doing is providing a means for
> > incrementally transitioning the buddy memory into the idle/offline
> > state to reduce guest memory overhead. It would require one function
> > that would walk the free page lists and pluck out pages that don't
> > have the "Offline" page type set, a one-line change to the logic for
> > allocating a page as we would need to clear that extra bit of state,
> > and optionally some bits for how to handle the merge of two "Offline"
> > pages in the buddy allocator (required for lower order support). It
> > solves most of the guest side issues with the free page hinting in
> > that trying to do it via the arch_free_page path is problematic at
> > best since it was designed for a synchronous setup, not an
> > asynchronous one.
>
> This is throwing away work. No I don't think this is the right path to
> follow for now. Feel free to look into it while Nitesh gets something in
> shape we know conceptually works and we are starting to know which
> issues we are hitting.

Yes, it is throwing away work. But if the work is running toward a
dead end does it add any value?

I've been looking into the stuff Nitesh has been doing. I don't know
about others, but I have been testing it. That is why I provided the
patches I did to get it stable enough for me to test and address the
regressions it was causing. That is the source of some of my concern.
I think we have been making this overly complex with all the per-cpu
bits and trying to place this in the free path itself. We really need
to scale this back and look at having a single thread with a walker of
some sort just hinting on what memory is sitting in the buddy but not
hinted on. It is a solution that would work, even in a multiple VCPU
case, and is achievable in the short term.