Re: [RFC][Patch v9 2/6] KVM: Enables the kernel to isolate guest free pages

From: Alexander Duyck
Date: Fri Mar 08 2019 - 14:25:46 EST


On Fri, Mar 8, 2019 at 11:10 AM Nitesh Narayan Lal <nitesh@xxxxxxxxxx> wrote:
>
>
> On 3/8/19 1:06 PM, Alexander Duyck wrote:
> > On Thu, Mar 7, 2019 at 6:32 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >> On Thu, Mar 07, 2019 at 02:35:53PM -0800, Alexander Duyck wrote:
> >>> The only other thing I still want to try and see if I can do is to add
> >>> a jiffies value to the page private data in the case of the buddy
> >>> pages.
> >> Actually there's one extra thing I think we should do, and that is make
> >> sure we do not leave less than X% off the free memory at a time.
> >> This way chances of triggering an OOM are lower.
> > If nothing else we could probably look at doing a watermark of some
> > sort so we have to have X amount of memory free but not hinted before
> > we will start providing the hints. It would just be a matter of
> > tracking how much memory we have hinted on versus the amount of memory
> > that has been pulled from that pool.
> This is to avoid false OOM in the guest?

Partially, though it would still be possible. Basically it would just
be a way of determining when we have hinted "enough". Basically it
doesn't do us much good to be hinting on free memory if the guest is
already constrained and just going to reallocate the memory shortly
after we hinted on it. The idea is with a watermark we can avoid
hinting until we start having pages that are actually going to stay
free for a while.

> > It is another reason why we
> > probably want a bit in the buddy pages somewhere to indicate if a page
> > has been hinted or not as we can then use that to determine if we have
> > to account for it in the statistics.
>
> The one benefit which I can see of having an explicit bit is that it
> will help us to have a single hook away from the hot path within buddy
> merging code (just like your arch_merge_page) and still avoid duplicate
> hints while releasing pages.
>
> I still have to check PG_idle and PG_young which you mentioned but I
> don't think we can reuse any existing bits.

Those are bits that are already there for 64b. I think those exist in
the page extension for 32b systems. If I am not mistaken they are only
used in VMA mapped memory. What I was getting at is that those are the
bits we could think about reusing.

> If we really want to have something like a watermark, then can't we use
> zone->free_pages before isolating to see how many free pages are there
> and put a threshold on it? (__isolate_free_page() does a similar thing
> but it does that on per request basis).

Right. That is only part of it though since that tells you how many
free pages are there. But how many of those free pages are hinted?
That is the part we would need to track separately and then then
compare to free_pages to determine if we need to start hinting on more
memory or not.

> >
> >>> With that we could track the age of the page so it becomes
> >>> easier to only target pages that are truly going cold rather than
> >>> trying to grab pages that were added to the freelist recently.
> >> I like that but I have a vague memory of discussing this with Rik van
> >> Riel and him saying it's actually better to take away recently used
> >> ones. Can't see why would that be but maybe I remember wrong. Rik - am I
> >> just confused?
> > It is probably to cut down on the need for disk writes in the case of
> > swap. If that is the case it ends up being a trade off.
> >
> > The sooner we hint the less likely it is that we will need to write a
> > given page to disk. However the sooner we hint, the more likely it is
> > we will need to trigger a page fault and pull back in a zero page to
> > populate the last page we were working on. The sweet spot will be that
> > period of time that is somewhere in between so we don't trigger
> > unnecessary page faults and we don't need to perform additional swap
> > reads/writes.
> --
> Regards
> Nitesh
>