Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages

From: David Hildenbrand
Date: Mon Feb 11 2019 - 04:28:50 EST

On 10.02.19 01:38, Michael S. Tsirkin wrote:
> On Fri, Feb 08, 2019 at 02:05:09PM -0800, Alexander Duyck wrote:
>> On Fri, Feb 8, 2019 at 1:38 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>> On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote:
>>>>>> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
>>>>>> However I am still thinking about a workload which I can use to test its
>>>>>> effectiveness.
>>>>> You might want to look at doing something like min(MAX_ORDER - 1,
>>>>> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
>>>>> THP which is the most likely to be used page size with the guest.
>>>> Sure, thanks for the suggestion.
>>> Given current hinting in balloon is MAX_ORDER I'd say
>>> share code. If you feel a need to adjust down the road,
>>> adjust both of them with actual testing showing gains.
>> Actually I'm left kind of wondering why we are even going through
>> virtio-balloon for this?
> Just look at what does it do.
> It improves memory overcommit if guests are cooperative, and it does
> this by giving the hypervisor addresses of pages which it can discard.
> It's just *exactly* like the balloon with all the same limitations.

I agree, this belongs to virtio-balloon *unless* we run into real
problems implementing it via an asynchronous mechanism.

>> It seems like this would make much more sense
>> as core functionality of KVM itself for the specific architectures
>> rather than some side thing.

Whatever can be handled in user space and does not have significant
performance impacts should be handled in user space. If we run into real
problems with that approach, fair enough. (e.g. vcpu yielding is a good
example where an implementation in KVM makes sense, not going via QEMU)

> Well same as balloon: whether it's useful to you at all
> would very much depend on your workloads.
> This kind of cooperative functionality is good for co-located
> single-tenant VMs. That's pretty niche. The core things in KVM
> generally don't trust guests.
>> In addition this could end up being
>> redundant when you start getting into either the s390 or PowerPC
>> architectures as they already have means of providing unused page
>> hints.

I'd like to note that on s390x the functionality is not provided when
running nested guests. And there are real problems getting it ever
supported. (see description below how it works on s390x, the issue for
nested guests are the bits in the guest -> host page tables we cannot
support for nested guests).

Hinting only works for guests running one level under LPAR (with a
recent machine), but not nested guests.

(LPAR -> KVM1 works, LPAR - KVM1 -> KVM2 foes not work for the latter)

So an implementation for s390 would still make sense for this scenario.

> Interesting. Is there host support in kvm?

On s390x there is. It works on page granularity and synchronization
between guest/host ("don't drop a page in the host while the guest is
reusing it") is done via special bits in the host->guest page table.
Instructions in the guest are able to modify these bits. A guest can
configure a "usage state" of it's backed PTEs. E.g. "unused" or "stable".

Whenever a page in the guest is freed/reused, the ESSA instruction is
triggered in the guest. It will modify the page table bits and add the
guest phyical pfn to a buffer in the host. Once that buffer is full,
ESSA will trigger an intercept to the hypervisor. Here, all these
"unused" pages can be zapped.

Also, when swapping a page out in the hypervisor, if it was masked by
the guest as unused or logically zero, instead of swapping out the page,
it can simply be dropped and a fresh zero page can be supplied when the
guest tries to access it.

"ESSA" is implemented in KVM in arch/s390/kvm/priv.c:handle_essa().

So on s390x, it works because the synchronization with the hypervisor is
directly built into hw vitualization support (guest->host page tables +
instruction) and ESSA will not intercept on every call (due to the buffer).
>> I have a set of patches I proposed that add similar functionality via
>> a KVM hypercall for x86 instead of doing it as a part of a Virtio
>> device[1]. I'm suspecting the overhead of doing things this way is
>> much less then having to make multiple madvise system calls from QEMU
>> back into the kernel.
> Well whether it's a virtio device is orthogonal to whether it's an
> madvise call, right? You can build vhost-pagehint and that can
> handle requests in a VQ within balloon and do it
> within host kernel directly.
> virtio rings let you pass multiple pages so it's really hard to
> say which will win outright - maybe it's more important
> to coalesce exits.

We don't know until we measure it.



David / dhildenb