Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

From: Christian Borntraeger
Date: Tue May 18 2021 - 12:36:19 EST




On 18.05.21 18:34, Claudio Imbrenda wrote:
On Tue, 18 May 2021 18:20:22 +0200
Christian Borntraeger <borntraeger@xxxxxxxxxx> wrote:

On 18.05.21 18:13, Claudio Imbrenda wrote:
On Tue, 18 May 2021 17:45:18 +0200
Christian Borntraeger <borntraeger@xxxxxxxxxx> wrote:
On 18.05.21 17:36, Claudio Imbrenda wrote:
On Tue, 18 May 2021 17:05:37 +0200
Cornelia Huck <cohuck@xxxxxxxxxx> wrote:
On Mon, 17 May 2021 22:07:47 +0200
Claudio Imbrenda <imbrenda@xxxxxxxxxxxxx> wrote:
Previously, when a protected VM was rebooted or when it was shut
down, its memory was made unprotected, and then the protected VM
itself was destroyed. Looping over the whole address space can
take some time, considering the overhead of the various
Ultravisor Calls (UVCs). This means that a reboot or a shutdown
would take a potentially long amount of time, depending on the
amount of used memory.

This patchseries implements a deferred destroy mechanism for
protected guests. When a protected guest is destroyed, its
memory is cleared in background, allowing the guest to restart
or terminate significantly faster than before.

There are 2 possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)

For the reboot case, the reference count of the mm is increased,
and then a background thread is started to clean up. Once the
thread went through the whole address space, the protected VM is
actually destroyed.

For the shutdown case, a list of pages to be destroyed is formed
when the mm is torn down. Instead of just unmapping the pages
when the address space is being torn down, they are also set
aside. Later when KVM cleans up the VM, a thread is started to
clean up the pages from the list.

Just to make sure, 'clean up' includes doing uv calls?

yes

This means that the same address space can have memory belonging
to more than one protected guest, although only one will be
running, the others will in fact not even have any CPUs.

Are those set-aside-but-not-yet-cleaned-up pages still possibly
accessible in any way? I would assume that they only belong to
the

in case of reboot: yes, they are still in the address space of the
guest, and can be swapped if needed
'zombie' guests, and any new or rebooted guest is a new entity
that needs to get new pages?

the rebooted guest (normal or secure) will re-use the same pages
of the old guest (before or after cleanup, which is the reason of
patches 3 and 4)

the KVM guest is not affected in case of reboot, so the userspace
address space is not touched.
Can too many not-yet-cleaned-up pages lead to a (temporary)
memory exhaustion?

in case of reboot, not much; the pages were in use are still in
use after the reboot, and they can be swapped.

in case of a shutdown, yes, because the pages are really taken
aside and cleared/destroyed in background. they cannot be
swapped. they are freed immediately as they are processed, to try
to mitigate memory exhaustion scenarios.

in the end, this patchseries is a tradeoff between speed and
memory consumption. the memory needs to be cleared up at some
point, and that requires time.

in cases where this might be an issue, I introduced a new KVM flag
to disable lazy destroy (patch 10)

Maybe we could piggy-back on the OOM-kill notifier and then fall
back to synchronous freeing for some pages?

I'm not sure I follow

once the pages have been set aside, it's too late

while the pages are being set aside, every now and then some memory
needs to be allocated. the allocation is atomic, not allowed to use
emergency reserves, and can fail without warning. if the allocation
fails, we clean up one page and continue, without setting aside
anything (patch 9)

so if the system is low on memory, the lazy destroy should not make
the situation too much worse.

the only issue here is starting a normal process in the host (maybe
a non secure guest) that uses a lot of memory very quickly, right
after a large secure guest has terminated.

I think page cache page allocations do not need to be atomic.
In that case the kernel might stil l decide to trigger the oom
killer. We can let it notify ourselves free 256 pages synchronously
and avoid the oom kill. Have a look at the virtio-balloon
virtio_balloon_oom_notify

the issue is that once the pages have been set aside, it's too late.
the OOM notifier would only be useful if we get notified of the OOM
situation _while_ setting aside the pages.

unless you mean that the notifier should simply wait until the thread
has done (some of) its work?

Exactly. Let the notifier wait until you have freed 256pages and return
256 to the oom notifier.