Re: [announce] [patch] KVM paravirtualization for Linux

From: Avi Kivity
Date: Sun Jan 07 2007 - 07:21:22 EST


Ingo Molnar wrote:
i'm pleased to announce the first release of paravirtualized KVM (Linux under Linux), which includes support for the hardware cr3-cache feature of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)

the patch is against 2.6.20-rc3 + KVM trunk and can be found at:

http://redhat.com/~mingo/kvm-paravirt-patches/

Some aspects of the code are still a bit ad-hoc and incomplete, but the code is stable enough in my testing and i'd like to have some feedback.

Firstly, here are some numbers:

2-task context-switch performance (in microseconds, lower is better):

native: 1.11
----------------------------------
Qemu: 61.18
KVM upstream: 53.01
KVM trunk: 6.36
KVM trunk+paravirt/cr3: 1.60

i.e. 2-task context-switch performance is faster by a factor of 4, and is now quite close to native speed!

Very impressive! The gain probably comes not only from avoiding the vmentry/vmexit, but also from avoiding the flushing of the global page tlb entries.

"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):

native: 0.25
----------------------------------
Qemu: 7.8
KVM upstream: 2.8
KVM trunk: 0.55
KVM paravirt/cr3: 0.36

almost twice as fast.

"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):

native: 0.9
----------------------------------
Qemu: 35.2
KVM upstream: 9.4
KVM trunk: 2.8
KVM paravirt/cr3: 2.2

still a 30% improvement - which isnt too bad considering that 200 tasks are context-switching in this workload and the cr3 cache in current CPUs is only 4 entries.

This is a little too good to be true. Were both runs with the same KVM_NUM_MMU_PAGES?

I'm also concerned that at this point in time the cr3 optimizations will only show an improvement in microbenchmarks. In real life workloads a context switch is usually preceded by an I/O, and with the current sorry state of kvm I/O the context switch time would be dominated by the I/O time.

the patchset does the following:

- it provides an ad-hoc paravirtualization hypercall API between a Linux guest and a Linux host. (this will be replaced with a proper
hypercall later on.)

- using the hypercall API it utilizes the "cr3 target cache" feature in Intel VMX CPUs, and extends KVM to make use of that cache. This feature allows the avoidance of expensive VM exits into hypervisor context. (The guest needs to be 'aware' and the cache has to be
shared between the guest and the hypervisor. So fully emulated OSs
wont benefit from this feature.)

- a few simpler paravirtualization changes are done for Linux guests: IO port delays do not cause a VM exit anymore, the i8259A IRQ controller code got simplified (this will be replaced with a proper, hypercall
based and host-maintained IRQ controller implementation) and TLB flushes are more efficient, because no cr3 reads happen which would otherwise cause a VM exit. These changes have a visible effect
already: they reduce qemu's CPU usage when a guest idles in HLT, by about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has HZ=1000)

Paravirtualization is triggered via the kvm_paravirt=1 boot option (for now, this too is ad-hoc) - if that is passed then the KVM guest will probe for paravirtualization availability on the hypervisor side - and will use it if found. (If the guest does not find KVM-paravirt support on the hypervisor side then it will continue as a fully emulated guest.)

Issues: i only tested this on 32-bit VMX. (64-bit should work with not too many changes, the paravirt.c bits can be carried over to 64-bit almost as-is. But i didnt want to spread the code too wide.)

Comments, suggestions are welcome!

Ingo

Some comments below on the code.

+
+/*
+ * Special, register-to-cr3 instruction based hypercall API
+ * variant to the KVM host. This utilizes the cr3 filter capability
+ * of the hardware - if this works out then no VM exit happens,
+ * if a VM exit happens then KVM will get the virtual address too.
+ */
+static void kvm_write_cr3(unsigned long guest_cr3)
+{
+ struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
+ struct kvm_cr3_cache *cache = &para_state->cr3_cache;
+ int idx;
+
+ /*
+ * Check the cache (maintained by the host) for a matching
+ * guest_cr3 => host_cr3 mapping. Use it if found:
+ */
+ for (idx = 0; idx < cache->max_idx; idx++) {
+ if (cache->entry[idx].guest_cr3 == guest_cr3) {
+ /*
+ * Cache-hit: we load the cached host-CR3 value.
+ * This never causes any VM exit. (if it does then the
+ * hypervisor could do nothing with this instruction
+ * and the guest OS would be aborted)
+ */
+ asm volatile("movl %0, %%cr3"
+ : : "r" (cache->entry[idx].host_cr3));
+ goto out;
+ }
+ }
+
+ /*
+ * Cache-miss. Load the guest-cr3 value into cr3, which will
+ * cause a VM exit to the hypervisor, which then loads the
+ * host cr3 value and updates the cr3_cache.
+ */
+ asm volatile("movl %0, %%cr3" : : "r" (guest_cr3));
+out:
+ put_cpu_var(para_state);
+}

Well, you did say it was ad-hoc. For reference, this is how I see the hypercall API:

- A virtual pci device exports a page through the pci rom interface. The page contains the hypercall code approriate for the current cpu. This allows migration to work across different cpu vendors.
- In case the pci rom is discovered too late in the boot process, the address (gpa) can also be exported via a kvm-specific msr.
- Guest/host communications is by guest physical addressed, as the virtual->physical translation is much cheaper on the guest (__pa() vs a page table walk).



+
+/*
+ * Simplified i8259A controller handling:
+ */
+static void mask_and_ack_kvm(unsigned int irq)
+{
+ unsigned int irqmask = 1 << irq;
+ unsigned long flags;
+
+ spin_lock_irqsave(&i8259A_lock, flags);
+ cached_irq_mask |= irqmask;
+
+ if (irq & 8) {
+ outb(cached_slave_mask, PIC_SLAVE_IMR);
+ outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
+ outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' to master-IRQ2 */
+ } else {
+ outb(cached_master_mask, PIC_MASTER_IMR);
+ /* 'Specific EOI' to master: */
+ outb(0x60+irq, PIC_MASTER_CMD);
+ }
+ spin_unlock_irqrestore(&i8259A_lock, flags);
+}

Any reason this can't be applied to mainline? There's probably no downside to native, and it would benefit all virtualization solutions equally.

Index: linux/drivers/kvm/kvm.h
===================================================================
--- linux.orig/drivers/kvm/kvm.h
+++ linux/drivers/kvm/kvm.h
@@ -165,7 +165,7 @@ struct kvm_mmu {
int root_level;
int shadow_root_level;
- u64 *pae_root;
+ u64 *pae_root[KVM_CR3_CACHE_SIZE];

hmm. wouldn't it be simpler to have pae_root always point at the current root?


Index: linux/drivers/kvm/mmu.c
===================================================================
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -779,7 +779,7 @@ static int nonpaging_map(struct kvm_vcpu
static void mmu_free_roots(struct kvm_vcpu *vcpu)
{
- int i;
+ int i, j;
struct kvm_mmu_page *page;
#ifdef CONFIG_X86_64
@@ -793,21 +793,40 @@ static void mmu_free_roots(struct kvm_vc
return;
}
#endif
- for (i = 0; i < 4; ++i) {
- hpa_t root = vcpu->mmu.pae_root[i];
+ /*
+ * Skip to the next cr3 filter entry and free it (if it's occupied):
+ */
+ vcpu->cr3_cache_idx++;
+ if (unlikely(vcpu->cr3_cache_idx >= vcpu->cr3_cache_limit))
+ vcpu->cr3_cache_idx = 0;
- ASSERT(VALID_PAGE(root));
- root &= PT64_BASE_ADDR_MASK;
- page = page_header(root);
- --page->root_count;
- vcpu->mmu.pae_root[i] = INVALID_PAGE;
+ j = vcpu->cr3_cache_idx;
+ /*
+ * Clear the guest-visible entry:
+ */
+ if (vcpu->para_state) {
+ vcpu->para_state->cr3_cache.entry[j].guest_cr3 = 0;
+ vcpu->para_state->cr3_cache.entry[j].host_cr3 = 0;
+ }
+ ASSERT(vcpu->mmu.pae_root[j]);
+ if (VALID_PAGE(vcpu->mmu.pae_root[j][0])) {
+ vcpu->guest_cr3_gpa[j] = INVALID_PAGE;
+ for (i = 0; i < 4; ++i) {
+ hpa_t root = vcpu->mmu.pae_root[j][i];
+
+ ASSERT(VALID_PAGE(root));
+ root &= PT64_BASE_ADDR_MASK;
+ page = page_header(root);
+ --page->root_count;
+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
+ }
}
vcpu->mmu.root_hpa = INVALID_PAGE;
}

You keep the page directories pinned here. This can be a problem if a guest frees a page directory, and then starts using it as a regular page. kvm sometimes chooses not to emulate a write to a guest page table, but instead to zap it, which is impossible when the page is freed. You need to either unpin the page when that happens, or add a hypercall to let kvm know when a page directory is freed.

There is also a minor problem that changes to the pgd aren't caught by kvm. It doesn't hurt much as this is PV and we can relax the guest/host contract a little.



static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
{
struct page *page;
- int i;
+ int i, j;
ASSERT(vcpu);
@@ -1227,17 +1261,22 @@ static int alloc_mmu_pages(struct kvm_vc
++vcpu->kvm->n_free_mmu_pages;
}
- /*
- * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
- * Therefore we need to allocate shadow page tables in the first
- * 4GB of memory, which happens to fit the DMA32 zone.
- */
- page = alloc_page(GFP_KERNEL | __GFP_DMA32);
- if (!page)
- goto error_1;
- vcpu->mmu.pae_root = page_address(page);
- for (i = 0; i < 4; ++i)
- vcpu->mmu.pae_root[i] = INVALID_PAGE;
+ for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
+ /*
+ * When emulating 32-bit mode, cr3 is only 32 bits even on
+ * x86_64. Therefore we need to allocate shadow page tables
+ * in the first 4GB of memory, which happens to fit the DMA32
+ * zone:
+ */
+ page = alloc_page(GFP_KERNEL | __GFP_DMA32);
+ if (!page)
+ goto error_1;
+
+ ASSERT(!vcpu->mmu.pae_root[j]);
+ vcpu->mmu.pae_root[j] = page_address(page);
+ for (i = 0; i < 4; ++i)
+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
+ }

Since a pae root uses just 32 bytes, you can store all cache entries in a single page. Not that it matters much.

#define ENABLE_INTERRUPTS_SYSEXIT \
Index: linux/include/linux/kvm.h
===================================================================
--- linux.orig/include/linux/kvm.h
+++ linux/include/linux/kvm.h
@@ -238,4 +238,44 @@ struct kvm_dirty_log {
#define KVM_DUMP_VCPU _IOW(KVMIO, 250, int /* vcpu_slot */)
+
+#define KVM_CR3_CACHE_SIZE 4
+
+struct kvm_cr3_cache_entry {
+ u64 guest_cr3;
+ u64 host_cr3;
+};
+
+struct kvm_cr3_cache {
+ struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
+ u32 max_idx;
+};
+
+/*
+ * Per-VCPU descriptor area shared between guest and host. Writable to
+ * both guest and host. Registered with the host by the guest when
+ * a guest acknowledges paravirtual mode.
+ */
+struct kvm_vcpu_para_state {
+ /*
+ * API version information for compatibility. If there's any support
+ * mismatch (too old host trying to execute too new guest) then
+ * the host will deny entry into paravirtual mode. Any other
+ * combination (new host + old guest and new host + new guest)
+ * is supposed to work - new host versions will support all old
+ * guest API versions.
+ */
+ u32 guest_version;
+ u32 host_version;
+ u32 size;
+ u32 __pad_00;
+
+ struct kvm_cr3_cache cr3_cache;
+
+} __attribute__ ((aligned(PAGE_SIZE)));
+
+#define KVM_PARA_API_VERSION 1
+
+#define KVM_API_MAGIC 0x87654321
+

<linux/kvm.h> is the vmm userspace interface. The guest/host interface should probably go somewhere else.

--
error compiling committee.c: too many arguments to function

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/