Re: [PATCH] xen: core dom0 support

From: Jeremy Fitzhardinge
Date: Sat Mar 07 2009 - 04:06:44 EST


Ingo Molnar wrote:
Have i missed a mail of yours perhaps? I dont have any track of you having posted mmap-perf perfcounters results. I grepped my mbox and the last mail i saw from you containing the string "mmap-perf" is from January 20, and it only includes my numbers.


Yes, I think you must have missed a mail. I've attached it for reference, along with a more complete set of measurements I made regarding the series of patches applied (series ending at 1f4f931501e9270c156d05ee76b7b872de486304) to improve pvops performance.

My results showed a dramatic drop in cache references (from about 300% pvop vs non-pvop, down to 125% with the full set of patches applied), but it didn't seem to make much of an effect on the overall wallclock time. I'm a bit sceptical of the numbers here because, while each run's passes are fairly consistent, booting and remeasuring seemed to cause larger variations than we're looking at. It would be easy to handwave it away with "cache effects", but its not very satisfying.

I also didn't find the measurements very convincing because the number of CPU cycles and instructions executed count is effectively unchanged (ie, the baseline non-pvops vs original pvops apparently execute exactly the same number of instructions, but we know that there's a lot more going on), and with no change as each added patch definitely removes some amount of pvops overhead in terms of instructions in the instruction stream. Is it just measuring usermode stats? I ran it as root, with the command line you suggested ("./perfstat -e -5,-4,-3,0,1,2,3 ./mmap-perf 1"). Cache misses wandered up and down in a fairly non-intuitive way as well.

I'll do a rerun comparing current tip.git pvops vs non-pvops to see if I can get some better results.

J

Attachment: pvops-mmap-measurements.ods
Description: application/vnd.oasis.opendocument.spreadsheet

--- Begin Message --- Ingo Molnar wrote:
ping?

This is a very serious paravirt_ops slowdown affecting the native kernel's performance to the tune of 5-10% in certain workloads.

It's been about 2 years ago that paravirt_ops went upstream, when you told us that something like this would never happen, that paravirt_ops is designed so flexibly that it will never hinder the native kernel - and if it does it will be easy to fix it. Now is the time to fulfill that promise.

I couldn't exactly reproduce your results, but I guess they're similar in shape. Comparing 2.6.29-rc2-nopv with -pvops, I saw this ratio (pass 1-5). Interestingly I'm seeing identical instruction counts for pvops vs non-pvops, and a lower cycle count. The cache references are way up and the miss rate is up a bit, which I guess is the source of the slowdown.

With the attached patch, I get a clear improvement; it replaces the do-nothing pte_val/make_pte functions with inlined movs to move the argument to return, overpatching the 6-byte indirect call (on i386 it would just be all nopped out). CPU cycles and cache misses are way down, and the tick count is down from ~5% worse to ~2%. But the cache reference rate is even higher, which really doesn't make sense to me. But the patch is a clear improvement, and its hard to see how it could make anything worse (its always going to replace an indirect call with simple inlined code).

(Full numbers in spreadsheet.)

I have a couple of other patches to reduce the register pressure of the pvops calls, but I'm trying to work out how to make sure its not all to complex and/or fragile.

J

Attachment: pvops-mmap-measurements.ods
Description: application/vnd.oasis.opendocument.spreadsheet

Subject: x86/pvops: add a paravirt_indent functions to allow special patching

Several paravirt ops implementations simply return their arguments,
the most obvious being the make_pte/pte_val class of operations on
native.

On 32-bit, the identity function is literally a no-op, as the calling
convention uses the same registers for the first argument and return.
On 64-bit, it can be implemented with a single "mov".

This patch adds special identity functions for 32 and 64 bit argument,
and machinery to recognize them and replace them with either nops or a
mov as appropriate.

At the moment, the only users for the identity functions are the
pagetable entry conversion functions.

The result is a measureable improvement on pagetable-heavy benchmarks
(2-3%, reducing the pvops overhead from 5 to 2%).

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
---
arch/x86/include/asm/paravirt.h | 5 ++
arch/x86/kernel/paravirt.c | 75 ++++++++++++++++++++++++++++++-----
arch/x86/kernel/paravirt_patch_32.c | 12 +++++
arch/x86/kernel/paravirt_patch_64.c | 15 +++++++
4 files changed, 98 insertions(+), 9 deletions(-)

===================================================================
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -390,6 +390,8 @@
asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")

unsigned paravirt_patch_nop(void);
+unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len);
+unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len);
unsigned paravirt_patch_ignore(unsigned len);
unsigned paravirt_patch_call(void *insnbuf,
const void *target, u16 tgt_clobbers,
@@ -1378,6 +1380,9 @@
}

void _paravirt_nop(void);
+u32 _paravirt_ident_32(u32);
+u64 _paravirt_ident_64(u64);
+
#define paravirt_nop ((void *)_paravirt_nop)

void paravirt_use_bytelocks(void);
===================================================================
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -44,6 +44,17 @@
{
}

+/* identity function, which can be inlined */
+u32 _paravirt_ident_32(u32 x)
+{
+ return x;
+}
+
+u64 _paravirt_ident_64(u64 x)
+{
+ return x;
+}
+
static void __init default_banner(void)
{
printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
@@ -138,9 +149,16 @@
if (opfunc == NULL)
/* If there's no function, patch it with a ud2a (BUG) */
ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a));
- else if (opfunc == paravirt_nop)
+ else if (opfunc == _paravirt_nop)
/* If the operation is a nop, then nop the callsite */
ret = paravirt_patch_nop();
+
+ /* identity functions just return their single argument */
+ else if (opfunc == _paravirt_ident_32)
+ ret = paravirt_patch_ident_32(insnbuf, len);
+ else if (opfunc == _paravirt_ident_64)
+ ret = paravirt_patch_ident_64(insnbuf, len);
+
else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
@@ -373,6 +391,45 @@
#endif
};

+typedef pte_t make_pte_t(pteval_t);
+typedef pmd_t make_pmd_t(pmdval_t);
+typedef pud_t make_pud_t(pudval_t);
+typedef pgd_t make_pgd_t(pgdval_t);
+
+typedef pteval_t pte_val_t(pte_t);
+typedef pmdval_t pmd_val_t(pmd_t);
+typedef pudval_t pud_val_t(pud_t);
+typedef pgdval_t pgd_val_t(pgd_t);
+
+
+#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE)
+/* 32-bit pagetable entries */
+#define paravirt_native_make_pte (make_pte_t *)_paravirt_ident_32
+#define paravirt_native_pte_val (pte_val_t *)_paravirt_ident_32
+
+#define paravirt_native_make_pmd (make_pmd_t *)_paravirt_ident_32
+#define paravirt_native_pmd_val (pmd_val_t *)_paravirt_ident_32
+
+#define paravirt_native_make_pud (make_pud_t *)_paravirt_ident_32
+#define paravirt_native_pud_val (pud_val_t *)_paravirt_ident_32
+
+#define paravirt_native_make_pgd (make_pgd_t *)_paravirt_ident_32
+#define paravirt_native_pgd_val (pgd_val_t *)_paravirt_ident_32
+#else
+/* 64-bit pagetable entries */
+#define paravirt_native_make_pte (make_pte_t *)_paravirt_ident_64
+#define paravirt_native_pte_val (pte_val_t *)_paravirt_ident_64
+
+#define paravirt_native_make_pmd (make_pmd_t *)_paravirt_ident_64
+#define paravirt_native_pmd_val (pmd_val_t *)_paravirt_ident_64
+
+#define paravirt_native_make_pud (make_pud_t *)_paravirt_ident_64
+#define paravirt_native_pud_val (pud_val_t *)_paravirt_ident_64
+
+#define paravirt_native_make_pgd (make_pgd_t *)_paravirt_ident_64
+#define paravirt_native_pgd_val (pgd_val_t *)_paravirt_ident_64
+#endif
+
struct pv_mmu_ops pv_mmu_ops = {
#ifndef CONFIG_X86_64
.pagetable_setup_start = native_pagetable_setup_start,
@@ -424,21 +481,21 @@
.pmd_clear = native_pmd_clear,
#endif
.set_pud = native_set_pud,
- .pmd_val = native_pmd_val,
- .make_pmd = native_make_pmd,
+ .pmd_val = paravirt_native_pmd_val,
+ .make_pmd = paravirt_native_make_pmd,

#if PAGETABLE_LEVELS == 4
- .pud_val = native_pud_val,
- .make_pud = native_make_pud,
+ .pud_val = paravirt_native_pud_val,
+ .make_pud = paravirt_native_make_pud,
.set_pgd = native_set_pgd,
#endif
#endif /* PAGETABLE_LEVELS >= 3 */

- .pte_val = native_pte_val,
- .pgd_val = native_pgd_val,
+ .pte_val = paravirt_native_pte_val,
+ .pgd_val = paravirt_native_pgd_val,

- .make_pte = native_make_pte,
- .make_pgd = native_make_pgd,
+ .make_pte = paravirt_native_make_pte,
+ .make_pgd = paravirt_native_make_pgd,

.dup_mmap = paravirt_nop,
.exit_mmap = paravirt_nop,
===================================================================
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -12,6 +12,18 @@
DEF_NATIVE(pv_cpu_ops, clts, "clts");
DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");

+unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
+{
+ /* arg in %eax, return in %eax */
+ return 0;
+}
+
+unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
+{
+ /* arg in %edx:%eax, return in %edx:%eax */
+ return 0;
+}
+
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
unsigned long addr, unsigned len)
{
===================================================================
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -19,6 +19,21 @@
DEF_NATIVE(pv_cpu_ops, usergs_sysret32, "swapgs; sysretl");
DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");

+DEF_NATIVE(, mov32, "mov %edi, %eax");
+DEF_NATIVE(, mov64, "mov %rdi, %rax");
+
+unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
+{
+ return paravirt_patch_insns(insnbuf, len,
+ start__mov32, end__mov32);
+}
+
+unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
+{
+ return paravirt_patch_insns(insnbuf, len,
+ start__mov64, end__mov64);
+}
+
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
unsigned long addr, unsigned len)
{

--- End Message ---