Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

From: Boris Ostrovsky
Date: Wed Aug 16 2017 - 17:43:57 EST


On 08/16/2017 12:42 PM, Vitaly Kuznetsov wrote:
> Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> writes:
>
>> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
>>
>>> On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
>>>> On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>>> I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
>>>>> would make it work again), but this was some years ago and I cannot
>>>>> readily find those emails.
>>>> I think the only time we really talked about HAVE_RCU_TABLE_FREE for
>>>> x86 (at least that I was cc'd on) was not because of RCU freeing, but
>>>> because we just wanted to use the generic page table lookup code on
>>>> x86 *despite* not using RCU freeing.
>>>>
>>>> And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.
>>>>
>>>> There was only passing mention of maybe making x86 use RCU, but the
>>>> discussion was really about why the IF flag meant that x86 didn't need
>>>> to, iirc.
>>>>
>>>> I don't recall us ever discussing *really* making x86 use RCU.
>>> Google finds me this:
>>>
>>> https://lwn.net/Articles/500188/
>>>
>>> Which includes:
>>>
>>> http://www.mail-archive.com/kvm@xxxxxxxxxxxxxxx/msg72918.html
>>>
>>> which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
>>> PARAVIRT_TLB_FLUSH.
>>>
>>> But yes, this is very much virt specific nonsense, native would never
>>> need this.
>> In case we decide to go with HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
>> kernels (as it seems to be the easiest/fastest way to fix Xen PV), what do
>> you think about the required testing? Any suggestions for a specifically
>> crafted microbenchmark in addition to the standard ebizzy/kernbench/...?
> In the meantime I tested HAVE_RCU_TABLE_FREE with kernbench (the enablement
> patch I used is attached; I know it breaks other architectures) on bare
> metal with PARAVIRT enabled in the config. The results are:
>
> 6-CPU host:
>
> Average Half load -j 3 Run (std deviation):
>                     CURRENT                HAVE_RCU_TABLE_FREE
>                     =======                ===================
> Elapsed Time        400.498 (0.179679)     399.909 (0.162853)
> User Time           1098.72 (0.278536)     1097.59 (0.283894)
> System Time         100.301 (0.201629)     99.736 (0.196254)
> Percent CPU         299 (0)                299 (0)
> Context Switches    5774.1 (69.2121)       5744.4 (79.4162)
> Sleeps              87621.2 (78.1093)      87586.1 (99.7079)
>
> Average Optimal load -j 24 Run (std deviation):
>                     CURRENT                HAVE_RCU_TABLE_FREE
>                     =======                ===================
> Elapsed Time        219.03 (0.652534)      218.959 (0.598674)
> User Time           1119.51 (21.3284)      1118.81 (21.7793)
> System Time         100.499 (0.389308)     99.8335 (0.251423)
> Percent CPU         432.5 (136.974)        432.45 (136.922)
> Context Switches    81827.4 (78029.5)      81818.5 (78051)
> Sleeps              97124.8 (9822.4)       97207.9 (9955.04)
>
> 16-CPU host:
>
> Average Half load -j 8 Run (std deviation):
>                     CURRENT                HAVE_RCU_TABLE_FREE
>                     =======                ===================
> Elapsed Time        213.538 (3.7891)       212.5 (3.10939)
> User Time           1306.4 (1.83399)       1307.65 (1.01364)
> System Time         194.59 (0.864378)      195.478 (0.794588)
> Percent CPU         702.6 (13.5388)        707 (11.1131)
> Context Switches    21189.2 (1199.4)       21288.2 (552.388)
> Sleeps              89390.2 (482.325)      89677 (277.06)
>
> Average Optimal load -j 64 Run (std deviation):
>                     CURRENT                HAVE_RCU_TABLE_FREE
>                     =======                ===================
> Elapsed Time        137.866 (0.787928)     138.438 (0.218792)
> User Time           1488.92 (192.399)      1489.92 (192.135)
> System Time         234.981 (42.5806)      236.09 (42.8138)
> Percent CPU         1057.1 (373.826)       1057.1 (369.114)
> Context Switches    187514 (175324)        187358 (175060)
> Sleeps              112633 (24535.5)       111743 (23297.6)
>
> As you can see, there's no notable difference. I'll think of a
> microbenchmark though.

FWIW, I was about to send a very similar patch (but with RCU-based freeing
enabled by default only for Xen PV) and saw similar kernbench results on
both Xen PV and bare metal.
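
Regarding the "specifically crafted micro benchmark" question above: one
thing that could hammer the affected path is a multi-threaded
mmap/populate/munmap loop, roughly along these lines (my own untested
sketch; thread count and sizes are arbitrary):

/*
 * Each iteration maps an anonymous region, touches one page per 2M so that
 * PTE pages get allocated, then munmap()s the whole thing, which should
 * force the kernel to tear down and free those page tables (immediately
 * today, RCU-deferred with HAVE_RCU_TABLE_FREE).
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NTHREADS        8
#define ITERS           10000
#define MAP_SIZE        (64UL << 20)    /* 64M -> ~32 PTE pages on x86-64 */
#define STEP            (2UL << 20)     /* one touch per PTE page */

static void *worker(void *arg)
{
        unsigned long i, off;

        (void)arg;
        for (i = 0; i < ITERS; i++) {
                char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        exit(1);
                }
                for (off = 0; off < MAP_SIZE; off += STEP)
                        p[off] = 1;             /* populate -> allocate page tables */
                munmap(p, MAP_SIZE);            /* free page tables, flush TLBs */
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

Built with something like "gcc -O2 -pthread" and run with and without
HAVE_RCU_TABLE_FREE, comparing elapsed time and the TLB shootdown counts in
/proc/interrupts should show whether deferring the table free via RCU costs
anything measurable.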

>> Additionally, I see another option for us: enable 'rcu table free' at boot
>> (e.g. by moving tlb_remove_table into pv_ops and doing boot-time patching
>> for it) so that bare metal and other hypervisors are not affected by the
>> change.
> It seems there's no need for that and we can keep things simple...
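
(Just to illustrate the boot-time option being set aside here, it would
presumably boil down to something like the sketch below; the callback names
are made up for illustration, only tlb_remove_page()/tlb_remove_table() are
real interfaces.)

/* Default/native: keep the immediate free we have today. */
static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
        tlb_remove_page(tlb, table);
}

/* Xen PV (and similar) would patch this in at boot instead. */
static void xen_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
        tlb_remove_table(tlb, table);   /* RCU-deferred free */
}

/*
 * ___pte_free_tlb() and friends would then call a pv_mmu_ops hook,
 * e.g. pv_mmu_ops.tlb_remove_table(tlb, ...), instead of an #ifdef.
 */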
>
> -- Vitaly
>
> 0001-x86-enable-RCU-based-table-free-when-PARAVIRT.patch
>
>
> From daf5117706920aebe793d1239fccac2edd4d680c Mon Sep 17 00:00:00 2001
> From: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> Date: Mon, 14 Aug 2017 16:05:05 +0200
> Subject: [PATCH] x86: enable RCU based table free when PARAVIRT
>
> Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> ---
>  arch/x86/Kconfig      |  1 +
>  arch/x86/mm/pgtable.c | 16 ++++++++++++++++
>  mm/memory.c           |  5 +++++
>  3 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 781521b7cf9e..9c1666ea04c9 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -167,6 +167,7 @@ config X86
>          select HAVE_PERF_REGS
>          select HAVE_PERF_USER_STACK_DUMP
>          select HAVE_REGS_AND_STACK_ACCESS_API
> +        select HAVE_RCU_TABLE_FREE if SMP && PARAVIRT
>          select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER && STACK_VALIDATION
>          select HAVE_STACK_VALIDATION if X86_64
>          select HAVE_SYSCALL_TRACEPOINTS
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 508a708eb9a6..b6aaab9fb3b8 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -56,7 +56,11 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
>  {
>          pgtable_page_dtor(pte);
>          paravirt_release_pte(page_to_pfn(pte));
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +        tlb_remove_table(tlb, pte);
> +#else
>          tlb_remove_page(tlb, pte);
> +#endif
>  }
>  
>  #if CONFIG_PGTABLE_LEVELS > 2
> @@ -72,21 +76,33 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>          tlb->need_flush_all = 1;
>  #endif
>          pgtable_pmd_page_dtor(page);
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +        tlb_remove_table(tlb, page);
> +#else
>          tlb_remove_page(tlb, page);
> +#endif
>  }
>  
>  #if CONFIG_PGTABLE_LEVELS > 3
>  void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
>  {
>          paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +        tlb_remove_table(tlb, virt_to_page(pud));
> +#else
>          tlb_remove_page(tlb, virt_to_page(pud));
> +#endif
>  }
>  
>  #if CONFIG_PGTABLE_LEVELS > 4
>  void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
>  {
>          paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +        tlb_remove_table(tlb, virt_to_page(p4d));
> +#else
>          tlb_remove_page(tlb, virt_to_page(p4d));
> +#endif

This can probably be factored out.
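
E.g. something like the helper below (untested, the name is made up) would
keep the #ifdef in one place and let the four ___*_free_tlb() callers just
pass the appropriate struct page:

static inline void x86_tlb_remove_table(struct mmu_gather *tlb,
                                        struct page *page)
{
#ifdef CONFIG_HAVE_RCU_TABLE_FREE
        tlb_remove_table(tlb, page);    /* RCU-deferred free */
#else
        tlb_remove_page(tlb, page);     /* immediate free, as today */
#endif
}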

>  }
>  #endif /* CONFIG_PGTABLE_LEVELS > 4 */
>  #endif /* CONFIG_PGTABLE_LEVELS > 3 */
> diff --git a/mm/memory.c b/mm/memory.c
> index e158f7ac6730..18d6671b6ae2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -329,6 +329,11 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
>   * See the comment near struct mmu_table_batch.
>   */
>  
> +static void __tlb_remove_table(void *table)
> +{
> +        free_page_and_swap_cache(table);
> +}
> +

This needs to be a per-arch routine (e.g. see arch/arm64/include/asm/tlb.h).
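
For x86 that would presumably mirror the arm64 version, i.e. something like
the following in arch/x86/include/asm/tlb.h instead of the generic copy in
mm/memory.c (untested sketch; architectures that already select
HAVE_RCU_TABLE_FREE each provide their own __tlb_remove_table()):

#ifdef CONFIG_HAVE_RCU_TABLE_FREE
/* Called by the generic RCU table-free code once it is safe to free. */
static inline void __tlb_remove_table(void *table)
{
        free_page_and_swap_cache((struct page *)table);
}
#endif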

-boris