Re: [RFC] Question about TLB flush while set Stage-2 huge pages

From: Marc Zyngier
Date: Tue Mar 12 2019 - 07:33:00 EST


Hi Zheng,

On 11/03/2019 16:31, Zheng Xiang wrote:
> Hi all,
>
> While small pages are merged into a transparent huge page, KVM invalidates
> Stage-2 for the base address of the huge page and the whole of Stage-1.
> However, this only invalidates the TLB entry for the first page within the
> huge page; the entries for the other pages are not invalidated, see below:
>
> +------------------------------+
> | a b c d e          2MB page  |
> +------------------------------+
>
> TLB before setting new pmd:
> +---------------+--------------+
> | VA | PAGESIZE |
> +---------------+--------------+
> | a | 4KB |
> +---------------+--------------+
> | b | 4KB |
> +---------------+--------------+
> | c | 4KB |
> +---------------+--------------+
> | d | 4KB |
> +---------------+--------------+
>
> TLB after setting new pmd:
> +---------------+--------------+
> | VA | PAGESIZE |
> +---------------+--------------+
> | a | 2MB |
> +---------------+--------------+
> | b | 4KB |
> +---------------+--------------+
> | c | 4KB |
> +---------------+--------------+
> | d | 4KB |
> +---------------+--------------+
>
> When the VM accesses address *b*, it will hit in the TLB and may result in a TLB conflict abort or other potential exceptions.

That's really bad. I can only imagine two scenarios:

1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), losing
the PTE table in the process, and place the PMD instead. I can't see
this happening.

2) We fail to invalidate on unmap, and that's slightly less bad (but
still quite bad).

Which of the two cases are you seeing?
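
To make this concrete, a break-before-make sequence that only invalidates
the base IPA would look roughly like this (a sketch only, not the exact
virt/kvm/arm/mmu.c code):

	old_pmd = *pmd;
	if (pmd_present(old_pmd)) {
		pmd_clear(pmd);				/* break */
		kvm_tlb_flush_vmid_ipa(kvm, addr);	/* invalidate base IPA only */
	}
	kvm_set_pmd(pmd, *new_pmd);			/* make: install the 2MB block */

If the entry being replaced was a table rather than a block, that single
TLBI by IPA only removes the 4KB translation for the first page, leaving
the cached 4KB entries for the rest of the 2MB range behind, as in the
picture above.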

> For example, we need to keep track of the VM's dirty memory pages while the VM is in live migration.
> KVM will set the memslot READONLY and split the huge pages.
> After live migration is canceled and aborted, the pages will be merged back into THPs.
> Later accesses to these pages, which are still READONLY, will cause level-3 Permission Faults until the stale entries are invalidated.
>
> So should we invalidate the TLB entries for all related pages (e.g. a, b, c, d), as __flush_tlb_range() does?
> Or should we call __kvm_tlb_flush_vmid() to invalidate all TLB entries?

We should perform an invalidate on each unmap. unmap_stage2_range() seems
to do the right thing. __flush_tlb_range() only caters for Stage-1
mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
TLBs for the whole VM.
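
For comparison, the unmap path invalidates page by page, along these
lines (a simplified sketch of the unmap_stage2_ptes() loop, not the
exact code):

	pte = pte_offset_kernel(pmd, addr);
	do {
		if (!pte_none(*pte)) {
			kvm_set_pte(pte, __pte(0));		/* clear the PTE */
			kvm_tlb_flush_vmid_ipa(kvm, addr);	/* invalidate this IPA */
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);

so no stale 4KB entry should be left behind for any page in the range.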

I'd really like to understand what you're seeing, and how to reproduce
it. Do you have a minimal example I could run on my own HW?

Thanks,

M.
--
Jazz is not dead. It just smells funny...