On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
> > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
> > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
> > the subscriber (e.g. KVM) of the mmu notifier can know that an invalidation
> > event is sent for NUMA migration purpose in specific.
> >
> > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
> > MMU to avoid NUMA protection introduced page faults and restoration of old
> > huge PMDs/PTEs in primary MMU.
> >
> > Patch 3 introduces a new mmu notifier callback .numa_protect(), which
> > will be called in patch 4 when a page is ensured to be PROT_NONE protected.
> >
> > Then in patch 5, KVM can recognize a .invalidate_range_start() notification
> > is for NUMA balancing specific and do not do the page unmap in secondary
> > MMU until .numa_protect() comes.
>
> Why do we need all that, when we should simply not be applying PROT_NONE to
> pinned pages?
>
> In change_pte_range() we already have:
>
> if (is_cow_mapping(vma->vm_flags) &&
>     page_count(page) != 1)
>
> Which includes both, shared and pinned pages.

Ah, right. Currently, on my side, I don't see any pinned pages that fall
outside of this condition.
But I have a question regarding is_cow_mapping(vma->vm_flags): do we
need to allow pinned pages in !is_cow_mapping(vma->vm_flags) mappings?
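
To make the question concrete, here is a minimal user-space sketch of the
quoted condition. It is not kernel code: the toy_* names and the TOY_VM_*
flag values are assumptions made up for this illustration, and
toy_is_cow_mapping() only models "private mapping that may become writable".
It shows that a page with an elevated count in a shared mapping is not
skipped by this check.

```c
#include <stdbool.h>

/* Hypothetical flag values, used only by this sketch. */
#define TOY_VM_SHARED   0x08UL
#define TOY_VM_MAYWRITE 0x20UL

/* Models a CoW mapping: private (not shared) and possibly writable. */
static bool toy_is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (TOY_VM_SHARED | TOY_VM_MAYWRITE)) == TOY_VM_MAYWRITE;
}

/* True when the quoted condition would skip PROT_NONE for this page. */
static bool toy_skips_prot_none(unsigned long vm_flags, int page_count)
{
	return toy_is_cow_mapping(vm_flags) && page_count != 1;
}
```

With these definitions, toy_skips_prot_none(TOY_VM_MAYWRITE, 2) is true,
while toy_skips_prot_none(TOY_VM_SHARED | TOY_VM_MAYWRITE, 2) is false,
which is the !is_cow_mapping() case I am asking about.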

> Staring at patch #2, are we still missing something similar for THPs?

Yes.

> Why is that MMU notifier thingy and touching KVM code required?

Because the NUMA balancing code first sends .invalidate_range_start() with
event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
unconditionally, before it goes down into change_pte_range() and
change_huge_pmd() to check each page's count and apply PROT_NONE.
The current KVM then unmaps all notified pages from the secondary MMU
in .invalidate_range_start(), which could include pages that are ultimately
not set to PROT_NONE in the primary MMU.

For VMs with pass-through devices, even though all guest pages are pinned,
KVM still periodically unmaps pages in response to the
.invalidate_range_start() notification from auto NUMA balancing, which
is a waste.
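
The ordering problem can be sketched with a small user-space model. This is
only an illustration under simplifying assumptions: struct toy_page and the
toy_* functions are hypothetical and do not exist in the kernel, every page
starts out mapped in the secondary MMU, and the per-page check is reduced to
the is_cow_mapping()/page_count() condition quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state for this sketch. */
struct toy_page {
	int  count;            /* != 1 models a shared or pinned page */
	bool prot_none;        /* state in the primary MMU */
	bool secondary_mapped; /* state in the secondary MMU (e.g. KVM) */
};

/* Models the range-wide notification: the subscriber unmaps every page. */
static void toy_invalidate_range_start(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].secondary_mapped = false;
}

/*
 * Models one NUMA protection pass over a range. Returns how many pages
 * were unmapped from the secondary MMU even though the primary MMU left
 * them untouched.
 */
static size_t toy_numa_protect_range(struct toy_page *pages, size_t n, bool cow)
{
	size_t wasted = 0;

	/* The notification is sent first, before any page is inspected. */
	toy_invalidate_range_start(pages, n);

	for (size_t i = 0; i < n; i++) {
		if (cow && pages[i].count != 1) {
			wasted++;	/* skipped here, but already unmapped above */
			continue;
		}
		pages[i].prot_none = true;
	}
	return wasted;
}
```

In this model, a range where every page has count != 1 (as with a fully
pinned guest) returns wasted == n: all pages are zapped from the secondary
MMU and none of them is set to PROT_NONE.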