Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

From: David Hildenbrand
Date: Fri Apr 05 2024 - 02:54:05 EST


On 05.04.24 00:02, Sean Christopherson wrote:
On Thu, Apr 04, 2024, David Hildenbrand wrote:
On 04.04.24 19:31, Sean Christopherson wrote:
On Thu, Apr 04, 2024, David Hildenbrand wrote:
On 04.04.24 00:19, Sean Christopherson wrote:
Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
to be invalidated before consuming dirty status. Isn't the end result essentially
a sane FOLL_TOUCH?

Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.

Having something that makes sure the writable PTE/PMD is dirty (or
alternatively sets it dirty), paired with MMU notifiers notifying on any
mkclean would be one option that would leave handling how to handle dirtying
of folios completely to the core. It would behave just like a CPU writing to
the page table, which would set the pte dirty.

Of course, if frequent clearing of the dirty PTE/PMD bit would be a problem
(like we discussed for the accessed bit), that would not be an option. But
from what I recall, only clearing the PTE/PMD dirty bit is rather rare.

And AFAICT, all cases already invalidate secondary MMUs anyways, so if anything
it would probably be a net positive, e.g. the notification could more precisely
say that SPTEs need to be read-only, not blasted away completely.

As discussed, I think at least madvise_free_pte_range() wouldn't do that.

I'm getting a bit turned around. Are you talking about what madvise_free_pte_range()
would do in this future world, or what madvise_free_pte_range() does today? Because
today, unless I'm really misreading the code, secondary MMUs are invalidated before
the dirty bit is cleared.

Likely I missed it, sorry! I was talking about the possible future. :)


mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
range.start, range.end);

lru_add_drain();
tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);

mmu_notifier_invalidate_range_start(&range);
tlb_start_vma(&tlb, vma);
walk_page_range(vma->vm_mm, range.start, range.end,
&madvise_free_walk_ops, &tlb);
tlb_end_vma(&tlb, vma);
mmu_notifier_invalidate_range_end(&range);


Indeed, we do setup the MMU notifier invalidation. We do the start/end .. I was looking for PTE notifications.

I spotted the set_pte_at(), not a set_pte_at_notify() like we do in other code. Maybe that's not required here (digging through documentation I'm still left clueless).

"
set_pte_at_notify() sets the pte _after_ running the notifier. This is safe to start by updating the secondary MMUs, because the primary MMU pte invalidate must have already happened with a ptep_clear_flush() before set_pte_at_notify() has been invoked. ...
"

Absolutely unclear to me when we *must* to use it, or if it is. Likely its a pure optimization when we're *changing* a PTE.

KSM and wp_page_copy() uses it with MMU_NOTIFY_CLEAR, when replacing the mapped page by another page (or write-protecting an existing one). __replace_page() uses it with __replace_page() when replacing the mapped page. And migrate_vma_insert_page() uses it with MMU_NOTIFY_MIGRATE.

Other code (e.g., khugepaged) doesn't seem to use it as well.

.. so I guess it's fine? Confusing.

--
Cheers,

David / dhildenb