Re: [QUESTION] mmap lock in khugepaged

From: Dev Jain
Date: Fri Nov 29 2024 - 02:10:27 EST

Next message: Dave Young: "Re: [PATCH v2 2/2] x86/efi: Apply EFI Memory Attributes after kexec"
Previous message: Florian Westphal: "Re: [PATCH] netfilter: uapi: Fix file names for case-insensitive filesystem."
Next in thread: David Hildenbrand: "Re: [QUESTION] mmap lock in khugepaged"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

+the list and Zi Yan.

On 28/11/24 5:04 pm, David Hildenbrand wrote:

Maybe, we'd have to do the isolation+copy under the PMD lock. And
currently, we have to drop the PMD lock in order to have the
pte_offset_map_lock() work IIRC.

Is there a problem in holding two page table locks simultaneously?

Depending on CONFIG_SPLIT_PTE_PTLOCKS, it might not even be two locks (I assume we could have such configs with khugepaged).

Not sure if there could be an issue with lock inversion.

So I suspect this no not be 100% trivial :)

Most importantly, the copy that currently runs under no spinlocks
would now run under spinlock. Up to 512 MiB on arm64 64K, not sure if
that can be a problem ... we currently seem to take care of that

But we already are taking mmap_write_lock(), so that should not matter?

We are dealing with a spinlock vs. a rwsem.

We usually want to avoid holding spinlocks for an excessive amount of time, because all other CPUs waiting for that lock will ... spin with preemption disabled instead of rescheduling and doing something useful.

Ah okay.

Further, without CONFIG_SPLIT_PMD_PTLOCKS, in fact everybody who wnats to take a PMD lock in that process would be spinning on the same PMD lock :)

I mean, if we can get rid of the mmap exclusive lock, then the copying
would still be a bottleneck, and all fault handlers will back off, but

I'm trying to digest it once again, but I'm afraid I don't see how fault handlers will back off.

Won't they either see pmd_none(), to then either call pte_alloc_one() where they would spin on the PMD lock, or try allocating a PMD THP to insert it, and then spin on the PMD lock, to figure out later that it was all in vain?

Ya that's what I meant when I said "back off", sorry I wasn't clear, I didn't mean VM_FAULT_FALLBACK/RETRY or something...yes, the work will be in vain and the MMU will refault...

Thinking about it, I am also not sure if most other code can deal with temporary pmd_none(). These don't necessarily take the PMD lock, because "why would they" right now.

See walk_pmd_range() as one example, if it spots pmd_none() it assumes "there really is nothing" without checking the PMD lock.

As a more concrete example, assume someone calls MADV_DONTNEED and we end up in zap_pmd_range(). If we assume "pmd_none() == really nothing" we'll skip that entry without getting the PMD lock involved. That would mean that you would be breaking MADV_DONTNEED if you managed to collapse or not -- memory would not get discarded.

This is a real problem with anonymous memory.

Yes, I thought of different locking fashions but the problem seems to be that any pagetable walker will choose an action path according to the value
it sees.

Unless I am missing something it's all very tricky and there might be a lot of such code that assumes "if I hold a mmap lock / VMA lock in read mode, pmd_none() means there is nothing even without holding the PMD lock when checking".

Yes, I was betting on the fact that, if the code assumes that pmd_none() means there is nothing, eventually when it will take the PMD lock to write to it, it will check whether
the PMD changed, and back off. I wasn't aware of the MADV_DONTNEED thingy, thanks.

at least processes will be able to mmap() and do stuff with their VMAs,
and I would guess that this is worth optimizing...

It would certainly be interesting to get rid of the mmap lock in write mode here, but it's all rater tricky (and the code has rather nasty hidden implications).

pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
...
spun_unlock(pte);
} ...

result = __collapse_huge_page_copy(...);
pte_unmap(pte);

Deep in __collapse_huge_page_copy() we seem to re-rake the PTL lock.
No-split-spinlock confiogs might be problematic ...

Could you elaborate a little? I haven't read about the older config...

See above regarding CONFIG_SPLIT_PTE_PTLOCKS and friends.

I recall that for shmem that's "easier", because we don't have to
reinstall a PMD immediately, we cna be sure that the page table is
kept empty/unmodified, ...

All in all, the general solution seems to be that, if I can take all pagetable walkers into an invalid state and make them backoff, then I am safe.
For example, we do not zero out the PMD, we take the pte PTL, we do stuff, then instead of setting the PTEs to zero, we set it to a universal
invalid state upon which no pagetable walker can take an action; an instance of that can be to set the PTE to a swap entry such that the fault
handler ends up in do_swap_page() ->print_bad_pte(). So now we can take the PMD lock (in fact we don't need it since any pagetable walking
is rendered useless) and populate the PMD to resume the new pagetable walking. Another *ridiculous* idea may be to remember the PGD we
came from and nuke it (but I guess there is an alternate path for that in walk_pgd_range() and so on?)

Next message: Dave Young: "Re: [PATCH v2 2/2] x86/efi: Apply EFI Memory Attributes after kexec"
Previous message: Florian Westphal: "Re: [PATCH] netfilter: uapi: Fix file names for case-insensitive filesystem."
Next in thread: David Hildenbrand: "Re: [QUESTION] mmap lock in khugepaged"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]