Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

From: David Hildenbrand
Date: Wed Nov 30 2022 - 04:47:33 EST


On 29.11.22 20:35, Peter Xu wrote:
> Based on latest mm-unstable (9ed079378408).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, but mainly covering safe use of huge_pte_offset().
>
> Compared to the previous rfcv2, the major change is that I dropped the new
> pgtable lock and only use the vma lock for locking. The major reason is
> that I overlooked that the pgtable lock is not protected by RCU:
> __pmd_free_tlb() frees the pgtable lock before e.g. call_rcu() for
> RCU_TABLE_FREE archs. OTOH, many of the huge_pte_offset() call sites do
> need to take the pgtable lock. It means the valid users of the new RCU
> lock would be very limited.

Thanks.
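
To spell out the ordering concern (a rough sketch of my understanding, not
the literal kernel code):

/*
 * __pmd_free_tlb()-style teardown on RCU_TABLE_FREE configs:
 *
 *   pgtable_pmd_page_dtor(page);   // split pgtable lock freed immediately
 *   tlb_remove_table(tlb, page);   // page itself only freed later, after
 *                                  // the RCU grace period
 *
 * So an RCU read-side section would keep the pgtable *page* alive, but not
 * the pgtable lock that most huge_pte_offset() callers need to take right
 * afterwards.
 */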


> It's possible that in the future we can rework the pgtable freeing to only
> free the pgtable lock after the RCU grace period (move
> pgtable_pmd_page_dtor() to be within tlb_remove_table_rcu()); then the RCU
> lock will make more sense. For now, make it simple by fixing the races
> first.

Good.


> Since this version attaches a reproducer (below) and also removes the RCU
> (especially, the fallback irqoff) solution, the RFC tag is dropped.

Very nice, thanks.


> Old versions:
>
> rfcv1: https://lore.kernel.org/r/20221030212929.335473-1-peterx@xxxxxxxxxx
> rfcv2: https://lore.kernel.org/r/20221118011025.2178986-1-peterx@xxxxxxxxxx
>
> Problem
> =======
>
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable. It's used pretty much everywhere, since it is needed even
> before taking the pgtable lock.
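
For illustration, a typical caller looks roughly like the sketch below
(minimal and simplified; the function name and h/mm/addr are placeholders
rather than any specific call site). Note that nothing in it pins the
pgtable page itself yet:

static void example_hugetlb_walk(struct hstate *h, struct mm_struct *mm,
				 unsigned long addr)
{
	pte_t *ptep;
	spinlock_t *ptl;

	/* Walk to the hugetlb entry; no pgtable lock is held at this point. */
	ptep = huge_pte_offset(mm, addr, huge_page_size(h));
	if (!ptep)
		return;

	/* Only now can the pgtable lock be taken and the entry be used. */
	ptl = huge_pte_lock(h, mm, ptep);
	/* ... read or update the entry ... */
	spin_unlock(ptl);
}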

> huge_pte_offset() is always called with the mmap lock held, either for read
> or for write. It was assumed to be safe, but it actually isn't. One race
> condition can easily be triggered by: (1) first triggering pmd sharing on a
> memory range, (2) doing huge_pte_offset() on that range, while in the
> meantime (3) another thread unshares the pmd range; the pgtable page is
> then prone to being lost if the other sharing process frees it completely
> (by either munmap or exiting the mm).

So, just so that I understand correctly:

Two processes, #A and #B, share a page table. Process #A runs two threads, #A1 and #A2.

#A1 walks that shared page table (using huge_pte_offset()), for example, to resolve a page fault. Concurrently, #A2 triggers unsharing of that page table (replacing it with a private page table), for example, using munmap().

So #A1 will eventually read/write the shared page table while we're placing a private page table. That would be fine (assuming no unsharing is required by #A1); however, if #B also concurrently drops its reference to the shared page table, the shared page table could essentially get freed while #A1 is still walking it.

I suspect, looking at the reproducer, that the page table destructor was called. Will the page table also actually get freed already? IOW, could #A1 be reading/writing a freed page?
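
My reading of the interleaving, roughly (not a literal trace from the
reproducer):

/*
 *  #A1 (pgtable walker)       #A2 (unshare)              #B (other process)
 *  --------------------       -------------              ------------------
 *  huge_pte_offset() returns
 *  a pointer into the shared
 *  PMD page
 *                             huge_pmd_unshare():
 *                             clears the PUD entry and
 *                             drops #A's reference on
 *                             the shared PMD page
 *                                                        munmap()/exit_mmap():
 *                                                        last reference gone,
 *                                                        shared PMD page freed
 *  dereferences (or tries to
 *  lock) the now-freed PMD
 *  page: use-after-free
 */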



> The recent work from Mike on the vma lock already resolves most of this.
> It does so by forbidding pmd unsharing while the lock is held, so there is
> no further risk of the pgtable page being freed. It means that if we can
> take the vma lock around all huge_pte_offset() callers, it'll be safe.

Agreed.
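
For reference, the pattern this converges on looks roughly like the sketch
below (illustrative only; the function name is made up, while
hugetlb_vma_lock_read()/hugetlb_vma_unlock_read() are the helpers from the
vma lock series):

static void example_locked_walk(struct vm_area_struct *vma, unsigned long addr)
{
	struct hstate *h = hstate_vma(vma);
	struct mm_struct *mm = vma->vm_mm;
	pte_t *ptep;
	spinlock_t *ptl;

	/* Forbids pmd unsharing, so the pgtable page cannot go away. */
	hugetlb_vma_lock_read(vma);

	ptep = huge_pte_offset(mm, addr, huge_page_size(h));
	if (ptep) {
		ptl = huge_pte_lock(h, mm, ptep);
		/* ... the entry and its pgtable page are stable here ... */
		spin_unlock(ptl);
	}

	hugetlb_vma_unlock_read(vma);
}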


> There are already a bunch of callers that we converted as per the latest
> mm-unstable, but also quite a few others that we didn't, for various
> reasons, especially around huge_pte_offset() usage.
>
> One more thing to mention is that, besides the vma lock, i_mmap_rwsem can
> also be used to protect the pgtable page (along with its pgtable lock) from
> being freed from under us. IOW, huge_pte_offset() callers need to hold
> either the vma lock or i_mmap_rwsem to safely walk the pgtables.
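
The i_mmap_rwsem variant would mirror the sketch above, roughly (again
illustrative only, for a file-backed hugetlb vma):

	struct address_space *mapping = vma->vm_file->f_mapping;

	i_mmap_lock_read(mapping);
	ptep = huge_pte_offset(mm, addr, huge_page_size(h));
	/* ... take the pgtable lock, use the entry ... */
	i_mmap_unlock_read(mapping);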

> A reproducer of this problem, based on hugetlb GUP (NOTE: since the race is
> very hard to trigger, one also needs to apply an extra kernel delay patch,
> see below):

Thanks,

David / dhildenb