Based on latest mm-unstable (9ed079378408).
This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, but majorly covering safe use of huge_pte_offset.
Comparing to previous rfcv2, the major change is I dropped the new pgtable
lock but only use vma lock for locking. The major reason is I overlooked
that the pgtable lock was not protected by RCU: __pmd_free_tlb() frees the
pgtable lock before e.g. call_rcu() for RCU_TABLE_FREE archs. OTOH many of
the huge_pte_offset() call sites do need to take pgtable lock. It means
the valid users for the new RCU lock will be very limited.
It's possible that in the future we can rework the pgtable free to only
free the pgtable lock after RCU grace period (move pgtable_pmd_page_dtor()
to be within tlb_remove_table_rcu()), then the RCU lock will make more
sense. For now, make it simple by fixing the races first.
Since this version attached a reproducer (below) and also removed the RCU
(especially, the fallback irqoff) solution, removing RFC tag.
Old versions:
rfcv1: https://lore.kernel.org/r/20221030212929.335473-1-peterx@xxxxxxxxxx
rfcv2: https://lore.kernel.org/r/20221118011025.2178986-1-peterx@xxxxxxxxxx
Problem
=======
huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.
huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).
The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.
There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.
One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.
A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):