On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
+ * If @mapping allows faulting of huge pmds and puds, it is desirable
+ * that its huge_fault() handler blocks while this function is running on
+ * @mapping. Otherwise a race may occur where the huge entry is split when
+ * it was intended to be handled in a huge entry callback. This requires an
+ * external lock, for example that @mapping->i_mmap_rwsem is held in
+ * write mode in the huge_fault() handlers.

Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
on read) to split PMD entry into PTE table. And it can happen not only
from fault path.

If you care about splitting compound page under you, take a pin or lock a
page. It will block split_huge_page().

Suggestion to block fault path is not viable (and it will not happen
magically just because of this comment).

I was specifically thinking of this:

https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103

If a huge pud is concurrently faulted in here, it will immediately get split
without getting processed in pud_entry(). An external lock would protect
against that, but that's perhaps a bug in the pagewalk code? For pmds the
situation is not the same since when pte_entry is used, all pmds will
unconditionally get split.

I *think* it should be fixed with something like this (there's no
pud_trans_unstable() yet):

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..221a3b945f42 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -102,10 +102,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 					break;
 				continue;
 			}
+		} else {
+			split_huge_pud(walk->vma, pud, addr);
 		}
 
-		split_huge_pud(walk->vma, pud, addr);
-		if (pud_none(*pud))
+		if (pud_none(*pud) || pud_trans_unstable(*pud))
 			goto again;
 
 		if (ops->pmd_entry || ops->pte_entry)

Or better yet converted to what we do on pmd level.

Honestly, all the code around PUD THP is missing a lot of ground work.
Rushing it upstream for DAX was not a right move.

There's a similar more scary race in

https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931

It looks like if a concurrent thread faults in a huge pud just after the
test for pud_none in that pmd_alloc, things might go pretty bad.

Hm? It will fail the next pmd_none() check under ptl. Do you have a
particular racing scenario?

Yes, I misinterpreted the code somewhat, but here's the scenario that looks racy: