d) Fix the pte walker to do the right thing, then just use separate
pte walkers in your code
The fix would be these two conceptual changes:

 1) don't split if the walker asks for a pmd_entry (the walker itself
    can then decide to split, of course, but right now no walkers want it
    since there are no pmd _and_ pte walkers, because people who want that
    do the pte walk themselves)

 2) get the proper page table lock if you do walk the pte, since
    otherwise it's racy

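To make those two changes concrete, here is a toy userspace model of the walker logic. It is not kernel code and it is not the attached patch: every toy_* name is made up for the sketch, a "pmd" is just a struct with a huge flag, and the page table lock is modeled as a boolean. It only shows the control flow I have in mind: no split when a pmd_entry exists, and the pte loop always runs with the lock held.

```c
/* Toy userspace model of the walker logic; NOT kernel code.
 * Every toy_* name is hypothetical and exists only for this sketch. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTES_PER_PMD 4

struct toy_pmd {
	bool huge;			/* maps one huge page, no pte table */
	bool locked;			/* stand-in for the page table lock */
	int pte[TOY_PTES_PER_PMD];
};

struct toy_walk;

struct toy_walk_ops {
	int (*pmd_entry)(struct toy_pmd *pmd, struct toy_walk *walk);
	int (*pte_entry)(struct toy_pmd *pmd, int idx, struct toy_walk *walk);
};

struct toy_walk {
	const struct toy_walk_ops *ops;
	int splits;			/* huge pmds the walker split */
	int pmd_calls;
	int pte_calls;
	int racy_pte_calls;		/* pte_entry calls made without the lock */
};

static int toy_walk_pmd(struct toy_pmd *pmd, struct toy_walk *walk)
{
	const struct toy_walk_ops *ops = walk->ops;
	int err = 0;

	if (ops->pmd_entry) {
		err = ops->pmd_entry(pmd, walk);
		if (err)
			return err;
	}
	if (!ops->pte_entry)
		return 0;

	if (pmd->huge) {
		/* Change 1: a pmd_entry user owns the huge case; do not split. */
		if (ops->pmd_entry)
			return 0;
		pmd->huge = false;
		walk->splits++;
	}

	/* Change 2: the walker takes the lock around the pte loop. */
	assert(!pmd->locked);
	pmd->locked = true;
	for (int i = 0; i < TOY_PTES_PER_PMD && !err; i++)
		err = ops->pte_entry(pmd, i, walk);
	pmd->locked = false;
	return err;
}

/* Example callbacks: they only count calls and check the lock state. */
static int toy_count_pmd(struct toy_pmd *pmd, struct toy_walk *walk)
{
	(void)pmd;
	walk->pmd_calls++;
	return 0;
}

static int toy_count_pte(struct toy_pmd *pmd, int idx, struct toy_walk *walk)
{
	(void)idx;
	if (!pmd->locked)
		walk->racy_pte_calls++;
	walk->pte_calls++;
	return 0;
}

static const struct toy_walk_ops toy_pmd_and_pte_ops = {
	.pmd_entry = toy_count_pmd,
	.pte_entry = toy_count_pte,
};

static const struct toy_walk_ops toy_pte_only_ops = {
	.pte_entry = toy_count_pte,
};
```

With toy_pmd_and_pte_ops a huge pmd stays huge and only the pmd callback runs; with toy_pte_only_ops the walker splits it and calls the pte callback for each slot with the lock held. In this model a pmd_entry that wants the pte walk on a huge pmd would have to clear the huge flag itself.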
Then there won't be any code duplication, because all the duplication
you now have at the pmd level is literally just workarounds for the
fact that our current walker has this bug.
That "fix the pte walker" would be one preliminary patch that would
look something like the attached TOTALLY UNTESTED garbage.
I call it "garbage" because I really hope people take it just as what
it is: "something like this". It compiles for me, and I did try to
think it through, but I might have missed some big piece of the
picture when writing that patch.
And yes, this is a much bigger conceptual change for the VM layer, but
I really think our pagewalk code is actively buggy right now, and is
forcing users to do bad things because they work around the existing
limitations.
Hmm? Could some of the core mm people look over that patch?
And yes, I was tempted to move the proper pmd locking into the walker
too, and do

        ptl = pmd_trans_huge_lock(pmd, vma);
        if (ptl) {
                err = ops->pmd_entry(pmd, addr, next, walk);
                spin_unlock(ptl);
                ...

but while I think that's the correct thing to do in the long run, that
would have to be done together with changing all the existing
pmd_entry users. It would make the pmd_entry _solely_ handle the
hugepage case, and then you'd have to remove the locking in the
pmd_entry, and have to make the pte walking be a walker entry. But
that would _really_ clean things up, and would make things like
smaps_pte_range() much easier to read, and much more obvious (it would
be split into a smaps_pmd_range and smaps_pte_range, and the callbacks
wouldn't need to know about the complex locking).
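As a rough illustration of that second phase, here is another toy userspace model. Again this is not kernel code: the model_* names are invented, model_trans_huge_lock() only imitates the "lock and return true if huge" behaviour of pmd_trans_huge_lock() (the real one returns a spinlock pointer or NULL), and the two callbacks are only in the spirit of a split smaps_pmd_range / smaps_pte_range. The point is that the walker does all the locking and the callbacks do none.

```c
/* Toy userspace model of the "walker does the locking" phase; NOT kernel
 * code. Every model_* name is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PTES 4

struct model_pmd {
	bool huge;
	int lock_depth;			/* 1 while the modeled lock is held */
	int pte[MODEL_PTES];		/* 1 = present */
};

struct model_stats {
	int huge_pages;
	int small_pages;
	int calls_without_lock;
};

struct model_ops {
	void (*pmd_entry)(struct model_pmd *pmd, struct model_stats *st);
	void (*pte_entry)(struct model_pmd *pmd, int idx, struct model_stats *st);
};

/* Imitates pmd_trans_huge_lock(): locks and returns true only if huge. */
static bool model_trans_huge_lock(struct model_pmd *pmd)
{
	if (!pmd->huge)
		return false;
	pmd->lock_depth++;
	return true;
}

static void model_walk(struct model_pmd *pmd, const struct model_ops *ops,
		       struct model_stats *st)
{
	assert(pmd->lock_depth == 0);

	/* Huge case: pmd_entry runs under the lock and handles only this. */
	if (model_trans_huge_lock(pmd)) {
		if (ops->pmd_entry)
			ops->pmd_entry(pmd, st);
		pmd->lock_depth--;
		return;
	}

	if (!ops->pte_entry)
		return;

	/* Normal case: the walker holds the lock across the pte loop. */
	pmd->lock_depth++;
	for (int i = 0; i < MODEL_PTES; i++)
		ops->pte_entry(pmd, i, st);
	pmd->lock_depth--;
}

/* Callbacks contain no locking; they only record whether it was held. */
static void model_smaps_pmd(struct model_pmd *pmd, struct model_stats *st)
{
	if (pmd->lock_depth != 1)
		st->calls_without_lock++;
	st->huge_pages++;
}

static void model_smaps_pte(struct model_pmd *pmd, int idx,
			    struct model_stats *st)
{
	if (pmd->lock_depth != 1)
		st->calls_without_lock++;
	if (pmd->pte[idx])
		st->small_pages++;
}

static const struct model_ops model_smaps_ops = {
	.pmd_entry = model_smaps_pmd,
	.pte_entry = model_smaps_pte,
};
```

Walking one huge pmd and one normal pmd with model_smaps_ops hits each callback for exactly the case it is meant for, and neither callback ever has to touch the lock.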
So I think this is the right direction to move in, but I do want
people to think about this, and think about that next phase of doing
the pmd_trans_huge_lock too.
Comments?
Linus