Re: pipe/page fault oddness.

From: Linus Torvalds
Date: Wed Oct 01 2014 - 12:18:29 EST


On Wed, Oct 1, 2014 at 9:01 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> We need to get rid of it, and just make it the same as pte_protnone().
> And then the real protnone is in the vma flags, and if you actually
> ever get to a pte that is marked protnone, you know it's a numa page.

So I'd really suggest we do exactly that. Get rid of "pte_numa()"
entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
"pte_protnone()" helper to check for the "protnone" case (which on x86
is testing the _PAGE_PROTNONE bit, and on most other architectures is
just testing that the page has no access rights).
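
As a rough illustration of the two flavours of that helper, here is a
userspace toy model, not kernel code. Every toy_* name and every bit
position below is made up for the sketch; only the idea (a dedicated bit
on x86, "no access rights" elsewhere) comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy page table entry: just a 64-bit word with made-up flag bits. */
typedef uint64_t toy_pte_t;

#define TOY_PAGE_READ      (1ULL << 1)
#define TOY_PAGE_WRITE     (1ULL << 2)
#define TOY_PAGE_EXEC      (1ULL << 3)
#define TOY_PAGE_PROTNONE  (1ULL << 8)   /* x86-style dedicated bit (illustrative) */

/* x86-style check: test the dedicated protnone bit. */
bool toy_pte_protnone_x86(toy_pte_t pte)
{
    return (pte & TOY_PAGE_PROTNONE) != 0;
}

/*
 * Generic-style check: the entry grants no access rights at all.
 * A real implementation would also have to rule out empty or
 * swapped-out entries; that is left out of this sketch.
 */
bool toy_pte_protnone_generic(toy_pte_t pte)
{
    return (pte & (TOY_PAGE_READ | TOY_PAGE_WRITE | TOY_PAGE_EXEC)) == 0;
}
```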

Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
because they are brainless sh*t, and we just use

    ptent = ptep_modify_prot_start(mm, addr, pte);
    ptent = pte_modify(ptent, newprot);
    ptep_modify_prot_commit(mm, addr, pte, ptent);

reliably instead (where for the mknuma case "newprot" is PROT_NONE,
and for mknonnuma() it is vma->vm_page_prot). Yes, that means that you
have to pass in the vma to those functions, but that just makes sense
anyway.
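
The start/modify/commit sequence can be modelled in userspace roughly as
follows. This is a toy sketch under stated assumptions, not the kernel
implementation: the toy_* functions only stand in for
ptep_modify_prot_start(), pte_modify() and ptep_modify_prot_commit(),
the bit layout is invented, and the real pte_modify() preserves more
state (such as accessed/dirty bits) than this one does.

```c
#include <stdint.h>

typedef uint64_t toy_pte_t;
typedef uint64_t toy_pgprot_t;

#define TOY_PROT_MASK  0xfffULL        /* low 12 bits: protection flags (made up) */
#define TOY_PRESENT    (1ULL << 0)
#define TOY_RW         (1ULL << 1)
#define TOY_PROTNONE   (1ULL << 8)

/* Stand-in for ptep_modify_prot_start(): take the entry out of the table. */
toy_pte_t toy_modify_prot_start(toy_pte_t *ptep)
{
    toy_pte_t old = *ptep;
    *ptep = 0;
    return old;
}

/* Stand-in for pte_modify(): keep the frame number, swap the protection bits. */
toy_pte_t toy_pte_modify(toy_pte_t pte, toy_pgprot_t newprot)
{
    return (pte & ~TOY_PROT_MASK) | (newprot & TOY_PROT_MASK);
}

/* Stand-in for ptep_modify_prot_commit(): write the new entry back. */
void toy_modify_prot_commit(toy_pte_t *ptep, toy_pte_t pte)
{
    *ptep = pte;
}

/* One path covers both directions; only newprot differs. */
void toy_change_prot(toy_pte_t *ptep, toy_pgprot_t newprot)
{
    toy_pte_t ptent = toy_modify_prot_start(ptep);
    ptent = toy_pte_modify(ptent, newprot);
    toy_modify_prot_commit(ptep, ptent);
}
```

In this model the "mknuma" direction is toy_change_prot(&pte,
TOY_PROTNONE), and the "mknonnuma" direction passes whatever protection
the vma would normally give the page.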

And if that means that we lose the numa flag on mprotect etc, nobody sane cares.

Seriously, why can't we just do this, and throw away all the crap that
is "numa special case"? This would make all the random games in
change_pte_range() just go away entirely, because the whole NUMA thing
really wouldn't be a special case for the pte AT ALL any more. All it
would be is that a pte could be marked PROT_NONE even if the
vma->vm_flags aren't.
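
A sketch of how a fault path could tell the two cases apart under this
scheme. Again a userspace toy with hypothetical names (toy_classify_fault,
the TOY_VM_* flags and the enum are all invented here); it only shows the
decision being described: the vma says whether the mapping is really
inaccessible, so a protnone pte under an accessible vma has to be a numa
page.

```c
#include <stdbool.h>

#define TOY_VM_READ   (1UL << 0)
#define TOY_VM_WRITE  (1UL << 1)
#define TOY_VM_EXEC   (1UL << 2)

enum toy_fault_kind {
    TOY_FAULT_SEGV,       /* the mapping itself allows no access */
    TOY_FAULT_NUMA_HINT,  /* vma allows access, pte was marked protnone */
    TOY_FAULT_OTHER       /* anything else (not modelled here) */
};

enum toy_fault_kind toy_classify_fault(unsigned long vm_flags, bool pte_is_protnone)
{
    /* The vma is the authority: no access rights means a real PROT_NONE mapping. */
    if (!(vm_flags & (TOY_VM_READ | TOY_VM_WRITE | TOY_VM_EXEC)))
        return TOY_FAULT_SEGV;

    /* The vma allows access but the pte does not: this is a numa page. */
    if (pte_is_protnone)
        return TOY_FAULT_NUMA_HINT;

    return TOY_FAULT_OTHER;
}
```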

Please, please, please? The current _PAGE_NUMA really is a horrible
horrible thing, and may well be the source of this bug.

The fact that it took DaveJ a long time to trigger his lockup would be
entirely consistent with "you have to split a PROTNONE large page due
to memory pressure", so the problem with our current pte_mknuma() that
Hugh points out looks entirely possible to me.

Now, there may be some reason why it can't happen, but even in the
absence of this bug, I really think that _PAGE_NUMA has been a huge
mistake from day one.

Linus