Re: PTE access rules & abstraction

From: Benjamin Herrenschmidt
Date: Tue Sep 23 2008 - 02:51:06 EST



> A good first step might be to define some conventions. For example,
> define that set_pte*() *always* means setting a non-valid pte to either
> a new non-valid state (like a swap reference) or to a valid state.
> modify_pte() would modify the flags of a valid
> pte, giving a new valid pte. etc...

Yup. Or make it clear that ptep_set_access_flags() should only be used
to -relax- access (ie, set dirty, writeable, accessed, ... but not
remove any of them).

> It may be that a given architecture collapses some or all of these down
> to the same underlying functionality, but it would allow the core intent
> to be clearly expressed.
>
> What is the complete set of primitives we need? I also noticed that a
> number of the existing pagetable operations are used only once or twice
> in the core code; I wonder if we really need such special cases, or
> whether we can make each arch pte operation carry a bit more weight?

Yes, that was some of my concern. It's getting close to having one API
per call site :-)

> Also, rather than leaving all the rule enforcing to documentation and a
> maintainer, we should also consider having a debug mode which adds
> enough paranoid checks to each operation so that any rule breakage will
> fail obviously on all architectures.

We could do both.

Now, regarding operations, let's first find the major call sites, see
what I miss. I'm omitting free_* in memory.c as those are for freeing
pte pages, not accessing PTEs themselves. I'm also ignoring read-only
call sites and hugetlb for now.

* None-iterative accessors

- handle_pte_fault in memory.c, on "fixup" faults (pte is present and
it's not a COW), for fixing up DIRTY and ACCESSED (btw, could we make
that also fixup EXEC ? I would like this for some stuff I'm working on
at the moment, ie set it if the vma has VM_EXEC and it was lost from the
PTE as I might want to mask it out of PTEs under some circumstances).
Textbook usage of ptep_set_access_flags(), so that's fine.

- do_wp_page() in memory.c for COW or fixup of shared writeable mapping
writeable-ness. Doesn't overwrite existing PTE for COW anymore, it uses
clear_flush nowadays and fixup of shared writeable mapping uses
ptep_set_access_flags() as it should, so that's all good.

- insert_pfn() and insert_page() still in memory.c for fancy page
faults. Just a trivial set_pte_at() of a !present one, no big deal here

- RMAP ones ? Some ad-hoc stuff due to _notify thingies.

* Iterative accessors (some don't batch, maybe they could/should).

- zapping a mapping (zap_p*) in memory.c
- fork (copy_p*) in memory.c could batch better maybe ?
- setting linear user mappings (remap_p*) in memory.c, trivial
set_pte_at() on a range, pte's should be !present I think.
- mprotect (change_p*) in memory.c, which has the problem I mentioned
- moving page tables (move_p*), pretty trivial clear_flush + set_pte_at
- clear_regs_pte_range via walk_page_range in fs/proc/task_mmu.c, does
a test_and_clear_young, flushes mm afterward, could use some lazy stuff
so we can batch properly on ppc64.
- vmalloc, that's a bit special and kernel only, doesn't have nasty
races between creating/tearing down mappings vs. using them
- highmem I leave alone for now, it's mostly trivial set_pte_at &
flushing for normal kmap but kmap_atomic can be nasty, though it's arch
specific.
- some stuff in fremap I'm not too familiar with and I need to run...

What did I miss ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/