Re: anon_vma RFC2

From: Linus Torvalds
Date: Sat Mar 13 2004 - 11:15:52 EST




Ok, guys,
how about this anon-page suggestion?

I'm a bit nervous about the complexity issues in Andrea's current setup,
so I've been thinking about Rik's per-mm thing. And I think that there is
one very simple approach, which should work fine, and should have minimal
impact on the existing setup exactly because it is so simple.

Basic setup:
- each anonymous page is associated with exactly _one_ virtual address,
in a "anon memory group".

We put the virtual address (shifted down by PAGE_SHIFT) into
"page->index". We put the "anon memory group" pointer into
"page->mapping". We have a PAGE_ANONYMOUS flag to tell the
rest of the world about this.

- the anon memory group has a list of all mm's that it is associated
with.

- an "execve()" creates a new "anon memory group" and drops the old one.

- a mm copy operation just increments the reference count and adds the
new mm to the mm list for that anon memory group.

So now to do reverse mapping, we can take a page, and do

if (PageAnonymous(page)) {
struct anongroup *mmlist = (struct anongroup *)page->mapping;
unsigned long address = page->index << PAGE_SHIFT;
struct mm_struct *mm;

for_each_entry(mm, mmlist->anon_mms, anon_mm) {
.. look up page in page tables in "mm, address" ..
.. most of the time we may not even need to look ..
.. up the "vma" at all, just walk the page tables ..
}
} else {
/* Shared page */
.. look up page using the inode vma list ..
}

The above all works 99% of the time.

The only problem is mremap() after a fork(), and hell, we know that's a
special case anyway, and let's just add a few lines to copy_one_pte(),
which basically does:

if (PageAnonymous(page) && page->count > 1) {
newpage = alloc_page();
copy_page(page, newpage);
page = newpage;
}
/* Move the page to the new address */
page->index = address >> PAGE_SHIFT;

and now we have zero special cases.

The above should work very well. In most cases the "anongroup" will be
very small, and even when it's large (if somebody does a ton of forks
without any execve's), we only have _one_ address to check, and that is
pretty fast. A high-performance server would use threads, anyway. (And
quite frankly, _any_ algorithm will have this issue. Even rmap will have
exactly the same loop, although rmap skips any vm's where the page might
have been COW'ed or removed).

The extra COW in mremap() seems benign. Again, it should usually not even
trigger.

What do you think? To me, this seems to be a really simple approach..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/