Re: [PATCH 1/6] mm: tracking shared dirty pages

From: Hugh Dickins
Date: Thu Jun 22 2006 - 16:51:22 EST


On Mon, 19 Jun 2006, Peter Zijlstra wrote:
>
> People expressed the need to track dirty pages in shared mappings.
>
> Linus outlined the general idea of doing that through making clean
> writable pages write-protected and taking the write fault.
>
> Index: 2.6-mm/include/linux/mm.h
> ===================================================================
> --- 2.6-mm.orig/include/linux/mm.h 2006-06-14 10:29:04.000000000 +0200
> +++ 2.6-mm/include/linux/mm.h 2006-06-19 14:45:18.000000000 +0200
> @@ -182,6 +182,12 @@ extern unsigned int kobjsize(const void
> #define VM_SequentialReadHint(v) ((v)->vm_flags & VM_SEQ_READ)
> #define VM_RandomReadHint(v) ((v)->vm_flags & VM_RAND_READ)
>
> +static inline int is_shared_writable(unsigned int flags)
> +{
> + return (flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
> + (VM_SHARED|VM_WRITE);
> +}
> +

Andrew asked for the inclusion of VM_PFNMAP to be commented there,
but I don't believe that's enough: a function called "is_shared_writable"
should be testing precisely that, or people will misuse it.

Either you change the name to "is_shared_writable_but_not_pfnmap"
or somesuch, or you split out the VM_PFNMAP test, or you do away
with the function and make the tests explicit inline. As before,
my instinctive preference is the latter: I really want to see what's
being tested (especially in do_wp_page); but perhaps it'll just look
too ugly all over - give it a try and see.

> /*
> * mapping from the currently active vm_flags protection bits (the
> * low four bits) to a page protection mask..
> Index: 2.6-mm/mm/memory.c
> ===================================================================
> --- 2.6-mm.orig/mm/memory.c 2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/memory.c 2006-06-19 16:20:06.000000000 +0200
> @@ -938,6 +938,12 @@ struct page *follow_page(struct vm_area_
> pte = *ptep;
> if (!pte_present(pte))
> goto unlock;
> + /*
> + * This is not fully correct in the light of trapping write faults
> + * for writable shared mappings. However since we're going to mark
> + * the page dirty anyway some few lines downward, we might as well
> + * take the write fault now.
> + */

I don't understand what you're getting at here: please explain,
what is not fully correct and why? In mail first, then we can
decide what the comment should say, or if it should be removed.
follow_page isn't making a pte writable, so what's the issue?

> if ((flags & FOLL_WRITE) && !pte_write(pte))
> goto unlock;
> page = vm_normal_page(vma, address, pte);
> @@ -1458,13 +1464,14 @@ static int do_wp_page(struct mm_struct *
> {
> struct page *old_page, *new_page;
> pte_t entry;
> - int reuse, ret = VM_FAULT_MINOR;
> + int reuse = 0, ret = VM_FAULT_MINOR;
> + struct page *dirty_page = NULL;
>
> old_page = vm_normal_page(vma, address, orig_pte);
> if (!old_page)
> goto gotten;
>
> - if (unlikely(vma->vm_flags & VM_SHARED)) {
> + if (unlikely(is_shared_writable(vma->vm_flags))) {

Most interesting line in the series, yes, and I'd find it
easier to think through if it showed the flags test explicitly:
	if ((vma->vm_flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
			(VM_SHARED|VM_WRITE))

Yes, Andrew, you're right it's a change in behaviour from David's
page_mkwrite patch. I've realized that when I was originally
reviewing David's patch, I believed do_wp_page was mistaken to be
doing COW on VM_SHARED areas. But Linus has since asserted very
forcefully that it's intentional, that ptrace poke on a VM_SHARED
area which is currently !VM_WRITE should COW it, so I mentioned
that to Peter.

Has he got the test right there now? Ummm... maybe: my brain
exploded weeks ago. Several strangenesses collide here; I'll
try again tomorrow, or maybe others will argue it to certainty
before then.

> if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
> /*
> * Notify the address space that the page is about to
> Index: 2.6-mm/mm/mmap.c
> ===================================================================
> --- 2.6-mm.orig/mm/mmap.c 2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/mmap.c 2006-06-19 15:41:53.000000000 +0200
> @@ -25,6 +25,7 @@
> #include <linux/mount.h>
> #include <linux/mempolicy.h>
> #include <linux/rmap.h>
> +#include <linux/backing-dev.h>
>
> #include <asm/uaccess.h>
> #include <asm/cacheflush.h>
> @@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file
> struct rb_node ** rb_link, * rb_parent;
> int accountable = 1;
> unsigned long charged = 0, reqprot = prot;
> + struct address_space *mapping = NULL;
>
> if (file) {
> if (is_file_hugepages(file))
> @@ -1084,18 +1086,13 @@ munmap_back:
> error = file->f_op->mmap(file, vma);
> if (error)
> goto unmap_and_free_vma;
> +

Do you really need this blank line?

> } else if (vm_flags & VM_SHARED) {
> error = shmem_zero_setup(vma);
> if (error)
> goto free_vma;
> }
>
> - /* Don't make the VMA automatically writable if it's shared, but the
> - * backer wishes to know when pages are first written to */
> - if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> - vma->vm_page_prot =
> - protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
> -
> /* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
> * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
> * that memory reservation must be checked; but that reservation
> @@ -1113,6 +1110,31 @@ munmap_back:
> pgoff = vma->vm_pgoff;
> vm_flags = vma->vm_flags;
>
> + /*
> + * Tracking of dirty pages for shared writable mappings. Do this by
> + * write protecting writable pages, and mark dirty in the write fault.
> + *
> + * Modify vma->vm_page_prot (the default protection for new pages)
> + * to this effect.
> + *
> + * Cannot do before because the condition depends on:
> + * - backing_dev_info having the right capabilities
> + * (set by f_op->open())

Is that so, backing_dev_info set by f_op->open()?
And how would that be a problem here if it were so?

> + * - vma->vm_flags being fully set
> + * (finished in f_op->mmap(), which could call remap_pfn_range())
> + *
> + * Also, cannot reset vma->vm_page_prot from vma->vm_flags because
> + * f_op->mmap() can modify it.
> + */
> + if (is_shared_writable(vm_flags) && vma->vm_file)
> + mapping = vma->vm_file->f_mapping;
> + if ((mapping && mapping_cap_account_dirty(mapping)) ||
> + (vma->vm_ops && vma->vm_ops->page_mkwrite))

The only way "mapping" might be set is just above.
Wouldn't it all be clearer (though more indented) if you said

	if (is_shared_writable(vm_flags) && vma->vm_file) {
		mapping = vma->vm_file->f_mapping;
		if ((mapping && mapping_cap_account_dirty(mapping)) ||
		    (vma->vm_ops && vma->vm_ops->page_mkwrite)) {
			vma->vm_page_prot = whatever;
		}
	}

Or no need for "mapping" here at all if you change
mapping_cap_account_dirty(vma->vm_file->f_mapping)
to do the right thing with NULL.

> + vma->vm_page_prot =
> + __pgprot(pte_val
> + (pte_wrprotect
> + (__pte(pgprot_val(vma->vm_page_prot)))));
> +

In other mail I've suggested saving vm_page_prot above, and
changing it here only if the driver's ->mmap did not change it.

I remain uneasy about interfering with the permissions expected by
strange drivers, but can't really justify my paranoia. Certainly
you're right to exclude VM_PFNMAPs from this interference, that's
important; I'd be less uneasy if you also exclude VM_INSERTPAGEs,
they're strange too - but at least they're dealing with proper struct
pages, so should be able to handle an unexpected do_wp_page; that
leaves the driver nopage cases, which again should be okay now you're
(one way or another) protecting specially added vm_page_prot flags.

I guess I'm just paranoid; it irritates me that we do not have
the right backing_dev_infos in place, and have to hack around it.

> if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
> vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
> file = vma->vm_file;
> Index: 2.6-mm/mm/mprotect.c
> ===================================================================
> --- 2.6-mm.orig/mm/mprotect.c 2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/mprotect.c 2006-06-19 16:19:42.000000000 +0200
> @@ -21,6 +21,7 @@
> #include <linux/syscalls.h>
> #include <linux/swap.h>
> #include <linux/swapops.h>
> +#include <linux/backing-dev.h>
> #include <asm/uaccess.h>
> #include <asm/pgtable.h>
> #include <asm/cacheflush.h>
> @@ -124,6 +125,7 @@ mprotect_fixup(struct vm_area_struct *vm
> long nrpages = (end - start) >> PAGE_SHIFT;
> unsigned long charged = 0;
> unsigned int mask;
> + struct address_space *mapping = NULL;
> pgprot_t newprot;
> pgoff_t pgoff;
> int error;
> @@ -179,7 +181,10 @@ success:
> /* Don't make the VMA automatically writable if it's shared, but the
> * backer wishes to know when pages are first written to */
> mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
> - if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> + if (is_shared_writable(newflags) && vma->vm_file)
> + mapping = vma->vm_file->f_mapping;
> + if ((mapping && mapping_cap_account_dirty(mapping)) ||
> + (vma->vm_ops && vma->vm_ops->page_mkwrite))

Similar remarks on indenting,
or letting mapping_cap_account_dirty take NULL mapping.

> mask &= ~VM_SHARED;
>
> newprot = protection_map[newflags & mask];
> Index: 2.6-mm/mm/rmap.c
> ===================================================================
> --- 2.6-mm.orig/mm/rmap.c 2006-06-14 10:29:07.000000000 +0200
> +++ 2.6-mm/mm/rmap.c 2006-06-19 14:45:18.000000000 +0200
> @@ -53,6 +53,7 @@
> #include <linux/rmap.h>
> #include <linux/rcupdate.h>
> #include <linux/module.h>
> +#include <linux/backing-dev.h>
>
> #include <asm/tlbflush.h>
>
> @@ -434,6 +435,73 @@ int page_referenced(struct page *page, i
> return referenced;
> }
>
> +static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long address;
> + pte_t *pte, entry;
> + spinlock_t *ptl;
> + int ret = 0;
> +
> + address = vma_address(page, vma);
> + if (address == -EFAULT)
> + goto out;
> +
> + pte = page_check_address(page, mm, address, &ptl);
> + if (!pte)
> + goto out;
> +
> + if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
> + goto unlock;
> +
> + entry = ptep_get_and_clear(mm, address, pte);
> + entry = pte_mkclean(entry);
> + if (protect)
> + entry = pte_wrprotect(entry);
> + ptep_establish(vma, address, pte, entry);
> + lazy_mmu_prot_update(entry);
> + ret = 1;
> +
> +unlock:
> + pte_unmap_unlock(pte, ptl);
> +out:
> + return ret;
> +}
> +
> +static int page_mkclean_file(struct address_space *mapping, struct page *page)
> +{
> + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> + struct vm_area_struct *vma;
> + struct prio_tree_iter iter;
> + int ret = 0;
> +
> + BUG_ON(PageAnon(page));
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> + int protect = mapping_cap_account_dirty(mapping) &&
> + is_shared_writable(vma->vm_flags);
> + ret += page_mkclean_one(page, vma, protect);

You have a good point here, one I'd completely missed: because a vma
may have been recently mprotected !VM_WRITE, you have to check readonly
mappings too. Perhaps worth a comment. But I think "is_shared_writable"
is not the best test here: just test for VM_SHARED vmas, they're the
only ones which can be mprotected to/from shared writable. And then
I think you don't need to pass down an additional "protect" argument?
It's only being called for mapping_cap_account_dirty mappings anyway,
isn't it?

> + }
> + spin_unlock(&mapping->i_mmap_lock);
> + return ret;
> +}

Hugh