Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()

From: Linus Torvalds
Date: Thu Jan 07 2010 - 13:45:31 EST




On Thu, 7 Jan 2010, Linus Torvalds wrote:
>
> For example: there's no real reason why we take mmap_sem for writing when
> extending an existing vma. And while 'brk()' is a very oldfashioned way of
> doing memory management, it's still quite common. So rather than looking
> at subtle lockless algorithms, why not look at doing the common cases of
> an extending brk? Make that one take the mmap_sem for _reading_, and then
> do the extending of the brk area with a simple cmpxchg or something?

I didn't use cmpxchg, because we actually want to update both
'current->brk' _and_ the vma->vm_end atomically, so here's a totally
untested patch that uses the page_table_lock spinlock for it instead (it
could be a new spinlock, not worth it).

It's also totally untested and might be horribly broken. But you get the
idea.

We could probably do things like this in regular mmap() too for the
"extend a mmap" case. brk() is just especially simple.

Linus

---
mm/mmap.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 79 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ee22989..3d07e5f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -242,23 +242,14 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
return next;
}

-SYSCALL_DEFINE1(brk, unsigned long, brk)
+static long slow_brk(unsigned long brk)
{
unsigned long rlim, retval;
unsigned long newbrk, oldbrk;
struct mm_struct *mm = current->mm;
- unsigned long min_brk;

down_write(&mm->mmap_sem);

-#ifdef CONFIG_COMPAT_BRK
- min_brk = mm->end_code;
-#else
- min_brk = mm->start_brk;
-#endif
- if (brk < min_brk)
- goto out;
-
/*
* Check against rlimit here. If this check is done later after the test
* of oldbrk with newbrk then it can escape the test and let the data
@@ -297,6 +288,84 @@ out:
return retval;
}

+/*
+ * NOTE NOTE NOTE!
+ *
+ * We do all this speculatively with just the read lock held.
+ * If anything goes wrong, we release the read lock, and punt
+ * to the traditional write-locked version.
+ *
+ * Here "goes wrong" includes:
+ * - having to unmap the area and actually shrink it
+ * - the final cmpxchg doesn't succeed
+ * - the old brk area wasn't a simple anonymous vma
+ * etc etc
+ */
+static struct vm_area_struct *ok_to_extend_brk(struct mm_struct *mm, unsigned long cur_brk, unsigned long brk)
+{
+ struct vm_area_struct *vma, *nextvma;
+
+ nextvma = find_vma_prev(mm, cur_brk, &vma);
+ if (vma) {
+ /*
+ * Needs to be an anonymous vma that ends at the current brk,
+ * and with no following vma that would start before the
+ * new brk
+ */
+ if (vma->vm_file || cur_brk != vma->vm_end || (nextvma && brk > nextvma->vm_start))
+ vma = NULL;
+ }
+ return vma;
+}
+
+SYSCALL_DEFINE1(brk, unsigned long, brk)
+{
+ unsigned long cur_brk, min_brk;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
+ down_read(&mm->mmap_sem);
+ cur_brk = mm->brk;
+#ifdef CONFIG_COMPAT_BRK
+ min_brk = mm->end_code;
+#else
+ min_brk = mm->start_brk;
+#endif
+ if (brk < min_brk)
+ goto out;
+
+ brk = PAGE_ALIGN(brk);
+ cur_brk = PAGE_ALIGN(cur_brk);
+
+ if (brk < cur_brk)
+ goto slow_case;
+ if (brk == cur_brk)
+ goto out;
+
+ vma = ok_to_extend_brk(mm, cur_brk, brk);
+ if (!vma)
+ goto slow_case;
+
+ spin_lock(&mm->page_table_lock);
+ if (vma->vm_end == cur_brk) {
+ vma->vm_end = brk;
+ mm->brk = brk;
+ cur_brk = brk;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ if (cur_brk != brk)
+ goto slow_case;
+
+out:
+ up_read(&mm->mmap_sem);
+ return cur_brk;
+
+slow_case:
+ up_read(&mm->mmap_sem);
+ return slow_brk(brk);
+}
+
#ifdef DEBUG_MM_RB
static int browse_rb(struct rb_root *root)
{
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/