downgrade_write replacement in remap_file_pages

From: Andrea Arcangeli
Date: Tue Jun 08 2004 - 11:14:34 EST


Apparently downgrade_write deadlocks the kernel on the mmap_sem under
load. I suspect some rwsem bug. Anyway, it's only a matter of time
before I replace the rwsem implementation in my tree with my
spinlock-based common-code implementation again, which I can understand
trivially (unlike the current code). I don't mind giving up a
micro-optimization when the code is this complicated, so I'm not too
bothered about fixing whatever bug is currently in downgrade_write.

In the meantime, to work around the deadlock (and to verify that this
makes the deadlock go away, which it did), I implemented this patch. It
doesn't fix downgrade_write, but I'm posting it here because I believe
it's much better code regardless of the downgrade_write issue. With this
patch we'll never take down_write again in production: down_write will
be taken only once when the db or the simulator starts up (during the
very first page fault) and never again, in turn providing (at least in
theory) better runtime scalability.
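
For reference, here is a minimal user-space sketch of the locking
pattern the patch uses (serve the common case under the read lock, and
when the one-time state change is needed, drop the read lock, take the
write lock and retry), written against POSIX rwlocks rather than the
kernel's rw_semaphore; the names (remap_state, do_populate,
STATE_NONLINEAR, nonlinear_request) are just illustrative, they are not
kernel code.

#include <pthread.h>
#include <stdbool.h>

struct remap_state {
	/* initialize with PTHREAD_RWLOCK_INITIALIZER; stands in for mm->mmap_sem */
	pthread_rwlock_t lock;
	unsigned long flags;	/* stands in for vma->vm_flags */
};

#define STATE_NONLINEAR 0x1

/* placeholder for the long-running ->populate step done under the read lock */
static void do_populate(struct remap_state *st)
{
	(void) st;
}

static int remap(struct remap_state *st, bool nonlinear_request)
{
	int has_write_lock = 0;

	pthread_rwlock_rdlock(&st->lock);
 retry:
	if (nonlinear_request && !(st->flags & STATE_NONLINEAR)) {
		/*
		 * The flag flip needs the write lock.  A read lock can't
		 * be upgraded in place, so drop it, take the write lock
		 * and retry: the condition must be re-checked because
		 * another thread may have run in the window between the
		 * unlock and the lock.
		 */
		if (!has_write_lock) {
			pthread_rwlock_unlock(&st->lock);
			pthread_rwlock_wrlock(&st->lock);
			has_write_lock = 1;
			goto retry;
		}
		st->flags |= STATE_NONLINEAR;
	}

	/* the common path (every later call) runs under the read lock only */
	do_populate(st);

	pthread_rwlock_unlock(&st->lock);
	return 0;
}

The goto retry is the important detail: since the read lock is released
before the write lock is taken, another thread can change the state in
between, so the check has to be repeated once the write lock is held -
exactly what the kernel patch does by re-running find_vma() and the
VM_NONLINEAR test after the retry label.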

Index: linux-2.5/mm/fremap.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/fremap.c,v
retrieving revision 1.24
diff -u -p -r1.24 fremap.c
--- linux-2.5/mm/fremap.c 23 May 2004 05:07:26 -0000 1.24
+++ linux-2.5/mm/fremap.c 8 Jun 2004 15:38:21 -0000
@@ -161,6 +161,7 @@ asmlinkage long sys_remap_file_pages(uns
 	unsigned long end = start + size;
 	struct vm_area_struct *vma;
 	int err = -EINVAL;
+	int has_write_lock = 0;
 
 	if (__prot)
 		return err;
@@ -181,7 +182,8 @@ asmlinkage long sys_remap_file_pages(uns
 #endif
 
 	/* We need down_write() to change vma->vm_flags. */
-	down_write(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+ retry:
 	vma = find_vma(mm, start);
 
 	/*
@@ -198,8 +200,13 @@ asmlinkage long sys_remap_file_pages(uns
 				end <= vma->vm_end) {
 
 		/* Must set VM_NONLINEAR before any pages are populated. */
-		if (pgoff != linear_page_index(vma, start) &&
-		    !(vma->vm_flags & VM_NONLINEAR)) {
+		if (unlikely(pgoff != linear_page_index(vma, start) && !(vma->vm_flags & VM_NONLINEAR))) {
+			if (!has_write_lock) {
+				up_read(&mm->mmap_sem);
+				down_write(&mm->mmap_sem);
+				has_write_lock = 1;
+				goto retry;
+			}
 			mapping = vma->vm_file->f_mapping;
 			spin_lock(&mapping->i_mmap_lock);
 			flush_dcache_mmap_lock(mapping);
@@ -212,8 +219,6 @@ asmlinkage long sys_remap_file_pages(uns
 			spin_unlock(&mapping->i_mmap_lock);
 		}
 
-		/* ->populate can take a long time, so downgrade the lock. */
-		downgrade_write(&mm->mmap_sem);
 		err = vma->vm_ops->populate(vma, start, size,
 					    vma->vm_page_prot,
 					    pgoff, flags & MAP_NONBLOCK);
@@ -223,10 +228,11 @@ asmlinkage long sys_remap_file_pages(uns
 		 * it after ->populate completes, and that would prevent
 		 * downgrading the lock. (Locks can't be upgraded).
 		 */
+	}
+	if (likely(!has_write_lock))
 		up_read(&mm->mmap_sem);
-	} else {
+	else
 		up_write(&mm->mmap_sem);
-	}
 
 	return err;
 }