Re: [patch 9/9] mm: fix pagecache write deadlocks

From: Andrew Morton
Date: Fri Feb 02 2007 - 18:54:08 EST


On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
Nick Piggin <npiggin@xxxxxxx> wrote:

> Modify the core write() code so that it won't take a pagefault while holding a
> lock on the pagecache page. There are a number of different deadlocks possible
> if we try to do such a thing:
>
> 1. generic_buffered_write
> 2. lock_page
> 3. prepare_write
> 4. unlock_page+vmtruncate
> 5. copy_from_user
> 6. mmap_sem(r)
> 7. handle_mm_fault
> 8. lock_page (filemap_nopage)
> 9. commit_write
> 10. unlock_page
>
> a. sys_munmap / sys_mlock / others
> b. mmap_sem(w)
> c. make_pages_present
> d. get_user_pages
> e. handle_mm_fault
> f. lock_page (filemap_nopage)
>
> 2,8 - recursive deadlock if page is same
> 2,8;2,8 - ABBA deadlock is page is different
> 2,6;b,f - ABBA deadlock if page is same
>
> The solution is as follows:
> 1. If we find the destination page is uptodate, continue as normal, but use
> atomic usercopies which do not take pagefaults and do not zero the uncopied
> tail of the destination. The destination is already uptodate, so we can
> commit_write the full length even if there was a partial copy: it does not
> matter that the tail was not modified, because if it is dirtied and written
> back to disk it will not cause any problems (uptodate *means* that the
> destination page is as new or newer than the copy on disk).
>
> 1a. The above requires that fault_in_pages_readable correctly returns access
> information, because atomic usercopies cannot distinguish between
> non-present pages in a readable mapping, from lack of a readable mapping.
>
> 2. If we find the destination page is non uptodate, unlock it (this could be
> made slightly more optimal), then find and pin the source page with
> get_user_pages. Relock the destination page and continue with the copy.
> However, instead of a usercopy (which might take a fault), copy the data
> via the kernel address space.
>

Oh what a mess we're making :(

Unfortunately, write() into a non-uptodate page is very much the common
case. We've always tried to avoid doing a pte-walk in the write() path to
fix this bug. Careful performance testing is needed here so we can assess
the impact. For threaded applications, simply the taking of mmap_sem might
be the biggest problem.

And I can't think of any tricks we can play to avoid doing the pte-walk in
most cases. For example, we don't yet have a page to run page_mapped()
against.

> break;
> }
>
> + /*
> + * non-uptodate pages cannot cope with short copies, and we
> + * cannot take a pagefault with the destination page locked.
> + * So pin the source page to copy it.
> + */
> + if (!PageUptodate(page)) {
> + unlock_page(page);
> +
> + bytes = min(bytes, PAGE_CACHE_SIZE -
> + ((unsigned long)buf & ~PAGE_CACHE_MASK));
> +
> + /*
> + * Cannot get_user_pages with a page locked for the
> + * same reason as we can't take a page fault with a
> + * page locked (as explained below).
> + */
> + down_read(&current->mm->mmap_sem);
> + status = get_user_pages(current, current->mm,
> + (unsigned long)buf & PAGE_CACHE_MASK, 1,
> + 0, 0, &src_page, NULL);
> + up_read(&current->mm->mmap_sem);
> + if (status != 1) {
> + page_cache_release(page);
> + break;
> + }
> +
> + lock_page(page);
> + if (!page->mapping) {

Hopefully this can't happen? If it can, who went and took our page off the
mapping? Reclaim? The elevated page_count will prevent that?

> + unlock_page(page);
> + page_cache_release(page);
> + page_cache_release(src_page);
> + continue;
> + }
> +
> + }
> +
> status = a_ops->prepare_write(file, page, offset, offset+bytes);
> if (unlikely(status))
> goto fs_write_aop_error;
>
> - copied = filemap_copy_from_user(page, offset,
> + if (!src_page) {
> + /*
> + * Must not enter the pagefault handler here, because
> + * we hold the page lock, so we might recursively
> + * deadlock on the same lock, or get an ABBA deadlock
> + * against a different lock, or against the mmap_sem
> + * (which nests outside the page lock). So increment
> + * preempt count, and use _atomic usercopies.
> + *
> + * The page is uptodate so we are OK to encounter a
> + * short copy: if unmodified parts of the page are
> + * marked dirty and written out to disk, it doesn't
> + * really matter.
> + */
> + pagefault_disable();
> + copied = filemap_copy_from_user_atomic(page, offset,
> cur_iov, nr_segs, iov_offset, bytes);
> + pagefault_enable();
> + } else {
> + char *src, *dst;
> + src = kmap(src_page);
> + dst = kmap(page);
> + memcpy(dst + offset,
> + src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> + bytes);
> + kunmap(page);
> + kunmap(src_page);
> + copied = bytes;
> + }

As you say, we don't want to do this: taking two kmap()s at the same time
leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
to give one back.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/