Re: [PATCH 2/4] AFS: Add a function to excise a rejected write fromthe pagecache
From: Andrew Morton
Date: Thu May 24 2007 - 16:38:42 EST
On Wed, 23 May 2007 20:15:24 +0100
David Howells <dhowells@xxxxxxxxxx> wrote:
> Add a function - cancel_rejected_write() - to excise a rejected write from the
> pagecache. This function is related to the truncation family of routines. It
> permits the pages modified by a network filesystem client (such as AFS) to be
> excised and discarded from the pagecache if the attempt to write them back to
> the server fails.
Could you please flesh this out a bit, from a higher level?
As in: what does it mean for a server to "reject" a write? What's actually
going on here? Why does the server do this? I assume that the application
will see a short write or an EIO or something?
How does this interact with MAP_SHARED?
Do we expect that any other networked filesystem would want to do this?
(and if not, why not?) Or do they already attempt to do it in some other
fashion?
> The dirty and writeback states of the afflicted pages are cancelled and the
> pages themselves are detached for recycling. All PTEs referring to those
> pages are removed.
>
> Note that the locking is tricky as it's very easy to deadlock against
> truncate() and other routines once the pages have been unlocked as part of the
> writeback process. To this end, the PG_error flag is set, then the
> PG_writeback flag is cleared, and only *then* can lock_page() be called.
hm.
> Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
> ---
>
> include/linux/mm.h | 5 ++-
> mm/truncate.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 86 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e4183c6..73688d4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1086,12 +1086,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
>
> extern unsigned long do_brk(unsigned long, unsigned long);
>
> -/* filemap.c */
> -extern unsigned long page_unuse(struct page *);
> +/* truncate.c */
> extern void truncate_inode_pages(struct address_space *, loff_t);
> extern void truncate_inode_pages_range(struct address_space *,
> loff_t lstart, loff_t lend);
> +extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
>
> +/* filemap.c */
> /* generic vm_area_ops exported for stackable file systems */
> extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
> extern int filemap_populate(struct vm_area_struct *, unsigned long,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 4fbe1a2..f742096 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -443,3 +443,86 @@ int invalidate_inode_pages2(struct address_space *mapping)
> return invalidate_inode_pages2_range(mapping, 0, -1);
> }
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> +
> +/*
> + * Cancel that part of a rejected write that affects a particular page
> + */
> +static void cancel_rejected_page(struct address_space *mapping,
> + struct page *page, pgoff_t *_next)
> +{
> + if (!TestSetPageError(page)) {
> + /* can't lock the page until we've cleared PG_writeback lest we
> + * deadlock with truncate (amongst other things) */
> + end_page_writeback(page);
> + if (page->mapping == mapping) {
> + lock_page(page);
> + if (page->mapping == mapping) {
> + truncate_complete_page(mapping, page);
> + *_next = page->index + 1;
> + }
> + unlock_page(page);
> + }
> + } else if (PageWriteback(page) || PageDirty(page)) {
> + BUG();
> + }
> +}
Why can't we just run the end_page_writeback() unconditionally?
PG_writeback _must_ be set here, and it is the caller who set PG_writeback,
so this thread of control "owns" that flag.
Also, are you really really sure that there is no way in which PG_writeback
can get itself set again after the end_page_writeback()? I'd have thought
that we should be doing a wait_on_page_writeback() after the lock_page()
there. Remember, other filesystems might want to be calling this, so we
shouldn't be designing around AFS implementation specifics.
> +/**
> + * cancel_rejected_write - Cancel a write on a contiguous set of pages
> + * @mapping: mapping affected
> + * @start: first page in set
> + * @end: last page in set
> + *
> + * Cancel a write of a contiguous set of pages when the writeback was rejected
> + * by the target medium or server.
> + *
> + * The pages in question are detached and discarded from the pagecache, and the
> + * writeback and dirty states are cleared prior to invalidation. The caller
> + * must make sure that all the pages in the range are present in the pagecache,
> + * and the caller must hold PG_writeback on each of them. NOTE! All the pages
> + * are locked and unlocked as part of this process, so the caller must take
> + * care to avoid deadlock.
> + *
> + * The PTEs pointing to those pages are also cleared, leading to the PTEs being
> + * reset when new pages are allocated and the contents reloaded.
> + */
> +void cancel_rejected_write(struct address_space *mapping,
> + pgoff_t start, pgoff_t end)
> +{
> + struct pagevec pvec;
> + pgoff_t n;
> + int i;
> +
> + BUG_ON(mapping->nrpages < end - start + 1);
> +
> + /* dispose of any PTEs pointing to the affected pages */
> + unmap_mapping_range(mapping,
> + (loff_t)start << PAGE_CACHE_SHIFT,
> + (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
> + 0);
> +
> + pagevec_init(&pvec, 0);
> + do {
> + cond_resched();
> + n = end - start + 1;
> + if (n > PAGEVEC_SIZE)
> + n = PAGEVEC_SIZE;
> + n = pagevec_lookup(&pvec, mapping, start, n);
> + for (i = 0; i < n; i++) {
> + struct page *page = pvec.pages[i];
> +
> + if (page->index < start || page->index > end)
> + continue;
> + start++;
> + cancel_rejected_page(mapping, page, &start);
> + }
> + pagevec_release(&pvec);
> + } while (start - 1 < end);
> +
> + /* dispose of any new PTEs pointing to the affected pages */
> + unmap_mapping_range(mapping,
> + (loff_t)start << PAGE_CACHE_SHIFT,
> + (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
> + 0);
> +}
> +EXPORT_SYMBOL_GPL(cancel_rejected_write);
hm, is the pte-unmapping here completely solid? Are there any race windows
in which a faulter can end up owning, say, an anonymous page? We've had
heaps of problems with that sort of thing in generic_file_direct_IO() and I
don't expect it's this easy.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/