Re: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to theinode where possible
From: Andrew Morton
Date: Fri Jun 11 2010 - 02:11:36 EST
On Tue, 8 Jun 2010 10:02:24 +0100 Mel Gorman <mel@xxxxxxxxx> wrote:
> Page reclaim cleans individual pages using a_ops->writepage() because from
> the VM perspective, it is known that pages in a particular zone must be freed
> soon, it considers the target page to be the oldest and it does not want
> to wait while background flushers cleans other pages. From a filesystem
> perspective this is extremely inefficient as it generates a very seeky
> IO pattern leading to the perverse situation where it can take longer to
> clean all dirty pages than it would have otherwise.
>
> This patch recognises that there are cases where a number of pages
> belonging to the same inode are being written out. When this happens and
> writepages() is implemented, the range of pages will be written out with
> a_ops->writepages. The inode is pinned and the page lock released before
> submitting the range to the filesystem. While this potentially means that
> more pages are cleaned than strictly necessary, the expectation is that the
> filesystem will be able to writeout the pages more efficiently and improve
> overall performance.
>
> ...
>
> + /* Write single page */
> + switch (write_reclaim_page(cursor, mapping, PAGEOUT_IO_ASYNC)) {
> + case PAGE_KEEP:
> + case PAGE_ACTIVATE:
> + case PAGE_CLEAN:
> + unlock_page(cursor);
> + break;
> + case PAGE_SUCCESS:
> + break;
> + }
> + } else {
> + /* Grab inode under page lock before writing range */
> + struct inode *inode = igrab(mapping->host);
> + unlock_page(cursor);
> + if (inode) {
> + do_writepages(mapping, &wbc);
> + iput(inode);
Buggy.
I did this, umm ~8 years ago and ended up reverting it because it was
complex and didn't seem to buy us anything. Of course, that was before
we broke the VM and started writing out lots of LRU pages. That code
was better than your code - it grabbed the address_space and did
writearound around the target page.
The reason this code is buggy is that under extreme memory pressure
(<oldfart>the sort of testing nobody does any more</oldfart>) it can be
the case that this iput() is the final iput() on this inode.
Now go take a look at iput_final(), which I bet has never been executed
on this path in your testing. It takes a large number of high-level
VFS locks. Locks which cannot be taken from deep within page reclaim
without causing various deadlocks.
I did solve that problem before reverting it all but I forget how. By
holding a page lock to pin the address_space rather than igrab(),
perhaps. Go take a look - it was somewhere between 2.5.1 and 2.5.10 if
I vaguely recall correctly.
Or don't take a look - we shouldn't need to do any of this anyway.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/