Re: mmap over nfs leads to excessive system load

From: Andrew Morton
Date: Wed Nov 16 2005 - 02:48:42 EST


Kenny Simpson <theonetruekenny@xxxxxxxxx> wrote:
>
> I ran the same test again against 2.6.15-rc, and got pretty much the same thing. It starts nice
> and fast (30+MB/s, but drops down to under 10MB/s with the system time pegging one CPU).
>
> Here is the oprofile result:
>
> CPU: P4 / Xeon with 2 hyper-threads, speed 2658.47 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask
> of 0x01 (mandatory) count 100000
> samples % symbol name
> 412585 14.6687 find_get_pages_tag
> 343898 12.2267 mpage_writepages
> 290144 10.3155 release_pages
> 288631 10.2617 unlock_page
> 286181 10.1746 pci_conf1_write
> 267619 9.5147 clear_page_dirty_for_io
> 128128 4.5554 __lookup_tag
> 120895 4.2982 page_waitqueue
> 52739 1.8750 _spin_lock_irqsave
> 43623 1.5509 skb_copy_bits
> 30157 1.0722 __wake_up_bit
> 29973 1.0656 _read_lock_irqsave
>

Your application walks the file in 2MB hunks, doing ftruncate() each time
to expand the file by another 2MB.

nfs_setattr() implements the truncate. It syncs the whole file, using
filemap_write_and_wait() (that seems a bit suboptimal. All we're doing is
increasing i_size??)

So filemap_write_and_wait() has to write 2MB's worth of pages. Problem is,
_all_ the pages, even the 99% which are clean are tagged as dirty in the
pagecache radix tree. So find_get_pages_tag() ends up visiting each page
in the file, and blows much CPU doing so.

The writeout happens in mpage_writepages(), which uses
clear_page_dirty_for_io() to clear PG_dirty. But it doesn't clear the
dirty tag in the radix tree. It relies upon the filesystem to do the right
thing later on. Which is all very unpleasant, sorry. See the explanatory
comment over clear_page_dirty_for_io().

nfs_writepage() doesn't do any of the things which that comment says it
should, hence the radix tree tags are getting out of sync, hence this
problem.

NFS does strange, incomprehensible-to-little-akpms things in its writeout
path. Ideally, it should run set_page_writeback() prior to unlocking the
page and end_page_writeback() when I/O completes. That'll keep the VM
happier while fixing this performance glitch.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/