Re: Downsides to madvise/fadvise(willneed) for application startup
From: Wu Fengguang
Date: Wed Apr 07 2010 - 03:39:04 EST
On Wed, Apr 07, 2010 at 10:54:58AM +0800, Taras Glek wrote:
> On 04/06/2010 07:24 PM, Wu Fengguang wrote:
> > Hi Taras,
> > On Tue, Apr 06, 2010 at 05:51:35PM +0800, Johannes Weiner wrote:
> >> On Mon, Apr 05, 2010 at 03:43:02PM -0700, Taras Glek wrote:
> >>> Hello,
> >>> I am working on improving Mozilla startup times. It turns out that page
> >>> faults(caused by lack of cooperation between user/kernelspace) are the
> >>> main cause of slow startup. I need some insights from someone who
> >>> understands linux vm behavior.
> > How about improve Fedora (and other distros) to preload Mozilla (and
> > other apps the user run at the previous boot) with fadvise() at boot
> > time? This sounds like the most reasonable option.
> That's a slightly different usecase. I'd rather have all large apps
> startup as efficiently as possible without any hacks. Though until we
> get there, we'll be using all of the hacks we can.
Boot time user space readahead can do better than kernel heuristic
readahead in several ways:
- it can collect better knowledge on which files/pages will be used
which lead to high readahead hit ratio and less cache consumption
- it can submit readahead requests for many files in parallel,
which enables queuing (elevator, NCQ etc.) optimizations
So I won't call it dirty hack :)
> > As for the kernel readahead, I have a patchset to increase default
> > mmap read-around size from 128kb to 512kb (except for small memory
> > systems). This should help your case as well.
> Yes. Is the current readahead really doing read-around(ie does it read
> pages before the one being faulted)? From what I've seen, having the
Sure. It will do read-around from current fault offset - 64kb to +64kb.
> dynamic linker read binary sections backwards causes faults.
There are too many data in
Can you show me the relevant lines? (wondering if I can ever find such lines..)
> >>> Current Situation:
> >>> The dynamic linker mmap()s executable and data sections of our
> >>> executable but it doesn't call madvise().
> >>> By default page faults trigger 131072byte reads. To make matters worse,
> >>> the compile-time linker + gcc lay out code in a manner that does not
> >>> correspond to how the resulting executable will be executed(ie the
> >>> layout is basically random). This means that during startup 15-40mb
> >>> binaries are read in basically random fashion. Even if one orders the
> >>> binary optimally, throughput is still suboptimal due to the puny readahead.
> >>> IO Hints:
> >>> Fortunately when one specifies madvise(WILLNEED) pagefaults trigger 2mb
> >>> reads and a binary that tends to take 110 page faults(ie program stops
> >>> execution and waits for disk) can be reduced down to 6. This has the
> >>> potential to double application startup of large apps without any clear
> >>> downsides.
> >>> Suse ships their glibc with a dynamic linker patch to fadvise()
> >>> dynamic libraries(not sure why they switched from doing madvise
> >>> before).
> > This is interesting. I wonder how SuSE implements the policy.
> > Do you have the patch or some strace output that demonstrates the
> > fadvise() call?
> glibc-2.3.90-ld.so-madvise.diff in
550 Can't open
No such file or directory
OK I give up.
> As I recall they just fadvise the filedescriptor before accessing it.
Obviously this is a bit risky for small memory systems..
> >>> I filed a glibc bug about this at
> >>> http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented
> >>> with his concern about wasting memory resources. What is the impact of
> >>> madvise(WILLNEED) or the fadvise equivalent on systems under memory
> >>> pressure? Does the kernel simply start ignoring these hints?
> >> It will throttle based on memory pressure. In idle situations it will
> >> eat your file cache, however, to satisfy the request.
> >> Now, the file cache should be much bigger than the amount of unneeded
> >> pages you prefault with the hint over the whole library, so I guess the
> >> benefit of prefaulting the right pages outweighs the downside of evicting
> >> some cache for unused library pages.
> >> Still, it's a workaround for deficits in the demand-paging/readahead
> >> heuristics and thus a bit ugly, I feel. Maybe Wu can help.
> > Program page faults are inherently random, so the straightforward
> > solution would be to increase the mmap read-around size (for desktops
> > with reasonable large memory), rather than to improve program layout
> > or readahead heuristics :)
> Program page faults may exhibit random behavior once they've started.
> During startup page-in pattern of over-engineered OO applications is
> very predictable. Programs are laid out based on compilation units,
> which have no relation to how they are executed. Another problem is that
> any large old application will have lots of code that is either rarely
> executed or completely dead. Random sprinkling of live code among mostly
> unneeded code is a problem.
> I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB
> with proper binary layout. Even if one lays out a program wrongly, the
> worst-case pagein pattern will be pretty similar to what it is by default.
That's great. When will we enjoy your research fruits? :)
> But yes, I completely agree that it would be awesome to increase the
> readahead size proportionally to available memory. It's a little silly
> to be reading tens of megabytes in 128kb increments :) You rock for
> trying to modernize this.
Thank you. I guess the 128kb is more than ten years old..
> >>> Also, once an application is started is it reasonable to keep it
> >>> madvise(WILLNEED)ed or should the madvise flags be reset?
> >> It's a one-time operation that starts immediate readahead, no permanent
> >> changes are done.
> > Right. The kernel regard WILLNEED as a readahead request from userspace.
> >>> Perhaps the kernel could monitor the page-in patterns to increase the
> >>> readahead sizes? This may already happen, I've noticed that a handful of
> >>> pagefaults trigger> 131072bytes of IO, perhaps this just needs tweaking.
> >> CCd the man :-)
> > Thank you :)
> > Cheers,
> > Fengguang
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/