Re: [PATCH] mm: readahead: do not cap readahead() and MADV_WILLNEED

From: Johannes Weiner
Date: Mon Feb 29 2016 - 14:42:14 EST

On Tue, Feb 23, 2016 at 06:34:59PM -0800, Linus Torvalds wrote:
> Why do you think that "Just do what the user asked for" is obviously
> the right thing?

In our situation, we are trying to prime the cache for what we know
will be the definite workingset of the application. We don't care if
it maxes out the IO capacity, and we don't care if it throws out any
existing cache to accomplish the task. In fact, if you're sure about
the workingset, that is desired behavior. It's basically read(), but
without the pointless copying and waiting for completion.
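
For reference, the kind of call I mean looks roughly like the sketch
below. prime_with_readahead() is a made-up helper name, and covering
the whole file from offset 0 is an assumption for illustration; a real
caller would pass whatever ranges it knows it needs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* readahead() is Linux-specific */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to pull all of @path into the
 * page cache. Returns 0 if the request was accepted, -1 on error with
 * errno set. Nothing is copied to userspace.
 */
int prime_with_readahead(const char *path)
{
	struct stat st;
	int fd, ret, saved_errno;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (fstat(fd, &st) < 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}

	/* One request for the whole file, starting at offset 0. */
	ret = readahead(fd, 0, (size_t)st.st_size) < 0 ? -1 : 0;

	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return ret;
}
```

Note that a 0 return only says the request was accepted. With the cap
under discussion, it does not mean the whole range was actually read.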

One of the mistakes I made was to look only at the manpage, and not at
how readahead() is or has historically been used in the field.

One such usecase is warming the system during bootup, where system
software fires off readahead against all manner of libraries and
executables that are likely to be used. In that scenario the caller
really doesn't know for sure it's reading the right thing. And if not,
the optimistic readahead shouldn't vacuum up all the resources and
interfere with the IO and memory demands of the *actual* workingset.

It seems that the optimistic readahead during bootup is being phased
out nowadays. Systemd took over with systemd-readahead, then dropped
it eventually citing lack of desired performance benefits and
relevance; there is another project called preload but it appears
defunct as well. For all we know, though, there still are people who
fire off optimistic readahead, and we can't regress them: certainly
older or customized userspace still running bootup readahead, or maybe
comparable applications where workingsets are estimated by heuristics.

It's unfortunate, because I frankly doubt we ever got the "else" part,
the not-interfering-with-the-real-workload part, working anyway. The
fact that distros are moving away from it or that we ended up limiting
the window to near-ineffective levels seems to be a symptom of that.
That means the facility is now stuck somewhere in between questionable
for optimistic readahead and not useful for reliable cache priming.

We can't really make it work for both cases as their requirements are
in direct conflict with each other. Lowering the limit from cache+free
to 128k was a regression for priming a big known workingset, but there
is also no point in going back now and risking a regression on the
other side.

So it appears best to add a new syscall with clearly defined semantics
to forcefully prime the cache.

That, or switch to read() from a separate thread for cache priming.
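
A minimal sketch of that second option, assuming a throwaway 1MB
buffer and one helper thread per file. The names prime_with_read(),
prime_in_background() and PRIME_CHUNK are invented for illustration,
not an existing interface.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define PRIME_CHUNK (1 << 20)	/* arbitrary 1MB scratch buffer */

/*
 * Read @path front to back and throw the data away, so that it ends
 * up in the page cache. Returns the number of bytes read, or -1 on
 * error.
 */
long long prime_with_read(const char *path)
{
	long long total = 0;
	ssize_t n;
	char *buf;
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	buf = malloc(PRIME_CHUNK);
	if (!buf) {
		close(fd);
		return -1;
	}

	for (;;) {
		n = read(fd, buf, PRIME_CHUNK);
		if (n < 0 && errno == EINTR)
			continue;	/* interrupted, just retry */
		if (n <= 0)
			break;		/* EOF or hard error */
		total += n;
	}

	free(buf);
	close(fd);
	return n < 0 ? -1 : total;
}

static void *prime_thread(void *arg)
{
	prime_with_read(arg);
	return NULL;
}

/*
 * Start priming in the background so the caller does not wait for the
 * IO. @path must stay valid until the thread has finished. Returns 0
 * or a pthread_create() error number.
 */
int prime_in_background(pthread_t *thread, const char *path)
{
	return pthread_create(thread, NULL, prime_thread, (void *)path);
}
```

The caller can pthread_join() the thread once it actually needs the
data, or detach it and carry on. The copy into the scratch buffer is
still there, of course; this only moves the waiting out of the main
thread.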