Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue

From: Dave Chinner
Date: Sat Dec 15 2012 - 22:36:01 EST


On Sun, Dec 16, 2012 at 03:04:42AM +0000, Eric Wong wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Sat, Dec 15, 2012 at 12:54:48AM +0000, Eric Wong wrote:
> > > Applications streaming large files may want to reduce disk spinups and
> > > I/O latency by performing large amounts of readahead up front.
> > > Applications also tend to read files soon after opening them, so waiting
> > > on a slow fadvise may cause unpleasant latency when the application
> > > starts reading the file.
> > >
> > > As a userspace hacker, I'm sometimes tempted to create a background
> > > thread in my app to run readahead(). However, I believe doing this
> > > in the kernel will make life easier for other userspace hackers.
> > >
> > > Since fadvise makes no guarantees about when (or even if) readahead
> > > is performed, this change should not hurt existing applications.
> > >
> > > "strace -T" timing on an uncached, one gigabyte file:
> > >
> > > Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832>
> > > After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>
> >
> > You've basically asked fadvise() to readahead the entire file if it
> > can. That means it is likely to issue enough readahead to fill the
> > IO queue, and that's where all the latency is coming from. If all
> > you are trying to do is reduce the latency of the first read, then
> > only readahead the initial range that you are going to need to read...
>
> Yes, I do want to read the whole file, eventually. So I want to put
> the file into the page cache ASAP and allow the disk to spin down.

Issuing readahead is not going to speed up the first read. Either
you spend more time issuing all the readahead, or you block
waiting for the first read to complete. And the way you are issuing
readahead does not guarantee the entire file is brought into the
page cache....

> But I also want the first read() to be fast.

You can't have a pony, sorry.

> > Also, pushing readahead off to a workqueue potentially allows
> > someone to DOS the system because readahead won't ever get throttled
> > in the syscall context...
>
> Yes, I'm a little worried about this, too.
> Perhaps squashing something like the following will work?
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 56a80a9..51dc58e 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -246,16 +246,18 @@ void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
> {
> struct wq_ra_req *req;
>
> + nr_to_read = max_sane_readahead(nr_to_read);
> + if (!nr_to_read)
> + goto skip_ra;

You do realise that anything you read ahead will be accounted as
inactive pages, so nr_to_read doesn't decrease at all as you fill
memory with readahead pages...

> +
> req = kzalloc(sizeof(*req), GFP_ATOMIC);

GFP_ATOMIC? Really?

In reality, I think you are looking in the wrong place to fix your
"first read" latency problem. No matter what you do, there is going
to be IO latency on the first read. And readahead doesn't guarantee
that the pages are brought into the page cache (ever heard of
readahead thrashing?) so the way you are doing your readahead is not
going to result in you being able to spin the disk down after
issuing a readahead command...

You've really got two problems - minimal initial latency, and
reading the file quickly and pinning it in memory until you get
around to needing it. The first can't be made faster by using
readahead, and the second cannot be guaranteed by using readahead.

IOWs, readahead is the wrong tool for solving your problems. Minimal
IO latency from the first read will come from just issuing pread()
after open(), and ensuring that the file is read quickly and pinned
in memory can really only be done by allocating RAM in the
application to hold it until it is needed....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/