Re: [RFC] - Some notions that I would like comments on

Chuck Lever (cel@monkey.org)
Sun, 18 Jul 1999 14:37:35 -0400 (EDT)


On Sat, 17 Jul 1999, Stephen C. Tweedie wrote:
> > so, if you want a page fault to trigger the next cluster too, a way of
> > doing that easily with the current code base is to schedule all the reads
> > for the current cluster, then schedule all the reads for the next cluster,
> > then wait for the requested page. that's almost identical to doubling the
> > cluster size.
>
> Yes. This is exactly what you'd want to optimise MAP_SEQUENTIAL maps.
>
> > however, if the cluster size is 128k, and the requested page is in the
> > second half of the cluster, then you've "read behind."
>
> Not quite. We don't trigger the cluster readaround unless the page is
> missing from the cache altogether. For sequential accesses, we
> basically guarantee that all accesses to the rest of the cluster are in
> cache once the first page fault in the cluster has been taken. The IO
> may not be complete yet, but the cache data structures will be in place.
> In other words, the clustering readaround will trigger a readbehind for
> sequential accesses.

right - i should have specified that the "readbehind" has negative effects
for random accesses, but is not much of an issue for sequential accesses
of an mmap'd file.

the change to make the kernel read the next cluster during a page fault is
actually very simple. in mm/filemap.c, filemap_nopage():

no_cached_page:
	/*
	 * Try to read in an entire cluster at once.
	 */
	reada = offset;
	reada >>= PAGE_CACHE_SHIFT + page_cluster;
	reada <<= PAGE_CACHE_SHIFT + page_cluster;

	for (i = 1 << (page_cluster + 1);	/* was: 1 << page_cluster */
	     i > 0; --i, reada += PAGE_CACHE_SIZE)
		new_page = try_to_read_ahead(file, reada, new_page);
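to put a number on the read-behind for a random fault, here's a quick
user-space sketch of the same alignment arithmetic. the values are
illustrative only (4k pages, page_cluster of 5, i.e. the 128k cluster
from the discussion above), not what any particular kernel uses:

#include <stdio.h>

#define PAGE_CACHE_SHIFT	12

int main(void)
{
	unsigned long page_cluster = 5;	/* 32 pages = 128k */
	unsigned long offset = 0x7c000;	/* fault on page 28 of its cluster */
	unsigned long reada = offset;

	/* align down to the cluster boundary, as filemap_nopage() does */
	reada >>= PAGE_CACHE_SHIFT + page_cluster;
	reada <<= PAGE_CACHE_SHIFT + page_cluster;

	printf("cluster starts at 0x%lx\n", reada);		/* 0x60000 */
	printf("pages scheduled behind the fault: %lu\n",	/* 28 */
	       (offset - reada) >> PAGE_CACHE_SHIFT);
	return 0;
}

for a random-access workload, those 28 pages behind the fault are likely
wasted i/o.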

now, this doesn't do anything special for MAP_SEQUENTIAL mmaps, but it
gives you enough to test with. i've found that it helps performance
somewhat on a 128M server with SCSI disks, but hinders performance
significantly on a 64M laptop with an IDE disk (no on-disk cache).

aside: many disk drives i've encountered have a track cache in the 35-50K
per-track range. i wonder (aloud) whether a 32K cluster size (page_cluster
of 3 with 4k pages) might make better use of the disk's track cache.

anyway, the following might be more interesting, since it preserves
existing sequential readahead behavior while limiting read-behind during
random accesses:

no_cached_page:
	/*
	 * Try to read in an entire cluster at once.
	 */
	reada = offset;
	reada >>= PAGE_CACHE_SHIFT + page_cluster - 2;	/* "- 2" is new */
	reada <<= PAGE_CACHE_SHIFT + page_cluster - 2;	/* "- 2" is new */

	for (i = 1 << page_cluster; i > 0; --i, reada += PAGE_CACHE_SIZE)
		new_page = try_to_read_ahead(file, reada, new_page);

this snippet aligns cluster page-ins to quarter-cluster boundaries, which
seems to work a little better than page- or cluster-aligned read-ins.
i've been testing with a benchmark suite configured to take up almost all
available memory. the number of page-ins required to run the benchmark a
second time (that is, after the caches are warm) is significantly lower
than with the stock kernel, which i take to mean the number of unnecessary
page-ins has been reduced.
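here's a similar sketch of why quarter-cluster alignment limits
read-behind. again the values are hypothetical (4k pages, page_cluster
of 4, so a 16-page, 64k cluster):

#include <stdio.h>

#define PAGE_CACHE_SHIFT	12

static unsigned long align_down(unsigned long off, unsigned long shift)
{
	return (off >> shift) << shift;
}

int main(void)
{
	unsigned long page_cluster = 4;
	unsigned long offset = 0x3b000;	/* faulting file offset */
	unsigned long full, quarter;

	full = align_down(offset, PAGE_CACHE_SHIFT + page_cluster);
	quarter = align_down(offset, PAGE_CACHE_SHIFT + page_cluster - 2);

	/* cluster alignment can leave up to 15 of the 16 pages behind
	 * the fault; quarter alignment caps read-behind at 3 pages. */
	printf("cluster-aligned start 0x%lx, %lu pages behind\n",
	       full, (offset - full) >> PAGE_CACHE_SHIFT);	/* 0x30000, 11 */
	printf("quarter-aligned start 0x%lx, %lu pages behind\n",
	       quarter, (offset - quarter) >> PAGE_CACHE_SHIFT);	/* 0x38000, 3 */
	return 0;
}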

one might also note that this "for" loop will happily continue past the
end of the file. this is safe, since try_to_read_ahead simply returns if
the request is beyond end of file, but really, the loop should stop as
soon as reada passes inode->i_size. likewise in generic_file_readahead.
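a minimal sketch of what that bound might look like (untested, and it
assumes the "inode" pointer that filemap_nopage() already derives from
file->f_dentry earlier in the function):

	/* stop the cluster read-in at end of file instead of relying
	 * on try_to_read_ahead to reject each out-of-range request */
	for (i = 1 << page_cluster;
	     i > 0 && reada < inode->i_size;
	     --i, reada += PAGE_CACHE_SIZE)
		new_page = try_to_read_ahead(file, reada, new_page);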

there are several other more significant optimizations that can be made
here. i can post a patch against late 2.3 if anyone's interested.

- Chuck Lever

--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/
