Read-ahead in kernel

J.J. Burgess (92jjb@eng.cam.ac.uk)
Tue, 23 Apr 1996 18:36:24 +0100 (BST)


Swapping & read ahead improvements
----------------------------------

The current kernels have almost no idea about read-ahead. The 'read_ahead'
parameter that is currently settable simply limits the smallest request which
is issued to the lower layers of the block device code (see
linux/fs/block_dev.c); this is the only place 'read_ahead' is used. Also,
although ll_rw_blk.c has a special case for READA/WRITEA to handle read-ahead
and write-ahead requests, these are very rarely ever issued by anything
(floppy.c being one exception), and the code in ll_rw_blk.c will drop them if
there are no free request slots available.
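
To show what I mean, here is a paraphrase (from memory, not verbatim) of the
shape of that special case; get_request() and unlock_buffer() stand in for
whatever ll_rw_blk.c actually calls:

/* An "ahead" request that cannot get a free request slot is dropped
 * rather than sleeping until one becomes available. */
static void make_request_sketch(int major, int rw, struct buffer_head *bh)
{
        struct request *req;
        int rw_ahead = 0;

        if (rw == READA || rw == WRITEA) {
                rw_ahead = 1;
                rw = (rw == READA) ? READ : WRITE;
        }

        req = get_request(major);       /* try for a free request slot */
        if (req == NULL) {
                if (rw_ahead) {         /* ahead request: just give up */
                        unlock_buffer(bh);
                        return;
                }
                /* a real READ/WRITE would sleep here until a slot frees */
        }
        /* ... fill in and queue the request as usual ... */
}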

I don't know if the 'breada()' code is used much, but this is from fs/buffer.c:

/* Request the read for these buffers, and then release them */
ll_rw_block(READ, j, bhlist);

Shouldn't this be READA, since this is read-ahead data?
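
In other words, just change the request type on the ahead buffers:

/* Request the read for these buffers (as read-ahead, so they can be
 * dropped under load), and then release them */
ll_rw_block(READA, j, bhlist);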

To implement read-ahead properly, we need code either in the buffer cache or
in ll_rw_blk to a) recognise patterns and b) fetch this read-ahead data and
put it somewhere.

a> To recognise patterns we could have a static per-device (major+minor)
request usage register. It is fixed and allocated on a per-device basis to
speed up access to it, as it will have to be checked or updated for every
request.
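
Something like the following is what I have in mind; the names and exact
fields are only illustrative (see also the access table described further
down):

#define NR_RA_SLOTS 4   /* few entries: parallel sequential streams are rare */

struct ra_slot {
        unsigned long last_sector;      /* end sector of the last request   */
        unsigned int  last_size;        /* typical (last) request size      */
        unsigned long last_used;        /* timestamp (jiffies) for LRU      */
};

struct ra_state {                       /* one of these per major:minor     */
        struct ra_slot slot[NR_RA_SLOTS];
};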

b> To get the data it can issue READA/WRITEA requests, or requests to the
lower layers as per normal. Putting the data somewhere is (for me) a more
difficult question: can ll_rw_blk grab a few buffer cache areas and fill them
with data? If this was being done in the buffer cache code it might be easier;
a rough sketch of that is below.
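
A minimal sketch of the buffer-cache approach, assuming the usual getblk() /
brelse() / ll_rw_block() interfaces (details from memory, so they may not
match exactly): grab cache buffers for the blocks we expect to be wanted next
and submit them as READA, so ll_rw_blk is free to drop them under load.

static void speculative_read(kdev_t dev, int first_block, int count, int size)
{
        struct buffer_head *bhlist[8];
        int i, n = 0;

        for (i = 0; i < count && n < 8; i++) {
                struct buffer_head *bh = getblk(dev, first_block + i, size);
                if (bh == NULL)
                        break;
                if (bh->b_uptodate) {           /* already in the cache */
                        brelse(bh);
                        continue;
                }
                bhlist[n++] = bh;
        }
        if (n > 0)
                ll_rw_block(READA, n, bhlist);
        for (i = 0; i < n; i++)
                brelse(bhlist[i]);      /* data stays in the buffer cache */
}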

An advantage of doing things in ll_rw_blk would be that the same routine
could also be used for swap requests. It is complicated because swap requests
go through a separate routine in ll_rw_blk; this looks as if it is there to
speed things up by going more directly, although it bypasses some of the
optimisation done in ll_rw_block(), and I guess the sleep behaviour is special
too. Wouldn't it be better to do the same request concatenation here as well?
Or is that done to optimise the number of request slots rather than disk
transfer time? Given that the swap system has to fight for request slots with
everything else, might things go more responsively if everything other than
swap had its MAX_REQUESTs reduced a little, effectively guaranteeing swap some
free request slots? Under interactive load a user cares more about getting his
application swapped back into RAM than about whether the process he has just
started in another window can read its data quickly.
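
The slot-reservation idea could be as simple as this (the constants and the
check are only illustrative, not how ll_rw_blk.c actually allocates its
request table):

#define NR_REQUEST      64      /* total request slots (illustrative)   */
#define SWAP_RESERVED    2      /* slots ordinary requests may not take */

/* ordinary READ/WRITE callers stop short of the full table, so a swap
 * request can always find a free slot */
static inline int max_slots(int is_swap)
{
        return is_swap ? NR_REQUEST : NR_REQUEST - SWAP_RESERVED;
}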

One other concern of mine is the exact method of read_ahead. The dumb method
simply gets more data than was asked for and buffers it.
An example is better: suppose a task wants to read blocks 0123456789.

No read_ahead

Request 0
Block..
Disk reads 0
Proceeds...
Request 1
Block...
Disk reads 1
Proceeds...
etc.

Dumb read_ahead (1 block)

Request 0
Block...
Disk reads 01
Proceeds...
Request 1 - ok in cache.
Request 2
Block...
Disk reads 23
Proceeds...
Request 3 - ok in cache.
etc.

This is OK, but to get a big improvement we may need a large read_ahead,
which wastes memory. A 'just in time' method is better:

'Just in time' Read_ahead 1 block

Request 0
Block...
Disk reads 01
Proceeds...
Request 1 - ok in cache, but triggers
Disk read 2 (while still proceeding)
Request 2 - ok in cache, but triggers
Disk read 3 (while still proceeding)
Request 3 - ok in cache, but triggers
etc.

Hence the process never sleeps awaiting data.
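
The trigger for this could be a cache hit on a block that was itself read
ahead: that hit immediately queues the next block as READA, reusing something
like the speculative_read() sketch above (ra_next here is a made-up per-stream
marker, not an existing field):

struct ra_stream {
        kdev_t dev;
        int ra_next;            /* next block to read ahead */
        int size;
};

static void ra_cache_hit(struct ra_stream *s, int block)
{
        if (block + 1 == s->ra_next) {
                /* the reader has caught up with the read-ahead window,
                 * so stay one block in front of it */
                speculative_read(s->dev, s->ra_next, 1, s->size);
                s->ra_next++;
        }
}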

I believe, however, that only the buffer cache layer would be able to do
this kind of speculative read_ahead.
The thing is, this really needs to happen asynchronously to the requests;
a separate kernel thread, 'kreadaheadd' perhaps?
This could simply issue speculative read requests by looking at the current
situation of the drive's requests and free memory.
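
The main loop of such a thread might look roughly like this; the sleep/wakeup
plumbing and the free-memory and queue-depth tests are placeholders, not
existing kernel interfaces:

static int kreadaheadd(void *unused)
{
        struct ra_stream *s;

        for (;;) {
                wait_for_readahead_work();                    /* placeholder */

                while ((s = next_pending_stream()) != NULL) { /* placeholder */
                        if (!enough_free_pages() || queue_busy(s->dev))
                                continue;  /* back off under memory/disk pressure */
                        speculative_read(s->dev, s->ra_next, 2, s->size);
                        s->ra_next += 2;
                }
        }
        return 0;       /* never reached */
}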

The disk access usage tables contain (per major:minor block device):

current sector (end of last request)
typical (last?) request size
last used time

On every request we have to scan the table, which is why access must be quick
and hence why there are only very few entries per device. I guess we don't
often get large sets of sequentially read files in parallel, so maybe 2 - 5
entries would be enough; we only replace entries when they are LRU'ed out.
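
The per-request scan could look like this, using the ra_state/ra_slot layout
sketched earlier (a request that starts where a previous one ended counts as
sequential; otherwise the least recently used slot is recycled):

static int ra_note_request(struct ra_state *st, unsigned long start,
                           unsigned int nr_sectors, unsigned long now)
{
        int i, lru = 0;

        for (i = 0; i < NR_RA_SLOTS; i++) {
                struct ra_slot *s = &st->slot[i];

                if (s->last_sector == start) {          /* sequential hit */
                        s->last_sector = start + nr_sectors;
                        s->last_size = nr_sectors;
                        s->last_used = now;
                        return 1;       /* caller may trigger read-ahead */
                }
                if (s->last_used < st->slot[lru].last_used)
                        lru = i;
        }
        /* no match: recycle the LRU entry for this new stream */
        st->slot[lru].last_sector = start + nr_sectors;
        st->slot[lru].last_size = nr_sectors;
        st->slot[lru].last_used = now;
        return 0;
}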

This still doesn't help a process like a 128M matrix calculation running in
swapped memory; that does need swap read-ahead. To improve it we need
speculative read-in, and also speculative write-out, unless the page being
evicted is 'clean' and so never needs re-writing to swap. At the moment the
response is probably:
repeat {
[Fault page]
[get free page by swapping out page, disk seek, write page]
[disk seek, read new page]
}

What we want to do is increase the size of transfers to more than a single
page. It would be nice if this used the same read-ahead routine as the buffer
cache.
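
In the simplest form the fault path would just bring in a small cluster of
adjacent swap pages per seek; read_swap_page() below is a made-up helper
standing in for whatever the swap-in code really uses, and the swap area is
treated as a simple linear run of page slots for the sake of the example:

#define SWAP_CLUSTER 4          /* pages fetched per fault (illustrative) */

static void swap_in_cluster(unsigned long first_slot)
{
        int i;

        /* one seek, SWAP_CLUSTER pages, instead of one seek per page */
        for (i = 0; i < SWAP_CLUSTER; i++)
                read_swap_page(first_slot + i);         /* made-up helper */
}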

.. . . . . . . . . . . . . . ..
:: : : Jon Burgess 01223-461907 : : ::
:: : jjb1003@cam.ac.uk : : ::
:: : : : : : : : : : : : : : ::