Re: Suggestion for O_DIRECT and friends

Jamie Lokier (lkd@tantalophile.demon.co.uk)
Mon, 21 Dec 1998 15:17:57 +0000


On Sun, Dec 20, 1998 at 04:13:24PM +0100, Arjan van de Ven wrote:
> O_Direct
> Hints the kernel that caching and read-ahead for requests to this
> file are not a good idea.
> * All files that you need only once (configuration files)

You probably _do_ want readahead for these, if they're of any
significant size, so that the application doesn't block on reading.

The existing caching strategies are probably fine for config files read
once, too. If not, I suggest a two-level cache status, like the old
directory name cache.
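
Roughly what I have in mind, as a user-space toy (the names and the
structure are made up for illustration, not taken from any existing
kernel code): pages sit in a "once-used" level until they are
referenced a second time, so read-once data is the first to go when
memory gets tight.

    #include <stdio.h>

    enum level { ONCE_USED, KEPT };

    struct page {
        int id;
        enum level level;
    };

    /* Called whenever a cached page is referenced. */
    static void touch(struct page *p)
    {
        if (p->level == ONCE_USED)
            p->level = KEPT;        /* second reference: worth keeping */
    }

    /* Called when memory is short: prefer once-used pages as victims. */
    static struct page *pick_victim(struct page *pages, int n)
    {
        int i;

        for (i = 0; i < n; i++)
            if (pages[i].level == ONCE_USED)
                return &pages[i];
        return &pages[0];           /* nothing read-once left: evict kept */
    }

    int main(void)
    {
        struct page pages[3] =
            { {0, ONCE_USED}, {1, ONCE_USED}, {2, ONCE_USED} };

        touch(&pages[1]);           /* page 1 is referenced a second time */
        printf("victim: page %d\n", pick_victim(pages, 3)->id); /* page 0 */
        return 0;
    }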

> O_Serialized
> Hints the kernel that the file is read in a serial/linear way,
> and that caching old results isn't useful, but that Read-ahead,
> Read-behind and Write-clustering are.
>
> Possible usage:
> * Multimedia applications (video-streaming, MP3-playing)

A good read-ahead implementation will catch on to this very quickly by
itself. Discarding the pages immediately after use is trickier... Then
again, how can an app be absolutely sure you're not about to stop it and
restart it from the beginning?
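
For reference, an application can already express most of
"O_Serialized" for itself if an fadvise-style call is available; the
sketch below uses posix_fadvise(), which is not one of the flags
proposed above, and the error handling is stripped down:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    #define CHUNK (64 * 1024)

    int stream_file(const char *path)
    {
        char buf[CHUNK];
        off_t done = 0;
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;

        /* Linear access, so aggressive read-ahead is worthwhile. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* ... decode/play buf here ... */

            /* We promise not to reuse the old data; let it go. */
            posix_fadvise(fd, done, n, POSIX_FADV_DONTNEED);
            done += n;
        }

        close(fd);
        return 0;
    }

Whether the DONTNEED half is an improvement comes back to the
stop-and-restart question above; the hint is advisory and the kernel
is free to ignore it.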

> * tar, gzip and others

We already have write-behind and read-ahead. I'm not sure how clever
file write clustering is yet.

> * grep

Again, readahead is pretty good; perhaps it could be improved.
I see no need to tell the kernel about something it could recognise itself.

> * Reading configuration-files
>
> Kernels view:
> Read:
> Read-ahead is a good idea, so it can read-ahead more than usual.

For most calls, there's no advantage in reading ahead further than is
required to avoid blocking the read(); any more than that is just a
waste of memory.

The kernel ought to have a self-tuning algorithm to do this kind of
read-ahead. I don't know if it does for read() yet. I do know it
doesn't do mmap() read-ahead, and the algorithms that have been
suggested for it should work very well.
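
To be concrete about "self-tuning", here is a toy model of the sort of
thing I mean (this is not what any kernel does today; the constants
and units are invented): the window grows while reads stay sequential
and collapses on a seek, so it settles near whatever is needed to keep
read() from blocking.

    #include <stdio.h>

    #define RA_MIN 4        /* initial/minimum window, in blocks */
    #define RA_MAX 128      /* cap, so we don't waste memory */

    struct ra_state {
        unsigned long next_block;   /* block we expect to see next */
        unsigned long window;       /* current read-ahead window */
    };

    /* Returns how many blocks to read ahead for this request. */
    static unsigned long ra_update(struct ra_state *ra, unsigned long block)
    {
        if (block == ra->next_block) {
            ra->window *= 2;        /* sequential hit: get more aggressive */
            if (ra->window > RA_MAX)
                ra->window = RA_MAX;
        } else {
            ra->window = RA_MIN;    /* seek: assume random access, back off */
        }
        ra->next_block = block + 1;
        return ra->window;
    }

    int main(void)
    {
        struct ra_state ra = { 0, RA_MIN };
        unsigned long b;

        for (b = 0; b < 6; b++)     /* a purely sequential reader */
            printf("block %lu -> read ahead %lu blocks\n",
                   b, ra_update(&ra, b));
        return 0;
    }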

> Pages that have been sent to the appl. can be freed immediately.

Both the application and the kernel can guess at this, but neither can
be sure in most cases. No media streaming program can be sure you don't
want to use the data again, though you might trust the hint a bit more.

My feeling is the kernel can guess this stuff very well, though maybe it
isn't that refined yet.

> Write:
> Cluster writes; once written to disk, free the pages immediately.

I don't agree that discarding data just because one app predicts it
won't be required is the right thing. However, discarding it _in
preference_ to other data read by the same app would make sense.

> O_Entirely
> Hints the kernel that the entire contents in the file will be
> used. This way, read-ahead is definitely a good idea, and if
> the file is "small", it can be read entirely and asynchronously on
> fopen, even before any data is actually read. This can make the
> reads effectively "non-blocking"
>
> Possible usage:
> * in a C-compiler: .c and .h files
> * opening shell-scripts
> * in a webserver for documents that are to be sent
> * all _files_ with the sticky-bit

Good read-ahead will manage these nicely after it's got the hang of it.
For cases where you want to open and then read lots of files, you'll get
better results by making your application multi-threaded instead.
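
By multi-threaded I mean something like the sketch below (pthreads;
the file names are placeholders): each file gets its own thread, so
one thread blocking on the disk doesn't stall the rest, and the device
always has several requests to work on.  Build with -lpthread.

    #include <pthread.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void *slurp(void *arg)
    {
        const char *path = arg;
        char buf[65536];
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return NULL;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;                       /* ... hand buf to the real work ... */
        close(fd);
        return NULL;
    }

    int main(void)
    {
        /* Placeholders; imagine a compiler's list of .c and .h files. */
        static const char *files[] = { "a.c", "b.h", "c.h" };
        pthread_t tid[3];
        int i;

        for (i = 0; i < 3; i++)
            pthread_create(&tid[i], NULL, slurp, (void *) files[i]);
        for (i = 0; i < 3; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }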

Perhaps there is a case for background read-ahead-when-device-idle of
newly opened files.

> Kernels view:
> On a fopen, all the blocks (for a "small" file) are scheduled to
> be read from "disk". For large files, only the first N blocks are
> scheduled.
> On a read, the blocks already are in memory (hopefully). For a large
> file, if more than M blocks are read in so far, the next M blocks
> are read ahead pre-emptively.

This is yet more read ahead, which read() should handle fine once it has
caught on. (The read ahead algorithms may need a little refinement).

Your suggestions only save time between open() and read() if the
application does other things in between. IMO, that is quite unusual:
most apps just open() then read() immediately. The others could do
their thing using a thread to stimulate read ahead.
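
(And if an fadvise-style call is available, the "read the whole file
at open time" effect doesn't even need the thread.  The sketch below
uses posix_fadvise(), which is not one of the proposed flags; on Linux
a length of 0 means "to the end of the file", and the call only queues
the read-ahead rather than waiting for the data.)

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    int open_entirely(const char *path)
    {
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;

        /* Start reading the whole file into the page cache and return
         * at once; the later read() calls should rarely block. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        return fd;
    }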

> Zero-copy TCP/IP
> It is possible to have a zero-copy TCP/IP stack, at least up to
> a certain extent. Don't ask me how;
> I know this because a fellow student has implemented this in an
> "embedded TCP/IP" device [aka settop-box].

We know. However, it's not clear whether this is efficient in conjunction
with user-space MMU mappings and the standard socket API. Remapping
pages has its own overheads, and is not well suited to the usual
read()/write()/recv()/send(), unless everything is page aligned and page
sized. Receiving is particularly problematic.

On the bright side, zero-copy TCP/IP is already available in the kernel
itself. If you have one of those embedded applications, you can run it
in kernel mode.
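
For the common webserver-style case where the data starts life in a
file, there is also a middle ground between the socket API and going
fully in-kernel: an interface like sendfile(), sketched below, hands
the whole transfer to the kernel (the socket is assumed to be already
connected, and error handling is minimal).

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int send_whole_file(int sock, const char *path)
    {
        struct stat st;
        off_t off = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        /* The data moves from the page cache to the network without
         * ever passing through user space, so the alignment and page
         * size restrictions above never come up. */
        while (off < st.st_size)
            if (sendfile(sock, fd, &off, st.st_size - off) < 0)
                break;

        close(fd);
        return off == st.st_size ? 0 : -1;
    }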

-- Jamie
