Re: O_DIRECT question

From: Phillip Susi
Date: Mon Jan 22 2007 - 11:18:01 EST


Denis Vlasenko wrote:
The difference is that you block exactly when you try to access
data which is not there yet, not sooner (potentially much sooner).

If application (e.g. database) needs to know whether data is _really_ there,
it should use aio_read (or something better, something which doesn't use signals.
Do we have this 'something'? I honestly don't know).

The application _IS_ using aio, which is why it can go and perform other work while it waits to be told that the read has completed. This is not possible with mmap because the task is blocked while faulting in pages, and unless it tries to access the pages, they won't be faulted in.

In some cases, evne this is not needed because you don't have any other
things to do, so you just do read() (which returns early), and chew on
data. If your CPU is fast enough and processing of data is light enough
so that it outruns disk - big deal, you block in page fault handler
whenever a page is not read for you in time.
If CPU isn't fast enough, your CPU and disk subsystem are nicely working
in parallel.

Being blocked in the page fault handler means the cpu is now idle because you can't go chew on data that _IS_ in core. The aio + O_DIRECT allows you to control when IO is started rather than rely on the kernel to decide when is a good time for readahead, and to KNOW when that IO is done so you can chew on the data.

With O_DIRECT, you alternate:
"CPU is idle, disk is working" / "CPU is working, disk is idle".

You have this completely backwards. With mmap this is what you get because you chew data, page fault... chew data... page fault...

What do you want to do on I/O error? I guess you cannot do much -
any sensible db will shutdown itself. When your data storage
starts to fail, it's pointless to continue running.

Ever hear of error recovery? A good db will be able to cope with one or two bad blocks, or at the very least continue operating the other tables or databases it is hosting, or flush transactions and switch to read only mode, or any number of things other than abort().

You do not need to know which read() exactly failed due to bad disk.
Filename and offset from the start is enough. Right?

So, SIGIO/SIGBUS can provide that, and if your handler is of
void (*sa_sigaction)(int, siginfo_t *, void *);
style, you can get fd, memory address of the fault, etc.
Probably kernel can even pass file offset somewhere in siginfo_t...

Sure... now what does your signal handler have to do in order to handle this error in such a way as to allow the one request to be failed and the task to continue handling other requests? I don't think this is even possible, yet alone clean.

You can still be multithreaded. The point is, with O_DIRECT
you _are forced_ to_ be_ multithreaded, or else perfomance will suck.

Or use aio. Simple read/write with the kernel trying to outsmart the application is nice for very simple applications, but it does not provide very good performance. This is why we have aio and O_DIRECT; because the application can manage the IO better than the kernel because it actually knows what it needs and when.

Yes, the application ends up being more complex, but that is the price you pay. You simply can't get it perfect in a general purpose kernel that has to guess what the application is really trying to do.

You think "Oracle". But this application may very well be
not Oracle, but diff, or dd, or KMail. I don't want to care.
I want all big writes to be efficient, not just those done by Oracle.
*Including* single threaded ones.

Then redesign those applications to use aio and O_DIRECT. Incidentally I have hacked up dd to do just that and have some very nice performance numbers as a result.

Well, I too currently work with Oracle.
Apparently people who wrote damn thing have very, eh, Oracle-centric
world-view. "We want direct writes to the disk. Period." Why? Does it
makes sense? Are there better ways? - nothing. They think they know better.

Nobody has shown otherwise to date.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/