Re: "raw" block devices?

Ingo Molnar (mingo@pc5829.hil.siemens.at)
Thu, 17 Oct 1996 19:10:53 +0100 (MET)


On Thu, 17 Oct 1996, Linus Torvalds wrote:

> > Also this allows eg database systems to be given a slice of disk which they
> > are in complete control of, and can maybe manage better than the normal
> > buffering (known access patterns etc).
>
> That's a theory I don't subscribe to myself.
>
> Sure, there are old-fashioned databases that think they can do a better job
> of it than the kernel does. They are usually wrong, I suspect. They are using
> raw devices more for historical reasons than anything else, and they could
> just as well use a filesystem.

[yes, raw devices are a hack, but RDBMS people still use them because:]

one not-so-obvious problem is that an RDBMS >has< to implement a
write cache for itself. Thus if the block device were buffered in the
kernel too [as it is now], we would have double buffering.

The kernel write cache spontaneously writes data to the device, and this is
not good for an RDBMS: it has to be sure that cached data first reaches
the log, and only then the actual database. If the kernel provided
such functionality, RDBMSs could use the kernel buffering
efficiently.
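
The log-before-database ordering described above can be sketched in user
space roughly as follows. This is only an illustration under stated
assumptions: wal_update() and write_all_at() are hypothetical names, the log
and the database are ordinary file descriptors, and fsync() is assumed to
really push the log to stable storage. It is not how Oracle or any other
particular RDBMS does it.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf at offset off; returns 0 on success, -1 on error. */
static int write_all_at(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/*
 * Hypothetical helper: apply one update with log-before-data ordering.
 * Returns 0 on success, -1 on error.
 */
int wal_update(int log_fd, int db_fd, const void *rec, size_t len, off_t db_off)
{
    assert(rec != NULL && len > 0);

    /* 1. Append the record to the end of the log. */
    off_t log_end = lseek(log_fd, 0, SEEK_END);
    if (log_end < 0 || write_all_at(log_fd, rec, len, log_end) < 0)
        return -1;

    /* 2. Wait until the log record has been flushed. */
    if (fsync(log_fd) < 0)
        return -1;

    /* 3. Only now touch the database itself. */
    return write_all_at(db_fd, rec, len, db_off);
}
```

Note that the ordering is enforced entirely by the application here: the
kernel is free to flush the database write whenever it likes, but by then
the log record is already safe.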

So RDBMSs like Oracle just do the following: SYSV shared memory as a
write cache, raw devices as the database. Files >can< be the database too,
but in that case we have double buffering (which isn't too bad thanks to
brilliant RDBMS designs, but which still results in slightly better
performance figures for raw devices). Oracle doesn't mmap() files directly
because it theoretically CAN'T: it has to guarantee transaction protection.

If we could guarantee a database that a page won't be written out unless
the database server wants it to be ... then using mmap() for a database
would be a much cleaner design [and would probably be faster too].
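
For illustration, an mmap()-based store might look like the sketch below.
db_map() and db_store_and_commit() are hypothetical names, and the file is
assumed to be a regular file that may be resized. msync() with MS_SYNC lets
the server force a write-out at a moment of its choosing, but nothing in
this interface stops the kernel from writing the dirty page earlier, which
is exactly the missing guarantee.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: size the file to len bytes and map it shared and
 * writable. Returns the mapping, or NULL on error.
 */
void *db_map(int fd, size_t len)
{
    /* Make sure the file covers the whole mapping. */
    if (ftruncate(fd, (off_t)len) < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: modify bytes in the mapping, then force them out.
 * Returns 0 on success, -1 on error.
 */
int db_store_and_commit(void *base, size_t map_len, size_t off,
                        const void *src, size_t n)
{
    assert(off + n <= map_len);

    /* The page becomes dirty as soon as we store into it. */
    memcpy((char *)base + off, src, n);

    /*
     * Between the memcpy() above and this msync(), the kernel may already
     * have written the dirty page to the file on its own; msync() only
     * guarantees the write-out has happened by the time it returns.
     */
    return msync(base, map_len, MS_SYNC);
}
```

So the server can say "write this now", but it has no way to say "do not
write this yet", and a transaction log needs the second.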

[i don't know how well mlock() is suited to this task ... but i suspect
it's meant for preventing paging, not for implementing an RDBMS-type
write cache]

Ingo