Block device interface could use some work

Theodore Ts'o (tytso@mit.edu)
Mon, 4 Nov 1996 09:30:35 -0500


I'm currently writing a tutorial for the USENIX/USELINUX conference,
"How to write device drivers for Linux". As a result, I've had to dive
into kernel abstractions which I haven't seriously looked at since, oh,
the 1.0/1.2 kernel days.

After looking at the code (fortunately, I had not just eaten), I've come
to the conclusion that the block driver code could use some serious
rototilling. Unlike the tty or SCSI abstraction, there are
device-dependent sections of code in blk.h, ll_rw_blk.c, and other
routines spread across the block device abstraction. The abstraction
boundary itself is not particularly clean, as the SCSI devices, the
CD-ROM devices, and the IDE device drivers feel free to violate the
abstraction boundary and muck with the request queue directly,
for performance reasons. (What this probably means is that some of the
things they are trying to do should be incorporated into the blk.h
abstraction.) As a final example, end_request() is a static function
defined in blk.h, so a separate copy is compiled into each device
driver.

More seriously, the block driver interface is extremely fragile in the
face of device errors. Take a look at the definition of end_request()
in blk.h; if there is an error, there is an (incorrect) attempt to try
to adjust the request parameters, which is absolutely pointless since
the current request will then be thrown away at the end of this function.

	if (!uptodate) {
		printk("end_request: I/O error, dev %s, sector %lu\n",
		       kdevname(req->rq_dev), req->sector);
		/* XXX this is all pointless code XXX */
		req->nr_sectors--;
		req->nr_sectors &= ~SECTOR_MASK;
		req->sector += (BLOCK_SIZE / 512);
		req->sector &= ~SECTOR_MASK;
	}

Later in the function, all the buffers associated with this request have
their uptodate flag set to false, indicating that the block was not
read in due to a device error.

Unfortunately (at least for this analysis), the high-level code tries to
merge adjacent requests for performance reasons. Hence, a request may
span a large number of blocks. If there is a bad block in the middle
of a multi-block request, all of the blocks in that request will be
marked as being in error, even ones which (a) were in fact successfully
read, or (b) could have been successfully read if the device driver had
actually tried to read them.
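
To make the difference concrete, here is a small user-space sketch (not
kernel code) comparing the two completion policies. Everything in it is
a stand-in I made up for illustration: struct sim_request,
sim_read_block(), complete_whole_request() and complete_per_block() do
not exist in the kernel, and the "device" is simply assumed to fail on
exactly one block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 16

/* Hypothetical stand-in for a merged request: a run of adjacent blocks. */
struct sim_request {
	size_t nr_blocks;            /* blocks covered by the merged request */
	bool   uptodate[MAX_BLOCKS]; /* per-block result, like a buffer's uptodate flag */
};

/* Hypothetical device: every block reads fine except bad_block. */
static bool sim_read_block(size_t block, size_t bad_block)
{
	return block != bad_block;
}

/*
 * Policy A (the behaviour described above): read the blocks in order,
 * stop at the first error, and mark every block in the request with
 * the same result. Returns the number of blocks left marked uptodate.
 */
size_t complete_whole_request(struct sim_request *req, size_t bad_block)
{
	bool ok = true;
	size_t i;

	assert(req->nr_blocks <= MAX_BLOCKS);
	for (i = 0; i < req->nr_blocks && ok; i++)
		ok = sim_read_block(i, bad_block);
	for (i = 0; i < req->nr_blocks; i++)
		req->uptodate[i] = ok;
	return ok ? req->nr_blocks : 0;
}

/*
 * Policy B (one possible alternative): attempt every block and record
 * each result separately, so only the bad block is marked in error.
 */
size_t complete_per_block(struct sim_request *req, size_t bad_block)
{
	size_t good = 0;
	size_t i;

	assert(req->nr_blocks <= MAX_BLOCKS);
	for (i = 0; i < req->nr_blocks; i++) {
		req->uptodate[i] = sim_read_block(i, bad_block);
		if (req->uptodate[i])
			good++;
	}
	return good;
}
```

With an 8-block request whose block 3 is bad, complete_whole_request()
leaves no block usable, while complete_per_block() leaves 7 of the 8
marked uptodate. A real driver would of course have to weigh this
against the cost of splitting or retrying the request.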

The bottom line is that the Linux block driver abstraction is screaming
for a re-write. I currently have too many Linux projects on my plate
(including the aforementioned USELINUX tutorial which caused me to
notice this problem in the first place), so I'm settling for writing a
note to linux-kernel pointing out the problem. Perhaps some of the
people who are currently maintaining block devices would be willing to
take a look at this? Linus? What do you think?

- Ted