Re: On the subject of the VFS layer (was Re: VFS questions)

Doug Ledford
Sat, 3 May 1997 18:47:35 -0500 (CDT)

On Sat, 3 May 1997, Michael Neuffer wrote:

> On Sat, 3 May 1997, David S. Miller wrote:
> > Date: Sat, 3 May 1997 19:38:32 +1200 (NZST)
> > From: "J. Sean Connell" <>
> >
> > Even though it would require a total overhaul of every filesystem
> > and the VFS layer, it doesn't seem to me that I could cheerfully
> > use Linux in a zero-fault-tolerance environment when the fs code
> > simply disowns the fs when it finds a single measly bad block...
> >
> > The entire VFS layer is going to be overhauled in the near future, and
> > such issues will be considered heavily along the way.
> >
> > (As a side note, in a "zero fault tolerance" environment, you wouldn't
> > buy disks which could ever report bad blocks, you'd use some sort of
> > RAID strategy, all done in hardware, all mirrored and parity checked,
> > where it's "bad block, what's that?")
> Even then you currently run into trouble.
> In case of a failure of an element in a RAID array, your "transaction
> time" gets absolutely undeterminable. The current SCSI code is not
> able to handle this and will basically freak out and start issuing
> aborts and resets. All the logic when a command times out and how to
> handle it (ie. all the strategy decisions) will have to be moved in the
> lowlevel drivers since this is the only place where such decisions can be
> made.

YIKES! Making timeout decisions in the low level driver may make sense for
a hardware RAID controller, but for the rest of the controllers out there,
which treat each drive as a single, separate entity, this is a scary and
needless prospect. Do you realize how much more code it would add if each
driver did these things itself? We already have a significant amount of
bloat in the low level drivers, as each one handles its own queueing
responsibilities, etc. This would simply add more to that.
Then, if you have someone such as myself who routinely has several
different SCSI controllers compiled into their kernel, that code adds up.
Not to mention that when you split things out of the mid level code this
way, you create the inherent problem of each driver having to get this
timeout logic correct on its own. You would no longer have a generic scsi
layer that (supposedly) works; instead you would have "this driver
properly handles timeouts, that driver flakes out when a command times
out," etc.

In truth, the only scenario you are bringing up that needs this kind of
control is on a hardware RAID controller that does some kind of live
rebuild for a lost device. Then you have the situation of having to wait
on commands to complete because the controller is busy rebuilding that
lost drive and handles your commands as it can. For non-hardware-RAID
controllers (the vast majority of those out there :) this wouldn't buy us
anything, since those controllers don't go around doing intelligent things
on the SCSI bus behind our back.

The best way to handle something like this might be for a hardware RAID
controller, at registration time, to pass a flag back to the mid-level
scsi code letting it know that the driver will handle timeout problems
itself. Then,
the mid level scsi code could be modified to do one of two things: it
could simply not set any timers on commands going to that controller, or
it could still set timers on the commands but, when a timer expires, pass
the handling of that timeout to a function registered as part of the
controller driver. This avoids having the low level driver set its own
timeout values, while still allowing that driver to handle the timeout
operation in case some catastrophic failure makes the bus extremely
slow. This would probably be the easiest way of getting this option going
in the mid-level code. You would add a new entry into the scsi_host
template for a function pointer to a timeout handler function. If that
entry is non-NULL, then at timeout, instead of using the mid-level code's
logic, it passes the timeout down to the function registered for that
controller. Then the low level driver could handle the logic itself and
decide upon a course of action. I think this would address the problem of
non-deterministic command times during a rebuild, while also leaving the
handling of timeouts in the mid level code for those drivers that don't
need/want to handle it themselves.
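To sketch what I mean in rough, userspace C (every name here is invented
for illustration -- the real template and command structures in the kernel
look different):

```c
#include <stddef.h>

/* Illustrative stand-ins, NOT the actual kernel structures. */
struct scsi_cmd;                        /* opaque command (think Scsi_Cmnd) */

enum timeout_result {
    TIMEOUT_DEFAULT,                    /* mid-level ran its usual abort/reset logic */
    TIMEOUT_DRIVER_HANDLED              /* low-level driver took over */
};

struct scsi_host_template {
    const char *name;
    /* New entry: if non-NULL, expired timers are handed to the
     * low-level driver instead of the generic mid-level logic. */
    void (*timeout_handler)(struct scsi_cmd *cmd);
};

static int raid_handler_calls;          /* lets the example observe the dispatch */

static void raid_timeout_handler(struct scsi_cmd *cmd)
{
    (void)cmd;                          /* a real driver would decide what to do here */
    raid_handler_calls++;
}

/* Mid-level dispatch on timer expiry. */
enum timeout_result mid_level_timeout(struct scsi_host_template *tpl,
                                      struct scsi_cmd *cmd)
{
    if (tpl->timeout_handler != NULL) {
        tpl->timeout_handler(cmd);      /* controller driver handles it */
        return TIMEOUT_DRIVER_HANDLED;
    }
    return TIMEOUT_DEFAULT;             /* existing abort/reset path */
}
```

Drivers that leave the new entry NULL see no change in behavior at all,
which is the whole point.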

Another possibility would be to add a new driver function that doesn't
actually handle the timeout, but does provide suggestions to the mid level
code for instances such as this. In this way, you could have
scsi_times_out call into the suggestion routine before making a decision
on what to do; the suggestion routine could then pass back SUGGEST_ABORT,
SUGGEST_RESET, or SUGGEST_EXTEND, and the mid level code would call the
abort routine or the reset routine, or extend the timeout on the command,
based upon the suggestion. If no suggestion routine is
registered, then the mid level code could go about business as normal
using its own algorithms.
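Roughly, the suggestion variant would look like this (the SUGGEST_* names
are the ones proposed above; the surrounding types are made up for the
example):

```c
#include <stddef.h>

enum suggestion { SUGGEST_ABORT, SUGGEST_RESET, SUGGEST_EXTEND };

struct scsi_cmd { int timeout; };       /* stand-in for the real command struct */

struct host_ops {
    /* Optional hook consulted before the mid-level acts on a timeout. */
    enum suggestion (*suggest)(struct scsi_cmd *cmd);
};

/* During a live rebuild, a hardware RAID driver might simply ask for
 * more time instead of letting the mid-level abort the command. */
static enum suggestion rebuild_suggest(struct scsi_cmd *cmd)
{
    (void)cmd;
    return SUGGEST_EXTEND;
}

/* Mid-level decision: consult the driver if a routine is registered,
 * otherwise fall back to the normal algorithm (abort first). */
enum suggestion decide_timeout(const struct host_ops *ops,
                               struct scsi_cmd *cmd)
{
    if (ops->suggest != NULL)
        return ops->suggest(cmd);
    return SUGGEST_ABORT;
}
```

The mid level keeps the timers and keeps the final say; the driver only
gets to offer an opinion, which is why this is such a small change.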

IMHO, the suggestion routine would be the easiest and the quickest to
implement, would allow a hardware RAID controller to extend timeouts on
commands instead of going into an abort/reset routine immediately, and
would accomplish the goal you are looking for with a minimum of changes.
It has the side benefit of being able to be used for a few other cases as
well. For example, in the aic7xxx driver, if we get a TARGET_BUSY status
or a QUEUE_FULL status back from a drive, we have one of two choices.
Either requeue the command and hope it doesn't time out, or fiddle with
the status byte to make it look normal, then send it back to the mid level
code with a DID_BUS_BUSY result so it will get requeued to us. A
suggestion routine would allow us to requeue the command, and if it does
timeout, then extend the timeout. We could also keep track of how many
times we have extended the timeout this way (since we wouldn't be sending
the command back, and therefore wouldn't be wiping out our own scb
information), and use that count to switch to an abort after maybe 5
extensions or something of that nature. This actually comes in handy in a few
situations, such as two controllers from two machines on the same scsi
bus, where one machine has a drive tied up and the other is getting target
busy messages as a result. This keeps us from resetting the bus due to
another machine hogging a device, at least up to a point where we say
"OK, that machine isn't giving the thing up, so we reset to try and get
things back to normal" and pass the suggestion up to the mid level code.
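The bounded-extension idea is trivial to express. Again, the names and
the per-command scb field are illustrative, not the aic7xxx driver's real
data structures, and the limit of 5 is just the example number from above:

```c
enum suggestion { SUGGEST_RESET, SUGGEST_EXTEND };

struct scb { int extensions; };         /* per-command driver state */

#define MAX_EXTENSIONS 5                /* arbitrary cutoff, per the text */

enum suggestion suggest_on_timeout(struct scb *scb)
{
    if (scb->extensions < MAX_EXTENSIONS) {
        scb->extensions++;              /* give the other initiator more time */
        return SUGGEST_EXTEND;
    }
    /* "That machine isn't giving the thing up" -- escalate. */
    return SUGGEST_RESET;
}
```

Because the scb never leaves the driver, the counter survives across
extensions, which is exactly what requeueing through the mid level would
destroy.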


* Doug Ledford * Unix, Novell, Dos, Windows 3.x, *
* 873-DIAL * WfW, Windows 95 & NT Technician *
* PPP access $14.95/month *****************************************
* Springfield, MO and surrounding * Usenet news, e-mail and shell account.*
* communities. Sign-up online at * Web page creation and hosting, other *
* 873-9000 V.34 * services available, call for info. *