Re: [PATCH/RFC] I/O-check interface for driver's error handling

From: Linas Vepstas
Date: Tue Mar 01 2005 - 14:01:16 EST


On Tue, Mar 01, 2005 at 10:08:48AM -0800, Linus Torvalds was heard to remark:
>
> On Tue, 1 Mar 2005, Andi Kleen wrote:
> >
> > But what would the default handling be? It would be nice if there
> > was a simple way for a driver to say "just shut me down on an error"
> > without adding iochk_* to each function. Ideally this would be just
> > a standard callback that knows how to clean up the driver.
>
> There can't be any.
>
> The thing is, IO errors just will be very architecture-dependent. Some

:)
FWIW, I've got a working prototype that is ppc64-specific, and I've
been yelled at for not proposing an architecture-generic API. So my
current goal is to hash out enough commonality with Seto to come up
with a generic API that's acceptable for the mainline kernel.

(FYI, the 'prototype' currently ships with Novell/SUSE SLES9, but
I haven't been successful in getting the patches upstream.)

> might have exceptions happening, without the exception handler really
> having much of an idea of who caused it, unless that driver had prepared
> it some way, and gotten the proper locks.

Most hotplug-capable drivers are "most of the way there", since they
can deal with the sudden loss of i/o to the pci card, and know how
to clean themselves up.

> A non-converted driver just doesn't _do_ any of that. It doesn't guarantee
> that it's the only one accessing that bus, since it doesn't do the
> "iocheck_clear()/iocheck_read()" things that imply all the locking etc.

Yes; for example, the pci error might affect multiple device drivers.
The proposal is that there should be a "master cleanup thread" to
deal with this (see my other email).

> Shutting down the hardware by default might be a horribly bad thing to do

The current ppc64 prototype code does a pci-hotplug-remove/hotplug-add
if we've detected a pci error, and the affected device driver doesn't
know what to do. This works for ethernet cards, but can't work for
anything with a file system on it (because a pci-hotplug-remove
on a scsi card trickles up to the block device, which trickles up to
the file system, which can't be unmounted post-facto.)
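
As a rough sketch of that fallback decision (every name below is made up
for illustration and is not the prototype's actual code; the "backs a
mounted file system" flag is an assumption about how the problem case
might be detected):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; not the ppc64 prototype's real API. */
enum pcierr_fallback {
    PCIERR_DRIVER_HANDLES,  /* driver has its own recovery path */
    PCIERR_HOTPLUG_CYCLE,   /* hotplug-remove followed by hotplug-add */
    PCIERR_NO_SAFE_FALLBACK /* a remove would pull storage from under a fs */
};

struct pcierr_dev {
    bool driver_has_error_handler; /* driver knows what to do on a pci error */
    bool backs_mounted_fs;         /* e.g. a scsi card under a mounted fs */
};

/* Pick what the platform code should do after detecting a pci error. */
enum pcierr_fallback pcierr_choose_fallback(const struct pcierr_dev *dev)
{
    assert(dev != NULL);

    if (dev->driver_has_error_handler)
        return PCIERR_DRIVER_HANDLES;

    /* A hotplug-remove trickles up to the block layer and the file
     * system, which can't be unmounted after the fact. */
    if (dev->backs_mounted_fs)
        return PCIERR_NO_SAFE_FALLBACK;

    /* Works for things like ethernet cards. */
    return PCIERR_HOTPLUG_CYCLE;
}
```

An ethernet-style device with no handler gets the hotplug cycle; a storage
device with no handler gets no safe default at all.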

> In fact, I'd argue that even a driver that _uses_ the interface should not
> necessarily shut itself down on error. Obviously, it should always log the
> error, but outside of that it might be good if the operator can decide and
> set a flag whether it should try to re-try (which may not always be
> possible, of course), shut down, or just continue.

On ppc64, "just continue" is not an option; the pci slot is "isolated":
all i/o is blocked, including dma.

The current prototype code tells the device driver that the pci slot is
hung, then it resets the slot, then it tells the device driver that the
pci slot is good-to-go again.

My goal is to negotiate a standard set of interfaces in struct pci_driver
to do the above. (see other email).
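
To make the sequence concrete, here is a minimal sketch in plain userspace
C. The struct and callback names (sketch_pci_driver, slot_frozen,
slot_resumed) are hypothetical and are not the proposed additions to
struct pci_driver; the "reset" is only a counter standing in for the real
slot reset.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a pci device; fields exist only for this sketch. */
struct sketch_pci_dev {
    int io_enabled; /* 1 while the driver may touch the device */
    int resets;     /* how many times the slot has been reset */
};

/* Hypothetical callbacks a driver would supply. */
struct sketch_pci_driver {
    void (*slot_frozen)(struct sketch_pci_dev *dev);  /* slot is hung */
    void (*slot_resumed)(struct sketch_pci_dev *dev); /* slot is good to go */
};

/* Placeholder for the platform-specific slot reset. */
static void sketch_reset_slot(struct sketch_pci_dev *dev)
{
    dev->resets++;
}

/* Notify the driver, reset the slot, notify the driver again.
 * Returns 0 on success, -1 if the driver lacks the callbacks. */
int sketch_recover_slot(const struct sketch_pci_driver *drv,
                        struct sketch_pci_dev *dev)
{
    assert(dev != NULL);

    if (drv == NULL || drv->slot_frozen == NULL || drv->slot_resumed == NULL)
        return -1;

    drv->slot_frozen(dev);
    sketch_reset_slot(dev);
    drv->slot_resumed(dev);
    return 0;
}

/* A toy driver that just stops and restarts its i/o. */
static void demo_frozen(struct sketch_pci_dev *dev)  { dev->io_enabled = 0; }
static void demo_resumed(struct sketch_pci_dev *dev) { dev->io_enabled = 1; }

const struct sketch_pci_driver demo_driver = { demo_frozen, demo_resumed };
```

A driver that doesn't fill in the callbacks gets -1 back, which is where
something like the hotplug-remove/add fallback would have to take over.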

--linas
