Re: [PATCH]: PCI Error Recovery: Symbios SCSI device driver

From: Linas Vepstas
Date: Tue Oct 31 2006 - 18:14:02 EST


On Tue, Oct 31, 2006 at 11:55:07AM -0700, Matthew Wilcox wrote:
> On Fri, Oct 20, 2006 at 01:05:10PM -0500, Linas Vepstas wrote:
> > Index: linux-2.6.19-rc1-git11/drivers/scsi/sym53c8xx_2/sym_glue.c
>
> This needs to be before the printf_debug call.

Right. This'll be in the next patch submision.

> > @@ -726,6 +731,19 @@ static int sym_eh_handler(int op, char *
> > + /* We may be in an error condition because the PCI bus
> > + * went down. In this case, we need to wait until the
> > + * PCI bus is reset, the card is reset, and only then
> > + * proceed with the scsi error recovery. There's no
> > + * point in hurrying; take a leisurely wait.
> > + */
> > +#define WAIT_FOR_PCI_RECOVERY 35
> > + if ((np->s.device->error_state != pci_channel_io_normal) &&
> > + (np->s.device->error_state != 0) &&
> > + (wait_for_completion_timeout(&np->s.io_reset_wait,
> > + WAIT_FOR_PCI_RECOVERY*HZ) == 0))
> > + return SCSI_FAILED;
> > +
>
> Is it safe / reasonable / a good idea to sleep for 35 seconds in the EH
> handler? I'm not that familiar with how the EH code works. It has its
> own thread, so I suppose that's OK.

As James pointed out, the pci channel is is not available until the
reset sequence is done. The 35 seconds seemed like a reasonable time
to wait for the pci reset to complete; hopefuly it will complete much
sooner. If the pci reset fails for some reason, then things are hosed,
and the sysadmin will need to intervene.

> Are the driver's data structures still intact after a reset?

They should be. No one is attempting to free or shut down the driver.

> I generally prefer not to be so perlish in conditionals, ie:

I wasn't sure what style is popular. Actually, I agree with you on
that, and picked the other style cause i thought it was prefered.
Nex patch submission will be more nested.

> if ((np->s.device->error_state != pci_channel_io_normal) &&
> (np->s.device->error_state != 0) {
>
> Why is the condition so complicated though? What does 0 mean if it's
> not io_normal?

Its an unresolved stupidity. For some imagined, hypothetical
reason, it momentarily seemed wise to make pci_channel_io_normal
be non-zero; but this imagined reason, although vague, did manage
to bite back, as the above code demonstrates.

> At least let's hide that behind a convenience macro:
>
> if (abnormal_error_state(np->s.device->error_state)) {

Should I submit a patch to make pci_channel_io_normal be zero,
or submit a patch to define abnormal_error_state, or both?
Both, probably; I don't have much of an opinion.

> Though, since INB and INW will return 0xff and 0xffff, why not use that
> as our test rather than using a counter?

Right. I wanted to avoid checking for specific values,
as that vaguely seemed more robust; the direct check is easier.
I'll change this.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/