Re: oh woe, oh woe, oh woe is me... [2.0.34p16 + 3com]

Donald Becker (becker@cesdis1.gsfc.nasa.gov)
Wed, 27 May 1998 10:47:53 -0400 (EDT)


On Wed, 27 May 1998, Chris Evans wrote:

> On Wed, 27 May 1998, Matthew Kirkwood wrote:
>
> > Network driver in (1) was a 3c59x as shipped in 34p16 with the transmit
> > protect patch from DaveM; in (2) it was the vanilla 34p16 driver.
>
> Actually the driver in 34p16 _has_ the DaveM protect patch, by the
> looks of it. I extended the save_flags() protection in (1) to encompass
> more of the transmit routine, effectively making the send and receive
> routines on the card mutually exclusive. We'll see if that helps.

Well, sure, you can put mutual exclusion around everything. But that
doesn't directly fix a real problem; it just changes the timing so that the
problem doesn't occur. The real problem is that some SMP machines can
make multiple calls to the interrupt handler, and the cli() synchronizes the
processors to reduce the occurrence of that bug.
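
One way to make overlapping handler calls visible, rather than hiding them
behind a wider cli() section, is a re-entry flag in the handler itself. What
follows is only a userspace sketch in C11 under stated assumptions: the
struct fake_nic type and the fake_nic_* functions are hypothetical names
invented for illustration, not the 3c59x code, and the nested call is
simulated by recursion rather than by a second processor.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device driver state (not the real driver). */
struct fake_nic {
    atomic_flag in_handler;   /* set while a handler invocation is running */
    atomic_uint handled;      /* interrupts that were actually serviced */
    atomic_uint reentries;    /* overlapping calls that were detected */
};

void fake_nic_init(struct fake_nic *nic)
{
    atomic_flag_clear(&nic->in_handler);
    atomic_init(&nic->handled, 0);
    atomic_init(&nic->reentries, 0);
}

/*
 * Returns true if this call serviced the interrupt, false if it found
 * another invocation already in progress and backed out.
 * simulate_nested_call is an illustration-only knob that fakes a second
 * call arriving while the first is still running.
 */
bool fake_nic_interrupt(struct fake_nic *nic, bool simulate_nested_call)
{
    /* test-and-set returns the previous value: true means "already busy". */
    if (atomic_flag_test_and_set(&nic->in_handler)) {
        atomic_fetch_add(&nic->reentries, 1);
        return false;
    }

    if (simulate_nested_call)
        fake_nic_interrupt(nic, false);   /* will be counted and refused */

    /* Real work (reading status, draining rings) would go here. */
    atomic_fetch_add(&nic->handled, 1);

    atomic_flag_clear(&nic->in_handler);
    return true;
}
```

A non-zero reentries count says the handler really was entered twice, which
is the kind of evidence that separates a platform problem from a driver one.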

> > Are either of these worth investigating? (Chris reckons that the oopses
> > are due to an unprotected error path in 3c59x which dereferences NULL
> > pointers left, right and centre.)

OK, which error path?
It's easy to guess at these things, and to put locks and null-checks
everywhere, but I'm not going to put the changes into my copy of the driver
unless you can show a problem that should be solved at the individual driver
level.

Yes, I can be stubborn, but I've been the one spending a large part of every
day answering support questions, so I feel that I have a right (and
responsibility -- see below) to be stubborn. It's easy to add
null-check-everything patches to someone else's code when you don't have to
maintain it or be concerned about the performance.

On a loosely related flame: As a community we are doing a very bad job at
minimizing kernel bloat. It's too easy to add new code, and few people that
add code are willing to support it for more than a few months. No one is
saying "that feature is neat, but it's not useful for most users. Keep it
as a separate kernel patch". Example: years ago we made a decision not to
implement BPF, because the interpreted packet-filter "language" wasn't a
good design to put into a kernel. But that decision was reversed by someone
that apparently thought "oohhh, it doesn't have this feature. Let's add it."
How can we keep a greatly-needed-void empty?

> No, I reckon the nasty oopses are caused by dangling pointers to a
> "struct device*", which is kmalloc'ed upon insmod of driver and kfree'd
> upon module unload.
>
> struct skbuff has a device pointer, to name one important example ;-)
>
> This nasty bug looks very awkward to fix, do we want to go mad with module
> usage counts? Or scanning packet lists and setting device pointers to some
> kind of bucket device?? yuck.

Yes, this is a problem!
I suspect that a large part of this problem is that you can add a route
to a card that's down: if the interface is down, it might be removed at any
time. (Was this added by someone that likes the BSD semantics? It doesn't
work with modules!)
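
For what the usage-count approach mentioned in the quoted text might look
like, here is a minimal single-threaded userspace sketch. Everything in it
is an assumption made for illustration: struct toy_device, struct
toy_packet and the toy_* functions are hypothetical and are not the
kernel's structures or API, and a real implementation would need atomic
counts and locking.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical device object: freed only when the last reference goes. */
struct toy_device {
    int  refcount;     /* one for registration, plus one per packet */
    bool registered;   /* cleared at "module unload" */
};

/* Hypothetical packet that points back at its device. */
struct toy_packet {
    struct toy_device *dev;
};

struct toy_device *toy_device_create(void)
{
    struct toy_device *dev = malloc(sizeof(*dev));
    if (dev) {
        dev->refcount = 1;        /* the registration's own reference */
        dev->registered = true;
    }
    return dev;
}

/* Drop one reference; returns true if this call freed the device. */
bool toy_device_put(struct toy_device *dev)
{
    if (--dev->refcount > 0)
        return false;
    free(dev);
    return true;
}

/* A queued packet takes its own reference on the device. */
void toy_packet_attach(struct toy_packet *pkt, struct toy_device *dev)
{
    dev->refcount++;
    pkt->dev = dev;
}

/* Releasing the packet drops that reference and clears the pointer. */
bool toy_packet_release(struct toy_packet *pkt)
{
    struct toy_device *dev = pkt->dev;
    pkt->dev = NULL;
    return toy_device_put(dev);
}

/* "Unload": mark the device dead, but free it only if nothing uses it. */
bool toy_device_unregister(struct toy_device *dev)
{
    dev->registered = false;
    return toy_device_put(dev);
}
```

With a packet still queued, toy_device_unregister() returns false and the
memory stays valid; the packet's later release is what frees it. The cost
is the extra bookkeeping on every packet, which is exactly the "go mad with
usage counts" worry.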

> As for the 3c59x driver bugs, well bleh. I thought upgrading our 3c590 to
> a 3c900 would give us some plain sailing. It seems a tulip card is on the
> shopping list after all.

I ran a 3c595 card continuously over the weekend with no observed
problems. It did pause for a few seconds every time it got a 16-collision
error, and 16-collision errors were much more frequent than I would like. I
tested on a 486 SP3G motherboard, which has known PCI burst problems that
usually trigger any bugs in the PCI bus error paths.

I've gotten many good reports about the reliability of the current driver
with both the 3c905 and 3c905B. There is always at least one person that will
swear up-and-down that the driver has a problem when something else in their
system is actually the problem. I don't know how many times I have sent a
one-line "Don't use 'gated'" response to angry messages about my drivers.
The point:

Donald Becker becker@cesdis.gsfc.nasa.gov
USRA-CESDIS, Center of Excellence in Space Data and Information Sciences.
Code 930.5, Goddard Space Flight Center, Greenbelt, MD. 20771
301-286-0882 http://cesdis.gsfc.nasa.gov/pub/people/becker/whoiam.html
