Freeze ups with Cyclades cards

Eric Schenk (schenk@rnode84.cs.toronto.edu)
Sat, 24 Jun 1995 18:27:15 -0400


greg@wind.rmcc.com (G.W. Wettstein) writes:
>Sorry to shout, IS THERE ANYONE ELSE IN THE WORLD (southern hemisphere
> included.. :-) EXPERIENCING LOCK-UP PROBLEMS WITH THE
> CYCLADES BOARDS?????

I have just spent about three days fighting and solving a problem with
freeze-ups on the Cyclades board that may be related. I can't
be sure, because your original message did not describe the nature
of the freeze-ups in sufficient detail. I have been in contact with
Marcio Saito <marcio@cyclades.com> at Cyclades, one of the two authors
of the driver. He and Randolph Bentson have promised some changes
in the next release of the driver because of the problems I reported.

Let me describe the problem I have been having, and my "solution".
As I said, I am not sure you are seeing the same problem, but it
is possible.

What I have been observing are apparently random cases of ports
on the board shutting down the receiver, but not the transmitter.
On the affected terminals this looks as though the terminal
stops accepting keyboard input, but it can still display output
sent to that line. Although I have not observed it,
the nature of the bug means that both the transmitter and the
receiver could also be disabled on a port.

After instrumenting the code extensively, I discovered that
the driver was sometimes failing to enable interrupts
for a port. The responsible routine in the driver is called
write_cy_cmd. This routine is supposed to write a command into
the control registers on the card. Since the card may be busy,
the routine may have to wait a while before it can issue
the command, but if it has to wait too long, it simply
discards the command.

Now, there are really three separate bugs at this point.
The first is that although the routine returns an error code
when it discards the command, the calling routines ignore the
error and open the port anyway, even though interrupts
are not turned on for that port.
The second bug is that the routine discards the command
under the wrong conditions. The routine starts out by waiting
for up to 1000 microseconds for the previous command to finish.
It is supposed to discard the command only if it waits for the full
1000 usecs. Unfortunately, there is a coding error,
and it discards the command if it had to wait exactly 100 usecs.
After some investigation by Marcio Saito and myself,
it appears that on MOST machines the wait is
0 about 90% of the time, 130-140 usec 5% of the time, and
260-280 usec 10% of the time. It seems that the machine
I was running on just happened to occasionally have a delay of 100 usec,
thus triggering this bug. Combined with the first bug, this
results in a disabled port without any notification in the debugging logs.
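To make the first bug concrete, here is a minimal sketch of a caller
that checks the result instead of ignoring it. It is not the driver's
code: stub_write_cy_cmd, startup_port, the ttyC message and the command
byte are all hypothetical, and the stub simply fails whenever
card_ready is zero.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stub standing in for the driver's write_cy_cmd():
   returns 0 on success, -1 if the command was discarded. */
static int stub_write_cy_cmd(int card_ready, unsigned char cmd)
{
    (void)cmd;
    return card_ready ? 0 : -1;
}

/* Hypothetical port-startup path. Rather than marking the port open
   with its interrupts never enabled, it logs the failure and hands an
   error back to whoever asked for the port to be opened. */
static int startup_port(int line, int card_ready, int *port_is_open)
{
    const unsigned char enable_cmd = 0x0a;  /* hypothetical command byte */

    *port_is_open = 0;
    if (stub_write_cy_cmd(card_ready, enable_cmd) != 0) {
        fprintf(stderr, "cyclades: ttyC%d: enable command discarded\n", line);
        return -EIO;
    }
    *port_is_open = 1;
    return 0;
}
```

In a driver organized along these lines, a discarded command would
show up as an I/O error on open rather than as a port that silently
stops accepting keyboard input.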

Specifically, you will find the following code fragment in the
routine write_cy_cmd in the driver:

    save_flags(flags); cli();
    /* Check to see that the previous command has completed */
    for(i = 0 ; i < 100 ; i++){
        if (base_addr[CyCCR] == 0){
            break;
        }
        udelay(10L);
    }
    /* if the CCR never cleared, the previous command
       didn't finish within the "reasonable time" */
    if ( i == 10 ) {
        restore_flags(flags);
        return (-1);
    }

    /* Issue the new command */
    base_addr[CyCCR] = cmd;
    restore_flags(flags);
    return(0);

Notice that the error branch tests for i == 10 rather
than the correct i == 100. You might try changing that to see
if it has an effect on your problem.
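The effect of the wrong comparison can be checked without hardware.
The sketch below is a model, not driver code: it assumes the CCR reads
nonzero until a chosen number of microseconds has passed, charges
10 usec per udelay(10L), and reports whether the routine would take
the error branch. command_discarded and its parameters are made up
for illustration.

```c
#include <stdbool.h>

/* Model of the wait loop in write_cy_cmd(). The CCR is assumed to read
   nonzero until clear_usec microseconds have elapsed (negative means it
   never clears). 'limit' is the loop bound (100 in the driver) and
   'fail_at' is the value of i the error branch compares against.
   Returns true if the command would be discarded. */
static bool command_discarded(int clear_usec, int limit, int fail_at)
{
    int i;

    for (i = 0; i < limit; i++) {
        int elapsed = i * 10;          /* usec spent in udelay(10L) so far */
        if (clear_usec >= 0 && elapsed >= clear_usec)
            break;                     /* CCR read as 0: card is ready */
        /* the driver calls udelay(10L) here */
    }
    return i == fail_at;
}
```

Besides the 100 usec case, the model shows a second consequence of the
i == 10 test: if the CCR never clears, the loop ends with i == 100, the
error branch is skipped, and the new command is written while the card
is still busy.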

Now, I said there were three bugs, didn't I? Well, the third bug is that
this routine is a busy wait, with interrupts disabled by cli(), that can
last up to 1 millisecond (100 passes of udelay(10L), i.e. 1000 usec).
Most often it lasts almost no time, but about 5-10% of the time it lasts
either 130 usec or 260 usec. This is far too long to spend busy waiting in
a kernel driver, and it might be interfering with other drivers.
Also, that 5-10% could conceivably go up if your software is opening
and closing the ports very often.
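One possible way to avoid spinning, sketched under stated assumptions
rather than taken from the driver or from Cyclades: if the CCR is busy,
park the command and issue it later from a routine that is called
periodically, for example from a timer or the card's interrupt handler.
struct cy_card, cy_submit_cmd, cy_service_ccr and the CyCCR offset are
hypothetical, and a real driver would still need save_flags()/cli()/
restore_flags() around the shared state.

```c
#include <stdbool.h>

#define CyCCR 0x05   /* hypothetical register offset, for this sketch only */

/* Hypothetical per-card state: at most one command waiting for the CCR. */
struct cy_card {
    volatile unsigned char *base_addr;
    bool has_pending;
    unsigned char pending_cmd;
};

/* Issue cmd now if the CCR is free, otherwise park it and return at once.
   Returns 0 if issued, 1 if queued, -1 if a command is already pending.
   (A real driver would protect this with cli()/restore_flags().) */
static int cy_submit_cmd(struct cy_card *card, unsigned char cmd)
{
    if (card->has_pending)
        return -1;                      /* only one slot in this sketch */
    if (card->base_addr[CyCCR] == 0) {
        card->base_addr[CyCCR] = cmd;
        return 0;
    }
    card->pending_cmd = cmd;
    card->has_pending = true;
    return 1;
}

/* Called periodically (timer or interrupt in this hypothetical design):
   issue the parked command once the CCR has cleared.
   Returns true if a command was issued. */
static bool cy_service_ccr(struct cy_card *card)
{
    if (card->has_pending && card->base_addr[CyCCR] == 0) {
        card->base_addr[CyCCR] = card->pending_cmd;
        card->has_pending = false;
        return true;
    }
    return false;
}
```

The trade-off is that a caller no longer knows the command has reached
the card when cy_submit_cmd returns, so anything that depends on it
(such as enabling a port's interrupts) has to wait for the service
routine to run.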

So, the temporary fix is to correct the busy wait loop above, and
instrument the failure case with a printk so you can see if it is
happening. The correct fix will have to involve a rewrite of the
write_cy_cmd routine so that it no longer does busy waiting.
Marcio reports that he and Randolph will be doing this for the next
release of the driver.
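A minimal userspace sketch of the temporary fix: the loop gives up only
after the full 100 polls of 10 usec, and the failure is logged, with
fprintf standing in for printk. The fake register file, the CyCCR
offset, and the simulated udelay() that clears the CCR at a chosen time
are hypothetical scaffolding so the routine can run outside the kernel;
comments mark where save_flags()/cli()/restore_flags() sit in the driver.

```c
#include <stdio.h>

#define CyCCR       0x05   /* hypothetical register offset, sketch only */
#define CCR_TRIES   100    /* 100 polls ...                             */
#define CCR_POLL_US 10     /* ... of 10 usec each = 1000 usec at most   */

/* Hypothetical stand-ins for the kernel environment: a fake register
   file and a udelay() that advances a simulated clock and clears the
   CCR once the card has "finished" its previous command. */
static volatile unsigned char fake_regs[16];
static int sim_clock_us;
static int sim_ccr_clears_at_us = -1;   /* -1: the card never finishes */

static void udelay(unsigned long us)
{
    sim_clock_us += (int)us;
    if (sim_ccr_clears_at_us >= 0 && sim_clock_us >= sim_ccr_clears_at_us)
        fake_regs[CyCCR] = 0;
}

/* Corrected wait: discard only after the full 1000 usec, and say so. */
static int write_cy_cmd(volatile unsigned char *base_addr, unsigned char cmd)
{
    int i;

    /* driver: save_flags(flags); cli(); */
    for (i = 0; i < CCR_TRIES; i++) {
        if (base_addr[CyCCR] == 0)
            break;
        udelay(CCR_POLL_US);
    }
    if (i == CCR_TRIES) {
        /* driver: restore_flags(flags); and printk() instead of fprintf */
        fprintf(stderr, "cyclades: CCR busy for %d usec, command 0x%02x discarded\n",
                CCR_TRIES * CCR_POLL_US, (unsigned)cmd);
        return -1;
    }
    base_addr[CyCCR] = cmd;
    /* driver: restore_flags(flags); */
    return 0;
}
```

With the card clearing the CCR after 100 usec, this version issues the
command instead of discarding it; only a CCR that stays busy for the
whole 1000 usec produces the error and the log line.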

-- eric

---------------------------------------------------------------------------
Eric Schenk schenk@cs.toronto.edu
Department of Computer Science, University of Toronto