Re: Re[2]: Kernel testing

Gerard Roudier (groudier@club-internet.fr)
Mon, 14 Apr 1997 08:54:35 +0000 (GMT)

Next message: Doug Ledford: "Re: SCSI Bus hanging"
Previous message: bofh@snoopy.virtual.net.au: "Re: UNIX: The Experiment That Failed"
In reply to: Systemkennung Linux: "Re: Re[2]: Kernel testing"
Next in thread: Rodrigo Barbosa: "Re: UNIX: The Experiment That Failed"
Reply: Rodrigo Barbosa: "Re: UNIX: The Experiment That Failed"

On Mon, 14 Apr 1997, Systemkennung Linux wrote:

> Hi,
>
> > I think that unless we come up with some ideas on how to implement this,
> > we should drop it.
> >
> > 1. How are we going to implement issues like timings? Think about the
> > recent TCP-stall problem for one.
>
> Timings and race conditions are difficult to test. Another problem is
> that we will hardly be able to cover all the possible cases. If
> you try to check things such as the handling of hardware
> errors (r/w errors on disks, ...), you'll have to add special code to
> "inject" errors.

Given that any untested code is likely not to work properly, the
error "injection" technique is often used to perform basic error
recovery testing in pieces of code that deal with the hardware.

Until now, I have removed the error "injection" code before releasing
each version of the BSD-ported ncr driver. This interesting thread, and
especially your posting, made me decide to release driver versions
with this code embedded from now on, but obviously not compiled by default.
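
To make the idea concrete, here is a minimal user-space sketch of
compiled-out error injection. It is not code from the driver:
DEMO_INJECT_ERRORS, demo_inject_next, demo_read_block() and
demo_read_with_retry() are hypothetical names invented for the
illustration, and the "device" is just a memset(). The macro is defined
inside the sketch only so that the injecting build is the one shown; a
real build would leave it undefined by default.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical switch. Normally left undefined (or passed with -D);
 * defined here only so the sketch exercises the injection path. */
#define DEMO_INJECT_ERRORS

#ifdef DEMO_INJECT_ERRORS
static int demo_inject_next = 0;	/* set to 1 to make the next read fail */
#endif

/* Pretend device read: 0 on success, -1 on a (simulated) hardware error. */
static int demo_read_block(char *buf, size_t len)
{
	assert(buf != NULL);
#ifdef DEMO_INJECT_ERRORS
	if (demo_inject_next) {
		demo_inject_next = 0;	/* one-shot fault */
		return -1;
	}
#endif
	memset(buf, 0xAA, len);
	return 0;
}

/* Recovery path under test: retry up to max_tries times.
 * Returns the number of attempts used, or -1 if all of them failed. */
static int demo_read_with_retry(char *buf, size_t len, int max_tries)
{
	int attempt;

	for (attempt = 1; attempt <= max_tries; attempt++) {
		if (demo_read_block(buf, len) == 0)
			return attempt;
		fprintf(stderr, "demo: read failed, retrying\n");
	}
	return -1;
}
```

Setting demo_inject_next to 1 before a call makes the first read fail,
so the retry path runs once and the call reports two attempts. Without
the macro, the injection code disappears from the build entirely.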

I have just added some documentation about the error "injection" code;
here is a pre-patch of the next version of the driver I will release.
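
For readers who only want the gist of the patch: the argument of the
control command is mapped to a small integer mode, and the timer
handler acts on that mode every 30 seconds. The following is a rough
standalone sketch of that logic, not the driver code itself. The mode
values (0 to 3) and keywords are the ones used in the patch; the
demo_* names are hypothetical, strcmp() stands in for the driver's own
keyword matching, and the tick counter is passed in as a plain argument
instead of being read from jiffies.

```c
#include <assert.h>
#include <string.h>

/* Mode values as used by the patch: 0 none, 1 sge, 2 abort, 3 reset. */
enum { DEMO_NONE = 0, DEMO_SGE = 1, DEMO_ABORT = 2, DEMO_RESET = 3 };

/* Hypothetical helper: map a keyword to its mode, or -1 if unknown. */
static int demo_parse_mode(const char *arg)
{
	assert(arg != NULL);
	if (strcmp(arg, "sge") == 0)
		return DEMO_SGE;
	if (strcmp(arg, "abort") == 0)
		return DEMO_ABORT;
	if (strcmp(arg, "reset") == 0)
		return DEMO_RESET;
	if (strcmp(arg, "none") == 0)
		return DEMO_NONE;
	return -1;
}

/* Hypothetical stand-in for the check done in the timer handler.
 * Returns 1 when an error should be triggered now, 0 otherwise.
 * `now` and `*last` are tick counts; `interval` is 30*HZ in the patch.
 * Simplification: *last == 0 means "clock not started yet", where the
 * patch keeps a separate `start` variable for that purpose. */
static int demo_should_trigger(int mode, unsigned long now,
			       unsigned long *last, unsigned long interval)
{
	if (mode == DEMO_NONE)
		return 0;
	if (*last == 0)
		*last = now;		/* first call only starts the clock */
	if (now < *last + interval)
		return 0;		/* interval not yet elapsed */
	*last = now;
	return 1;
}
```

With an interval of 3000 ticks and the clock started at tick 100, the
sketch stays quiet until tick 3100 and then fires once per interval,
as long as the mode is not "none".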

--
Gerard.

(This patch is against driver version 1.18d)

--- linux/drivers/scsi/ChangeLog.ncr53c8xx.1.18d	Sun Apr 13 17:45:46 1997
+++ linux/drivers/scsi/ChangeLog.ncr53c8xx	Mon Apr 14 08:20:13 1997
@@ -1,3 +1,14 @@
+Mon Apr 14  9:00 1997 Gerard Roudier (groudier@club-internet.fr)
+	* revision 1.18e
+	- incorporate in the driver the code used for error recovery
+	  testing. This code is normally not compiled; you have to define
+	  SCSI_NCR_DEBUG_ERROR_RECOVERY in order to compile it.
+	- reset all when an unexpected data cycle is detected while
+	  disconnecting.
+	- make changes to abort() and reset() functions according to
+	  Leonard's documentation.
+	- small fix in some message for hard errors.
+
 Sat Apr 5  13:00 1997 Gerard Roudier (groudier@club-internet.fr)
 	* revision 1.18d
 	- Probe NCR pci device ids in reverse order if asked by user from 
--- linux/drivers/scsi/README.ncr53c8xx.1.18d	Sun Apr 13 17:46:14 1997
+++ linux/drivers/scsi/README.ncr53c8xx	Mon Apr 14 08:19:50 1997
@@ -1,10 +1,10 @@
-The linux NCR53C8XX driver README file
+The Linux NCR53C8XX driver README file
 
 Written by Gerard Roudier <groudier@club-internet.fr>
 21 Rue Carnot
 95170 DEUIL LA BARRE - FRANCE
 
-6 April 1997
+14 April 1997
 ===============================================================================
 
 1.  Introduction
@@ -22,6 +22,7 @@
       8.5  Set debug mode
       8.6  Clear profile counters
       8.7  Set flag (no_sync)
+      8.8  Debug error recovery
 9.  Configuration parameters
 10. Boot setup commands
       10.1 Syntax
@@ -203,7 +204,6 @@
   IO port address 0x6000, IRQ number 10
   Using memory mapped IO at virtual address 0x282c000
   Synchronous transfer period 25, max commands per lun 4
-
 Profiling information:
   num_trans    = 18014
   num_kbytes   = 671314
@@ -390,6 +390,28 @@
     - setflag all
       will allow disconnection for all devices on the SCSI bus.
 
+
+8.8 Debug error recovery
+
+    debug_error_recovery <error to trigger>
+
+    Available error types to trigger:
+        sge:     SCSI gross error
+        abort:   abort command from the middle-level driver
+        reset:   reset command from the middle-level driver
+        none:    restore driver normal behaviour
+
+    The code corresponding to this feature is normally not compiled.
+    Its purpose is driver testing only. In order to compile the code
+    that allows you to trigger error recovery, you must define
+    SCSI_NCR_DEBUG_ERROR_RECOVERY at compile time.
+    If you have compiled the driver with this option, nothing will happen
+    as long as you do not use the control command 'debug_error_recovery'
+    with sge, abort or reset as argument.
+    If you select an error type, it will be triggered by the driver every
+    30 seconds.
+
+
 9. Configuration parameters
 
 If the firmware of all your devices is perfect enough, all the
@@ -736,9 +758,9 @@
 
 You must untar the distribution with the following command:
 
-	tar zxvf ncrBsd2Linux-1.18d-src.tar.gz
+	tar zxvf ncrBsd2Linux-1.18e-src.tar.gz
 
-The sub-directory ncr53c8xx-1.18d will be created. Change to this directory.
+The sub-directory ncr53c8xx-1.18e will be created. Change to this directory.
 
 
 12.2 Installation procedure
--- linux/drivers/scsi/ncr53c8xx.h.1.18d	Sat Apr  5 12:38:51 1997
+++ linux/drivers/scsi/ncr53c8xx.h	Sun Apr 13 18:39:50 1997
@@ -45,7 +45,7 @@
 /*
 **	Name and revision of the driver
 */
-#define SCSI_NCR_DRIVER_NAME		"ncr53c8xx - revision 1.18d"
+#define SCSI_NCR_DRIVER_NAME		"ncr53c8xx - revision 1.18e"
  
 /*
 **	If SCSI_NCR_SETUP_SPECIAL_FEATURES is defined,
--- linux/drivers/scsi/ncr53c8xx.c.1.18d	Sun Apr  6 21:08:15 1997
+++ linux/drivers/scsi/ncr53c8xx.c	Sun Apr 13 18:41:31 1997
@@ -40,7 +40,7 @@
 */
 
 /*
-**	6 April 1997, version 1.18d
+**	12 April 1997, version 1.18e
 **
 **	Supported SCSI-II features:
 **	    Synchronous negotiation
@@ -537,7 +537,29 @@
 	#define DEBUG_FLAGS	SCSI_NCR_DEBUG_FLAGS
 #endif
 
-
+/*
+**    Embedded error recovery debug code.
+**
+**    This code is conditioned by SCSI_NCR_DEBUG_ERROR_RECOVERY.
+**    It can only be enabled after boot-up with a control command.
+**
+**    Every 30 seconds the timer handler of the driver decides to 
+**    change the behaviour of the driver in order to trigger errors.
+**
+**    If last command was "debug_error_recovery sge", the driver sets sync 
+**    offset of all targets that use sync transfers to 2, and so will get 
+**    a SCSI gross error at the next read operation.
+**
+**    If last command was "debug_error_recovery abort", the driver does 
+**    not signal new scsi commands to the script processor, until it is asked 
+**    to abort or reset a command by the mid-level driver.
+**
+**    If last command was "debug_error_recovery reset", the driver does 
+**    not signal new scsi commands to the script processor, until it is asked 
+**    to reset a command by the mid-level driver.
+**
+**    The command "debug_error_recovery none" makes the driver behave normally.
+*/
 
 /*==========================================================
 **
@@ -798,6 +820,10 @@
 #define UC_SETFLAG	15
 #define UC_CLEARPROF	16
 
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+#define UC_DEBUG_ERROR_RECOVERY 17
+#endif
+
 #define	UF_TRACE	(0x01)
 #define	UF_NODISC	(0x02)
 
@@ -1347,6 +1373,10 @@
 					/* that we can't put into the squeue */
 	u_long	settle_time;		/* Reset in progess		     */
 	u_char	release_stage;		/* Synchronisation stage on release  */
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	u_char	stalling;
+	u_char	debug_error_recovery;
+#endif
 
 	/*-----------------------------------------------
 	**	Added field to support differences
@@ -4463,6 +4493,9 @@
 	**	Script processor may be waiting for reselect.
 	**	Wake it up.
 	*/
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	if (!np->stalling)
+#endif
 	OUTB (nc_istat, SIGP);
 
 	/*
@@ -4520,7 +4553,7 @@
 **
 **==========================================================
 */
-int ncr_reset_bus (Scsi_Cmnd *cmd)
+int ncr_reset_bus (Scsi_Cmnd *cmd, int sync_reset)
 {
         struct Scsi_Host   *host      = cmd->host;
 /*	Scsi_Device        *device    = cmd->device; */
@@ -4530,6 +4563,11 @@
 	u_long flags;
 	int found;
 
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	if (np->stalling)
+		np->stalling = 0;
+#endif
+
 	save_flags(flags); cli();
 /*
  * Return immediately if reset is in progress.
@@ -4572,11 +4610,12 @@
  */
 	ncr_wakeup(np, HS_RESET);
 /*
- * If the involved command was not in a driver queue, and is 
- * not in the waiting list, complete it with DID_RESET status,
+ * If the involved command was not in a driver queue, and the 
+ * scsi driver told us reset is synchronous, and the command is not 
+ * currently in the waiting list, complete it with DID_RESET status,
  * in order to keep it alive.
  */
-	if (!found && cmd && !remove_from_waiting_list(np, cmd)) {
+	if (!found && sync_reset && !retrieve_from_waiting_list(0, np, cmd)) {
 		cmd->result = ScsiResult(DID_RESET, 0);
 		cmd->scsi_done(cmd);
 	}
@@ -4606,6 +4645,11 @@
 	int found;
 	int retv;
 
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	if (np->stalling == 2)
+		np->stalling = 0;
+#endif
+
 	save_flags(flags); cli();
 /*
  * First, look for the scsi command in the waiting list
@@ -4662,6 +4706,9 @@
 	**      processor will sleep on SEL_WAIT_RESEL.
 	**      Let's wake it up, since it may have to work.
 	*/
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	if (!np->stalling)
+#endif
 	OUTB (nc_istat, SIGP);
 
 	restore_flags(flags);
@@ -5928,6 +5975,11 @@
 	case UC_CLEARPROF:
 		bzero(&np->profile, sizeof(np->profile));
 		break;
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	case UC_DEBUG_ERROR_RECOVERY:
+		np->debug_error_recovery = np->user.data;
+		break;
+#endif
 	}
 	np->user.cmd=0;
 }
@@ -5977,6 +6029,54 @@
 	add_timer(&np->timer);
 
 	/*
+	** 	If SCSI_NCR_DEBUG_ERROR_RECOVERY is defined, we want to 
+	**	simulate common errors in order to test error recovery.
+	*/
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	do {
+		static u_long start = 0l;
+		static u_long last = 0l;
+
+		if (!np->debug_error_recovery)
+			break;
+		if (!start) {
+			start = jiffies;
+			last  = start;
+		}
+		if (jiffies < last + 30*HZ)
+			break;
+		last = jiffies;
+		/*
+		 * This one triggers SCSI gross errors.
+		 */
+		if (np->debug_error_recovery == 1) {
+			int i;
+			printf("%s: testing error recovery from SCSI gross error...\n", ncr_name(np));
+			for (i = 0 ; i < MAX_TARGET ; i++) {
+				if (np->target[i].sval & 0x1f) {
+					np->target[i].sval &= ~0x1f;
+					np->target[i].sval += 2;
+				}
+			}
+		}
+		/*
+		 * This one triggers abort from the mid-level driver.
+		 */
+		else if (np->debug_error_recovery == 2) {
+			printf("%s: testing error recovery from mid-level driver abort()...\n", ncr_name(np));
+			np->stalling = 2;
+		}
+		/*
+		 * This one triggers reset from the mid-level driver.
+		 */
+		else if (np->debug_error_recovery == 3) {
+			printf("%s: testing error recovery from mid-level driver reset()...\n", ncr_name(np));
+			np->stalling = 3;
+		}
+	} while (0);
+#endif
+
+	/*
 	**	If we are resetting the ncr, wait for settle_time before 
 	**	clearing it. Then command processing will be resumed.
 	*/
@@ -6083,6 +6183,9 @@
 			*/
 			ncr_complete (np, cp);
 
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+			if (!np->stalling)
+#endif
 			OUTB (nc_istat, SIGP);
 		}
 		restore_flags(flags);
@@ -6137,6 +6240,7 @@
 {
 	u_int32	dsp;
 	int	script_ofs;
+	int	script_size;
 	char	*script_name;
 	u_char	*script_base;
 	int	i;
@@ -6145,11 +6249,13 @@
 
 	if (dsp > np->p_script && dsp <= np->p_script + sizeof(struct script)) {
 		script_ofs	= dsp - np->p_script;
+		script_size	= sizeof(struct script);
 		script_base	= (u_char *) np->script;
 		script_name	= "script";
 	}
 	else {
 		script_ofs	= dsp - np->p_scripth;
+		script_size	= sizeof(struct scripth);
 		script_base	= (u_char *) np->scripth;
 		script_name	= "scripth";
 	}
@@ -6161,7 +6267,7 @@
 		(unsigned)INL (nc_dbc));
 
 	if (((script_ofs & 3) == 0) &&
-	    (unsigned)script_ofs < sizeof(struct script)) {
+	    (unsigned)script_ofs < script_size) {
 		printf ("%s: script cmd = %08x\n", ncr_name(np),
 			(int) *(ncrcmd *)(script_base + script_ofs));
 	}
@@ -6192,6 +6298,11 @@
 	*/
 	while ((istat = INB (nc_istat)) & INTF) {
 		if (DEBUG_FLAGS & DEBUG_TINY) printf ("F ");
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+		if (np->stalling)
+			OUTB (nc_istat, INTF);
+		else
+#endif
 		OUTB (nc_istat, (istat & SIGP) | INTF);
 		np->profile.num_fly++;
 		ncr_wakeup (np, 0);
@@ -6363,27 +6474,13 @@
 		((INL(nc_dbc) & 0xf8000000) == SCR_WAIT_DISC)) {
 		/*
 		**      Unexpected data cycle while waiting for disconnect.
-		*/
-		if (INB(nc_sstat2) & LDSC) {
-			/*
-			**	It's an early reconnect.
-			**	Let's continue ...
-			*/
-			OUTONB (nc_dcntl, (STD|NOCOM));
-			/*
-			**	info message
-			*/
-			printf ("%s: INFO: LDSC while IID.\n",
-				ncr_name (np));
-			return;
-		};
-		printf ("%s: target %d doesn't release the bus.\n",
-			ncr_name (np), (int)INB (nc_ctest0)&0x0f);
-		/*
-		**	return without restarting the NCR.
-		**	timeout will do the real work.
-		*/
-		return;
+		**	LDSC and CON bits may help in order to understand 
+		**	what really happened. Print some info message and let 
+		**	the reset function reset the BUS and the NCR.
+		*/
+		printf("%s:%d: data cycle while waiting for disconnect, LDSC=%d CON=%d\n",
+			ncr_name (np), (int)(INB(nc_ctest0)&0x0f),
+			(0!=(INB(nc_sstat2)&LDSC)), (0!=(INB(nc_scntl1)&ISCON)));
 	};
 
 	/*----------------------------------------
@@ -8617,25 +8714,95 @@
 **   Linux entry point of reset() function
 */
 
-#if	LINUX_VERSION_CODE >= LinuxVersionCode(1,3,98)
+#if defined SCSI_RESET_SYNCHRONOUS && defined SCSI_RESET_ASYNCHRONOUS
+
 int ncr53c8xx_reset(Scsi_Cmnd *cmd, unsigned int reset_flags)
+{
+	int sts;
+	unsigned long flags;
+
+	printk("ncr53c8xx_reset: pid=%lu reset_flags=%x serial_number=%ld serial_number_at_timeout=%ld\n",
+		cmd->pid, reset_flags, cmd->serial_number, cmd->serial_number_at_timeout);
+
+	save_flags(flags); cli();
+#if 1
+	/*
+	 * We have to just ignore reset requests in some situations.
+	 * Seems the mid-level driver sometimes is stuttering too much.
+	 */
+#if defined SCSI_RESET_NOT_RUNNING
+	if (cmd->serial_number != cmd->serial_number_at_timeout) {
+		sts = SCSI_RESET_NOT_RUNNING;
+		goto out;
+	}
+#endif
+	/*
+	 * If the mid-level driver told us reset is synchronous, it seems 
+	 * that we must call the done() callback for the involved command, 
+	 * even if this command was not queued to the low-level driver, 
+	 * before returning SCSI_RESET_SUCCESS.
+	 */
+#endif
+	sts = ncr_reset_bus(cmd,
+	(reset_flags & (SCSI_RESET_SYNCHRONOUS | SCSI_RESET_ASYNCHRONOUS)) == SCSI_RESET_SYNCHRONOUS);
+	/*
+	 * Since we always reset the controller, when we return success, 
+	 * we add this information to the return code.
+	 */
+#if defined SCSI_RESET_HOST_RESET
+	if (sts == SCSI_RESET_SUCCESS)
+		sts |= SCSI_RESET_HOST_RESET;
+#endif
+
+out:
+	restore_flags(flags);
+	return sts;
+}
 #else
 int ncr53c8xx_reset(Scsi_Cmnd *cmd)
-#endif
 {
 	printk("ncr53c8xx_reset: command pid %lu\n", cmd->pid);
-	return ncr_reset_bus(cmd);
+	return ncr_reset_bus(cmd, 1);	/* keep pre-1.18e behaviour */
 }
+#endif
 
 /*
 **   Linux entry point of abort() function
 */
 
+#if defined SCSI_RESET_SYNCHRONOUS && defined SCSI_RESET_ASYNCHRONOUS
+
+int ncr53c8xx_abort(Scsi_Cmnd *cmd)
+{
+	int sts;
+	unsigned long flags;
+
+	printk("ncr53c8xx_abort: pid=%lu serial_number=%ld serial_number_at_timeout=%ld\n",
+		cmd->pid, cmd->serial_number, cmd->serial_number_at_timeout);
+
+	save_flags(flags); cli();
+#if 1
+	/*
+	 * We have to just ignore abort requests in some situations.
+	 * Seems the mid-level driver sometimes is stuttering too much.
+	 */
+	if (cmd->serial_number != cmd->serial_number_at_timeout) {
+		sts = SCSI_ABORT_NOT_RUNNING;
+		goto out;
+	}
+#endif
+	sts = ncr_abort_command(cmd);
+out:
+	restore_flags(flags);
+	return sts;
+}
+#else
 int ncr53c8xx_abort(Scsi_Cmnd *cmd)
 {
 	printk("ncr53c8xx_abort: command pid %lu\n", cmd->pid);
 	return ncr_abort_command(cmd);
 }
+#endif
 
 #ifdef MODULE
 int ncr53c8xx_release(struct Scsi_Host *host)
@@ -8873,6 +9040,10 @@
 		uc->cmd = UC_SETFLAG;
 	else if	((arg_len = is_keyword(ptr, len, "clearprof")) != 0)
 		uc->cmd = UC_CLEARPROF;
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	else if	((arg_len = is_keyword(ptr, len, "debug_error_recovery")) != 0)
+		uc->cmd = UC_DEBUG_ERROR_RECOVERY;
+#endif
 	else
 		arg_len = 0;
 
@@ -8973,6 +9144,22 @@
 			ptr += arg_len; len -= arg_len;
 		}
 		break;
+#ifdef	SCSI_NCR_DEBUG_ERROR_RECOVERY
+	case UC_DEBUG_ERROR_RECOVERY:
+		SKIP_SPACES(1);
+		if	((arg_len = is_keyword(ptr, len, "sge")))
+			uc->data = 1;
+		else if	((arg_len = is_keyword(ptr, len, "abort")))
+			uc->data = 2;
+		else if	((arg_len = is_keyword(ptr, len, "reset")))
+			uc->data = 3;
+		else if	((arg_len = is_keyword(ptr, len, "none")))
+			uc->data = 0;
+		else
+			return -EINVAL;
+		ptr += arg_len; len -= arg_len;
+		break;
+#endif
 	default:
 		break;
 	}
----------------------------- Cut Here ----------------------------------
