Re: [patch] IDE problems on SMP, fixed? (fwd)

Gadi Oxman (gadio@netvision.net.il)
Wed, 29 Jul 1998 22:10:08 +0400 (IDT)


Hi,

> > under very heavy (artificial) load, i can reliably make my system fail, it
> > repeatedly fails to release a very important irq spinlock, causing a hard
> > hang. The case where this happens is _always_ when an (arbitrary)
> > interrupt hits us after ide.c's ide__sti() in ide.c:start_request().
>
> as an interesting turn of events, my system is now very stable with the
> attached patch applied.
>
> could any IDE-expert (Gadi, Mark?) verify why this small fix makes a
> difference on this SMP board? That sti() has been in ide_do_request() for
> ages.

Between the two places we only have a couple of if()'s that do no real
work and assignments to local variables on the stack.

However, just prior to start_request() we release the io_request_lock
spinlock. It would be worth verifying whether the system then locks up
while trying to take io_request_lock from another interrupt handler
(SCSI, for example). If it does, that would point to a big (probably
hardware?) problem, in which the release of a spinlock is not actually
seen by the interrupt code that tries to take it immediately afterwards.
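
To make the expected behaviour concrete, here is a minimal user-space
sketch of the release/acquire contract that a spinlock is supposed to
provide. It is not the kernel's spinlock code: the sketch_* names are
hypothetical and C11 atomics stand in for the architecture-specific
primitives. Once the unlock has completed, the very next attempt to take
the lock must succeed; the lockup described above would mean that this
does not hold on the affected board.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a kernel spinlock. */
typedef struct {
    atomic_bool locked;
} sketch_spinlock_t;

static void sketch_spin_init(sketch_spinlock_t *l)
{
    atomic_init(&l->locked, false);
}

/* Try once to take the lock; returns true if we now own it. */
static bool sketch_spin_trylock(sketch_spinlock_t *l)
{
    /* The exchange returns the previous value: false means it was free. */
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

/* Spin until the lock is ours. */
static void sketch_spin_lock(sketch_spinlock_t *l)
{
    while (!sketch_spin_trylock(l))
        ; /* busy-wait */
}

/* Release the lock; the store must become visible to other CPUs. */
static void sketch_spin_unlock(sketch_spinlock_t *l)
{
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed));
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A test along these lines would take the lock, check that a second
trylock fails, unlock, and then check that trylock succeeds again.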

> The fix is arbitrary, i have a RAM module that's known to be bad
> (although i never have any problems with it on 66MHz system bus, except
> this single lockup). I've traced down the lockup and have 'fixed' the case
> by moving the sti(), but no other thinking was behind this change ...
>
> i do not really understand though why this sti() is considered safe on
> SMP, as we have dropped the io_request_lock already, and we have dropped
> the hwif->lock too, so we are just asking for trouble on another CPU, is
> my thinking correct that at this point another CPU could add a request to
> this hwif? Or is some other lock (hwif->busy?) handling this case already.

ide__sti() is performed on the local CPU only. The original reason for it
is to give a (uni-processor) system an opportunity to take a pending
interrupt at least between IDE requests, even if we don't allow
interrupts to be serviced during the IDE interrupt handler.

At this point, another CPU can add another request to the queue, but
that's ok; ll_rw_blk.c will skip the currently active request and will
not change it for as long as it is being processed.
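
As a rough illustration of that rule, here is a simplified user-space
sketch. It is not the actual ll_rw_blk.c code: the sketch_* types, the
head_active flag and the ascending-sector ordering are all assumptions
made for the example. The only point is that a new request is never
linked in front of the request the driver is currently working on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request and queue types. */
struct sketch_request {
    int sector;
    struct sketch_request *next;
};

struct sketch_queue {
    struct sketch_request *head;
    int head_active;   /* nonzero while the driver is servicing head */
};

/* Insert rq in ascending sector order, never ahead of an active head. */
static void sketch_add_request(struct sketch_queue *q,
                               struct sketch_request *rq)
{
    struct sketch_request **link = &q->head;

    assert(q != NULL && rq != NULL);

    /* Skip the active request: it must stay first and untouched. */
    if (q->head != NULL && q->head_active)
        link = &q->head->next;

    /* Walk to the first entry with a larger sector number. */
    while (*link != NULL && (*link)->sector <= rq->sector)
        link = &(*link)->next;

    rq->next = *link;
    *link = rq;
}
```

With an active head at sector 100, adding requests for sectors 50 and 70
leaves the head in place and queues the new requests behind it; with an
inactive head, the sector 50 request would go to the front instead.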

Gadi

> to get things right, this is a fairly standard configuration, good'ole
> Quantum FB 1280ATA, PIIX4. The lockup was independent of _any_ BIOS
> setting (Passive Release, etc.). Maybe this is an IDE problem after all?
>
> (while writing this email, the stress-test was still running, this is with
> a DMA-enabled kernel, and the _very same_ hardware and software
> configuration caused a lockup after 20 seconds without the patch
> installed. With the patch installed i got no lockup after 30 minutes of
> uptime. It still might be hardware problems, although i have 3 fans and no
> other problem has occurred on this system so far, only this IDE lockup)
>
> -- mingo
>
> --- linux/drivers/block/ide.c.orig Wed Jul 29 15:09:18 1998
> +++ linux/drivers/block/ide.c Wed Jul 29 15:09:44 1998
> @@ -961,7 +961,6 @@
> unsigned int minor = MINOR(rq->rq_dev), unit = minor >> PARTN_BITS;
> ide_hwif_t *hwif = HWIF(drive);
>
> - ide__sti(); /* local CPU only */
> #ifdef DEBUG
> printk("%s: start_request: current=0x%08lx\n", hwif->name, (unsigned long) rq);
> #endif
> @@ -988,6 +987,7 @@
> block = 1; /* redirect MBR access to EZ-Drive partn table */
> #endif /* FAKE_FDISK_FOR_EZDRIVE */
> #if (DISK_RECOVERY_TIME > 0)
> + ide__sti(); /* local CPU only */
> while ((read_timer() - hwif->last_time) < DISK_RECOVERY_TIME);
> #endif
> SELECT_DRIVE(hwif, drive);
>
>
