Re: FS Corruption in 2.1.109 (fwd)

Richard B. Johnson (root@chaos.analogic.com)
Thu, 30 Jul 1998 09:28:07 -0400 (EDT)


On Thu, 30 Jul 1998, MOLNAR Ingo wrote:

>
> On Wed, 29 Jul 1998, Alan Cox wrote:
>
> > So for the non SMP case I think its solved and Linus can have a small gold
> > star. SMP alone has no obvious pattern at all (except SMP 8( )
>
> could you tell all people on your list seeing problems on SMP to try my
> patch, does it make a difference?
>
> for my box, the 'requirement' for stability seems to be that the
> 'disable_irq()+start_request()+enable_irq()' trio (ide.c:line 1130) has to
> be atomic. If any sti disrupts them anywhere, i get a lockup sooner or
> later.
>
> -- mingo
>

Info:
About a year ago, I remember problems being reported with IDE, even
with IDE that used PIO (no DMA). The "consensus" was that even though
PIO was interruptible, it should not be allowed to be interrupted
in the kernel. Of course, if access to a device is made atomic by
other means (locking, etc.), you should be able to interrupt PIO
at will, the only cost being reduced throughput because the
CPU is being shared.

Now we have UDMA (Ugly DMA). In principle, once the chip is programmed,
the CPU is available for other uses while the DMA completes. Looking
at the sources, however, I see that the CPU is really just polling for
completion (yes, the interrupts are enabled so something else could
use the CPU). Any locking problem that prevents these I/O operations
from being atomic can cause trouble.

Since enabling interrupts during the PIO operation was a bad_thing(tm),
and enabling interrupts during the UDMA operation is even worse, my
first guess is that there is a problem with locking; my last guess
would be an actual hardware problem.

The IDE code, as well as a lot of other code in the kernel, has become
very complex over the years. A lot of the mucking with interrupts
could probably be handled with a simple lock on each of the procedures
that must be atomic.

At my company, we developed an operating system that, on average,
gets interrupted 4,280 times per second (data link) and 2,048 times
per second (timer channel 0, context switcher), plus various network,
SCSI, etc., interrupts. The machine spends most of its time (80%)
handling interrupts. We could not afford to disable/enable interrupts.

Therefore we have a simple locking mechanism:

unsigned int global_lock_word=0;
unsigned int critical_procedure_lock=0;

int critical_procedure(params)
{
    unsigned int lock;

    lock = global_lock_word++;          /* Pick a unique number */
    if(lock == 0)                       /* We got unlucky, try again */
        lock = global_lock_word++;
    critical_procedure_lock += lock;    /* Sum to whatever is there */
    while(critical_procedure_lock != lock)
        sched();                        /* Wait until we own it */

    ..............                      /* Do critical code */
    ..............

    critical_procedure_lock -= lock;    /* Release the lock */
}

The unique number could be a pseudo-random number, but you don't
need it because there will not be (2^32)-1 things trying to use
a shared resource at the same time.

Now, nothing needed to acquire the lock has to be atomic itself.
There _is_ a possibility of a deadlock because we may never see
the instant at which critical_procedure_lock == lock; however, if
we "think" we own the resource, we truly do. Deadlocks get fixed
by releasing the lock and trying again. Eventually you will acquire
the resource, and if everybody follows the same rules, you don't
need to disable interrupts for anything.
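
For anyone who wants to experiment with the idea, here is a minimal
user-space sketch of the "release the lock and try again" step, which
the fragment above does not show. It is only a sketch under stated
assumptions, not the code we run: the names sum_lock, sum_lock_try,
etc. and the spin limit are made up for illustration, and sched_yield()
stands in for sched(). It also departs from the text in one respect:
it uses C11 atomic add/subtract, because C itself does not promise
that a plain += or ++ is a single uninterruptible step, even where a
given compiler happens to emit one instruction for it.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; only the sum-of-tickets idea comes from the text. */
typedef struct {
    atomic_uint next_ticket;   /* plays the role of global_lock_word */
    atomic_uint sum;           /* plays the role of critical_procedure_lock */
} sum_lock;

void sum_lock_init(sum_lock *l)
{
    atomic_init(&l->next_ticket, 0u);
    atomic_init(&l->sum, 0u);
}

/* Pick a nonzero number; a zero ticket could never be told apart
   from "nobody holds the lock", so draw again if we get one. */
unsigned sum_lock_ticket(sum_lock *l)
{
    unsigned t = atomic_fetch_add(&l->next_ticket, 1u);
    if (t == 0)
        t = atomic_fetch_add(&l->next_ticket, 1u);
    return t;
}

/* One attempt: add our ticket to the sum and poll a bounded number
   of times. If the sum never equals our ticket, back out so that the
   other waiters are not stuck behind our contribution. */
bool sum_lock_try(sum_lock *l, unsigned ticket, unsigned max_spins)
{
    atomic_fetch_add(&l->sum, ticket);
    for (unsigned i = 0; i < max_spins; i++) {
        if (atomic_load(&l->sum) == ticket)
            return true;               /* only our ticket is in the sum */
    }
    atomic_fetch_sub(&l->sum, ticket); /* release and let caller retry */
    return false;
}

void sum_lock_release(sum_lock *l, unsigned ticket)
{
    assert(ticket != 0);
    atomic_fetch_sub(&l->sum, ticket);
}

/* Blocking acquire: retry until an attempt succeeds.
   The spin limit of 1000 is an arbitrary choice. */
unsigned sum_lock_acquire(sum_lock *l)
{
    unsigned ticket = sum_lock_ticket(l);
    while (!sum_lock_try(l, ticket, 1000u))
        sched_yield();
    return ticket;
}
```

A caller would keep the value returned by sum_lock_acquire() and pass
it to sum_lock_release() when the critical code is done. The sketch
makes no attempt at fairness: two waiters can keep backing out in step
with each other for a while, which is the price of not queueing.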

Cheers,
Dick Johnson
***** FILE SYSTEM MODIFIED *****
Penguin : Linux version 2.1.111 on an i586 machine (66.15 BogoMips).
Warning : It's hard to remain at the trailing edge of technology.
