Re: Hanging problem...

Linus Torvalds (Linus.Torvalds@cs.helsinki.fi)
Mon, 30 Oct 1995 07:24:52 +0200


Jim Paradis: "Hanging problem..." (Oct 28, 1:49):
>
> I've been getting mysterious hangs whenever I bang *hard* on the disk
> (e.g. when I take 30+Mb of stuff and crunch it down into 8 floppy
> images...). This is, to say the least, annoying.

Agreed. I've seen this kind of hang mainly when some process is killed
while in the kernel (bad kernel pointer dereference or similar) and
leaves a buffer on the wrong queue.

> I've seen this mainly on the IDE disk (e.g. moving stuff to the SCSI
> disk often alleviates the problem), but I coulda sworn I saw it once or
> twice when I've been just banging on the SCSI disk.
>
> The system doesn't *completely* hang; I can do things that don't use
> the affected drive. Anything that does (including a sync) hangs forever.
>
> Hitting SHIFT+SCROLLOCK when this happens reveals that in every case
> there is exactly *one* buffer that's locked... so I think there's either
> a deadlock or some code path that's not releasing a buffer when it
> should... I'm not sure this is an Alpha-specific problem either...

It probably _isn't_ alpha-specific. It might just show up more clearly
on the alpha for some reason.

(well, it _could_ be due to differences in irq handling or something
like that on the alpha, but I don't think so).

> I backtracked through David M-T's wonderful collection of prebuilt
> kernels; the problem doesn't appear in 1.3.27 but does appear in 1.3.31
> and later (I've tried all the way through .36). I suppose I could
> look at the diffs, but I was wondering if anyone had any ideas off
> the top of their head...

Well, 1.3.28 changed the internal representation of a "device number"
(so that the kernel internally uses a "kdev_t" rather than the "dev_t").
It also had some other cleanups in device handling that may or may not
have introduced problems. The code _should_ be equivalent to the 1.3.27
code, though.
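
To illustrate the dev_t/kdev_t split, here is a rough sketch in Python
(not kernel code). It assumes the traditional 16-bit device number with
an 8-bit major and an 8-bit minor; the names `KDev`, `to_kdev` and
`from_kdev` are made up for the example and are not the kernel's actual
interface. The point is only that the external number is converted once
at the boundary and the internal type is used everywhere else.

```python
from typing import NamedTuple

MINOR_BITS = 8                      # assumption: 8-bit minor, 8-bit major
MINOR_MASK = (1 << MINOR_BITS) - 1


class KDev(NamedTuple):
    """Hypothetical internal device handle (stands in for a kdev_t)."""
    major: int
    minor: int


def to_kdev(dev: int) -> KDev:
    """Convert an external 16-bit device number to the internal form."""
    if not 0 <= dev <= 0xFFFF:
        raise ValueError("device number out of 16-bit range")
    return KDev(dev >> MINOR_BITS, dev & MINOR_MASK)


def from_kdev(kdev: KDev) -> int:
    """Convert the internal form back to the external device number."""
    return (kdev.major << MINOR_BITS) | kdev.minor
```

Under these assumptions `to_kdev(0x0301)` gives `KDev(major=3, minor=1)`,
and `from_kdev` turns it back into `0x0301`.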

1.3.29 changed the "mem_map[]" to be a structure, but that shouldn't
matter.

1.3.30 and 1.3.31 shouldn't have changed the buffer handling, although
there might have been driver changes.

I haven't seen the behaviour you mention: I'm using a pre-1.3.38 kernel
on my Cabriolet, and I've been using 1.3.3x kernels on this machine the
whole time. But I have to admit that I haven't used the floppy much,
and maybe what I've been doing can't be called heavy-duty (mostly
compilations while under X11 etc). I certainly haven't seen the
problems.

I'll try to come up with some idea, but as I'll be away in Romania for
the rest of the week starting early tomorrow I suspect I can't help
much. If you can reasonably easily repeat this, can you tell exactly
which kernel it is that starts showing the problem?
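
Pinning down the first bad kernel is a plain binary search over the
prebuilt versions, assuming the hang is reproducible and that every
kernel after the first bad one is also bad. A minimal sketch in Python;
`first_bad` and the `hangs` callback are hypothetical names, and `hangs`
stands for whatever manual test is used (boot the kernel, run the disk
workload, see whether it hangs):

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], hangs: Callable[[str], bool]) -> str:
    """Return the earliest version for which hangs() is true.

    Assumes versions is ordered oldest to newest, the first entry is
    known good and the last entry is known bad.
    """
    if len(versions) < 2:
        raise ValueError("need at least one good and one bad version")
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if hangs(versions[mid]):
            bad = mid        # first bad version is at mid or earlier
        else:
            good = mid       # first bad version is after mid
    return versions[bad]


# 1.3.27 is reported good and 1.3.31 bad, so only these need checking.
candidates = ["1.3.27", "1.3.28", "1.3.29", "1.3.30", "1.3.31"]
```

With these five candidates the search needs at most two test boots:
1.3.29 first, then either 1.3.28 or 1.3.30 depending on the result.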

Linus