Re: Fix for thread+network crashes in 2.0/2.1?

Linus Torvalds (torvalds@transmeta.com)
27 Feb 1998 04:33:39 GMT


[ Cc'd to people that I found had at some time reported a problem in
free_wait() - it may not be related, but it's worth checking out ]

In article <Pine.LNX.3.96.980226190520.6450A-100000@washu.jurai.org>,
Just a Tree <funaho@jurai.org> wrote:
>
>I've been poking around on it for nearly a day straight, unfortunately I'm
>not a kernel hacker so I'm a bit hampered.

You were still pretty helpful, thanks. I don't have glibc installed, so
I hadn't even tested the program, but your debugging appears to be
enough to find out what's up. At least I have a very strong suspicion,
and a potential patch (totally untested, but I hope there will be people
willing to do that for me).

>> This bug is forcing several of my clients to look for Linux alternatives, in
>> one case they are a Linux-only shop and really want to stick with Linux. :(
>
>> Does anyone have any technical details on *why* this is happening? I can
>> attempt a fix but only if I know what to fix, and where to fix it. :)

The bug seems to be that the program is closing a file that another
thread is currently selecting on, and the select code will get rather
unhappy when the file just suddenly disappears from under it.

We already have the support code to handle this for most normal
operations, notably read and write. "select()" wasn't protected against
this, though.
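
[ Illustration: a minimal user-space sketch of the general idea of
protecting a file against a concurrent close, assuming a simple
reference-count scheme. The names demo_file, demo_get and demo_put are
hypothetical and are not kernel APIs, and this is not necessarily how
the actual patch works. ]

```c
/* Hypothetical stand-in for a kernel file object (not the real struct file). */
struct demo_file {
    int refcount;  /* number of users currently holding the object */
    int freed;     /* demo only: set when the last reference is dropped */
};

/* Take an extra reference so the object cannot go away under the caller. */
static struct demo_file *demo_get(struct demo_file *f)
{
    f->refcount++;
    return f;
}

/* Drop a reference; the object is released only when the count hits zero. */
static void demo_put(struct demo_file *f)
{
    if (--f->refcount == 0)
        f->freed = 1;  /* real code would release the memory here */
}

/*
 * Model of "close while another thread is selecting":
 * returns 1 if the file stayed alive for the whole wait and was
 * released only after the waiter was done with it.
 */
int demo_close_during_select(void)
{
    struct demo_file f = { 1, 0 };  /* one reference, held by the fd table */
    int alive_while_waiting;

    demo_get(&f);                   /* the selecting thread pins the file */
    demo_put(&f);                   /* the other thread closes the fd */
    alive_while_waiting = !f.freed; /* waiter can still use its wait data */
    demo_put(&f);                   /* waiter is done and drops its pin */

    return alive_while_waiting && f.freed;
}
```

With the extra reference held across the wait, the close only drops the
count, and the object goes away when the waiter is finished with it.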

>Call Trace: [<c012bcca>] [<c012c015>] [<c0175ab1>] [<c0109b02>]
>Code: 8b 42 04 39 d8 75 f7 89 4a 04 56 9d 83 3f 00 75 d9 5b 5e 5f
>
>Using `/boot/System.map' to map addresses to symbols.
>
>>>EIP: c012ba56 <free_wait+2e/44>
>Trace: c012bcca <do_select+1ba/1d4>
>Trace: c012c015 <sys_select+331/4b4>
>Trace: c0175ab1 <sys_socketcall+155/248>
>Trace: c0109b02 <system_call+3a/40>
>Code: c012ba56 <free_wait+2e/44>
>Code: c012ba56 <free_wait+2e/44> 8b 42 04 movl 0x4(%edx),%eax
>Code: c012ba59 <free_wait+31/44> 39 d8 cmpl %ebx,%eax
>Code: c012ba5b <free_wait+33/44> 75 f7 jne fffffffe <_EIP+fffffffe>
>Code: c012ba5d <free_wait+35/44> 89 4a 04 movl %ecx,0x4(%edx)
>Code: c012ba66 <free_wait+3e/44> 56 pushl %esi

The wait queue has gotten corrupted, because the other thread has closed
the file and released all the data structures associated with it,
including the waiting information. When the waiter wakes up and removes
himself from the queue, he will hit the corruption and die..
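
[ Illustration: a simplified user-space model of removing a waiter from
a circular, singly linked wait queue. The demo_* names are hypothetical
and the layout is an assumption, not the kernel's actual structures.
The point is that removal has to walk the list to find the predecessor,
so it depends on every entry still being valid memory. ]

```c
#include <stddef.h>

/* Simplified, hypothetical wait queue entry. */
struct demo_wait {
    int task_id;
    struct demo_wait *next;
};

/* An empty queue is a head entry that points back at itself. */
void demo_init(struct demo_wait *head)
{
    head->task_id = -1;
    head->next = head;
}

/* Link a new waiter in right after the head. */
void demo_add(struct demo_wait *head, struct demo_wait *w)
{
    w->next = head->next;
    head->next = w;
}

/* Unlink a waiter by searching for the entry that points at it. */
void demo_remove(struct demo_wait *head, struct demo_wait *w)
{
    struct demo_wait *prev = head;

    /*
     * If the queue's memory has been released and reused behind our
     * back, this walk follows stale pointers and never finds w.
     */
    while (prev->next != w)
        prev = prev->next;
    prev->next = w->next;
    w->next = NULL;
}

/* Count the waiters currently on the queue (the head is not counted). */
int demo_length(const struct demo_wait *head)
{
    const struct demo_wait *p;
    int n = 0;

    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

On an intact list the walk terminates at the predecessor. The
compare-and-loop in the disassembly quoted above looks like this kind
of search running over memory that is no longer a valid queue.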

I actually knew of this bug, I just hadn't thought it through: I thought
the bug was nasty to fix but essentially harmless. Obviously
that wasn't so..

Anyway, I'm releasing a 2.1.89-3 on ftp.kernel.org under the "testing"
subdirectory, and I'd be very happy if people would test it. I don't
guarantee that this patch works at all - for all I know it might be
totally broken, and the only thing I guarantee is that (a) it compiles
with my particular setup and (b) it looks like it should work and makes
sense.

Please do test, and comment,

Linus
