Re: Fix for thread+network crashes in 2.0/2.1?

Just a Tree (funaho@jurai.org)
Thu, 26 Feb 1998 19:09:24 -0500 (EST)


On Thu, 26 Feb 1998, Jeff Garzik wrote:

> Hi guys,
>
> Has anyone done any further investigation into the serious
> networking+threads crash problem in 2.0 and 2.1?
> (reported by bill@highwind.com, confirmed by Alan Cox)

I've been poking around at it for nearly a day straight. Unfortunately I'm
not a kernel hacker, so I'm a bit hampered.

> This bug is forcing several of my clients to look for Linux alternatives, in
> one case they are a Linux-only shop and really want to stick with Linux. :(

> Does anyone have any technical details on *why* this is happening? I can
> attempt a fix but only if I know what to fix, and where to fix it. :)

I will share what little I've found. As Bill reported, most of the time
the machine simply freezes. However, when it -doesn't- freeze, it
generates an oops, and it's always in the same place:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
current->tss.cr3 = 036df000, %cr3 = 036df000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c012ba56>]
EFLAGS: 00010007
eax: 00000000 ebx: c332a000 ecx: c38b6f68 edx: 00000000
esi: 00000297 edi: c3339f68 ebp: 00000001 esp: c3339f34
ds: 0018 es: 0018 ss: 0018
Process crash (pid: 382, process nr: 41, stackpage=c3339000)
Stack: c332b000 00000020 00000006 c012bcca c3339f68 c332b280 00000001 c332b284
c332b000 c34572c0 c3338000 c3743b7c 00000000 00000000 c332a000 c012c015
00000006 c332b000 00012700 c3338000 00000000 bedfeabc bedfece0 c0175ab1
Call Trace: [<c012bcca>] [<c012c015>] [<c0175ab1>] [<c0109b02>]
Code: 8b 42 04 39 d8 75 f7 89 4a 04 56 9d 83 3f 00 75 d9 5b 5e 5f

Using `/boot/System.map' to map addresses to symbols.

>>EIP: c012ba56 <free_wait+2e/44>
Trace: c012bcca <do_select+1ba/1d4>
Trace: c012c015 <sys_select+331/4b4>
Trace: c0175ab1 <sys_socketcall+155/248>
Trace: c0109b02 <system_call+3a/40>
Code: c012ba56 <free_wait+2e/44>
Code: c012ba56 <free_wait+2e/44> 8b 42 04 movl 0x4(%edx),%eax
Code: c012ba59 <free_wait+31/44> 39 d8 cmpl %ebx,%eax
Code: c012ba5b <free_wait+33/44> 75 f7 jne fffffffe <_EIP+fffffffe>
Code: c012ba5d <free_wait+35/44> 89 4a 04 movl %ecx,0x4(%edx)
Code: c012ba66 <free_wait+3e/44> 56 pushl %esi
Code: c012ba67 <free_wait+3f/44> 9d popf
Code: c012ba68 <free_wait+40/44> 83 3f 00 cmpl $0x0,(%edi)
Code: c012ba6b <free_wait+43/44> 75 d9 jne ffffffea <_EIP+ffffffea>
Code: c012ba6d <max_select_fd+1/a4> 5b popl %ebx
Code: c012ba6e <max_select_fd+2/a4> 5e popl %esi
Code: c012ba6f <max_select_fd+3/a4> 5f popl %edi

(ksymoops is getting confused about the instruction lengths; all of this
code, including the lines it labels max_select_fd, is actually in
free_wait(), which is in fs/select.c.)
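
Reading the first decoded instruction against the register dump: edx is
00000000 and the load is 0x4(%edx), which accounts for the faulting address
00000004. Below is a small sketch of that arithmetic. It assumes (not
confirmed in this post) that the entry being walked is two 32-bit pointers
with the link in the second slot; struct wq32 and fault_address() are names
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed i386 layout of one wait queue entry: two 4-byte pointers.
 * uint32_t stands in for a 32-bit pointer so the offsets are the same
 * no matter what machine this is compiled on. */
struct wq32 {
    uint32_t task;  /* assumed: owning task pointer */
    uint32_t next;  /* assumed: link to the next entry */
};

/* Under the assumed layout, the link field sits 4 bytes into the entry. */
static_assert(offsetof(struct wq32, next) == 4,
              "link field expected at offset 4");

/* Address touched by "movl 0x4(%edx),%eax" for a given %edx value. */
static uint32_t fault_address(uint32_t edx)
{
    return edx + (uint32_t)offsetof(struct wq32, next);
}
```

With edx = 0 this gives 4, matching the oops. A garbage value in edx would
fault at that value plus 4, at the same instruction.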

Essentially it looks like a corrupted wait_queue. I'm currently in the
process of hacking some safety checks into add_wait_queue() and
remove_wait_queue() to try to find where the problem is occurring, but so
far I haven't had any luck.
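
To illustrate the kind of safety check meant here, a minimal user-space
sketch follows. It is not the kernel code: it assumes the queue is a
circular, singly linked list in which the last entry points back to the
first, and every name in it (wq_add, wq_check_ring, wq_remove_checked,
WQ_MAX_WALK) is hypothetical. It can only catch a NULL link or a ring that
never comes back around; an arbitrary garbage pointer would still fault.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for a wait queue entry (assumed shape). */
struct wait_queue {
    void *task;               /* owner; unused in this sketch */
    struct wait_queue *next;  /* next entry; the last one points to the first */
};

#define WQ_MAX_WALK 1024      /* arbitrary bound on how far we will walk */

/* Return 0 if following ->next from wait leads back to wait, -1 otherwise. */
static int wq_check_ring(const struct wait_queue *wait)
{
    const struct wait_queue *tmp = wait;
    int i;

    if (wait == NULL)
        return -1;
    for (i = 0; i < WQ_MAX_WALK; i++) {
        if (tmp->next == NULL)
            return -1;        /* broken link: the case seen in the oops */
        if (tmp->next == wait)
            return 0;         /* ring closes properly */
        tmp = tmp->next;
    }
    return -1;                /* ring never returns to wait */
}

/* Insert wait right after the head entry, or start a one-entry ring. */
static void wq_add(struct wait_queue **head, struct wait_queue *wait)
{
    if (*head == NULL) {
        wait->next = wait;
        *head = wait;
    } else {
        wait->next = (*head)->next;
        (*head)->next = wait;
    }
}

/* Unlink wait only if its ring is intact; return -1 and touch nothing
 * if the check fails. */
static int wq_remove_checked(struct wait_queue **head, struct wait_queue *wait)
{
    struct wait_queue *tmp;

    if (head == NULL || wq_check_ring(wait) != 0)
        return -1;
    if (wait->next == wait) {            /* sole entry in its ring */
        if (*head == wait)
            *head = NULL;
    } else {
        tmp = wait;
        while (tmp->next != wait)        /* find the predecessor */
            tmp = tmp->next;
        tmp->next = wait->next;
        if (*head == wait)
            *head = wait->next;
    }
    wait->next = NULL;
    return 0;
}
```

In a real kernel the failure branch would presumably printk() the caller's
address and the bad entry rather than just return -1, so the corrupting
path can be identified.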

Note that while this case is a NULL pointer deref, I've also had just
plain invalid pointers happen. I only catch these when they're outside of
kernel space and cause an invalid paging request. When they do occur,
though, they still crash at the same instruction in free_wait().

-- 
funaho@jurai.org             | I'm a man who's sick, but I got class
http://www.jurai.org/~funaho | Cuz you only got respect when you're kickin' ass
                             |              - KMFDM
