Re: 2.0.30 crash in lock_remove_locks (cont'd)

Carlo Wood (carlo@runaway.xs4all.nl)
Fri, 16 May 1997 20:12:44 +0200 (MET DST)


| On Thu, 15 May 1997, Carlo Wood wrote:
|
| > Hi again,
| >
| > I think I also know which part of the patches is the problem *shrug* :
| > the new ones :). That is, I need to run several irc daemons with
| > each more than 2048 clients (for testing ircu2.10), so I tried to increase
| > the number of fd's per process to 4096.
| >
| Hi Carlo,
|
| Yeah, but you forgot to change __FD_SETSIZE in include/linux/posix_types.h
| and /usr/include/gnu/types.h for one.

I did the O'Reilly patch by hand already, and thus also changed __FD_SETSIZE
in include/linux/posix_types.h now.
I'll include the complete patch below (relative to 2.0.30).

| BTW gurus: isn't it dangerous having NR_OPEN and __FD_SETSIZE defined in
| multiple places? I noticed that recent 2.0.x and 2.1.x kernels have __FD_SETSIZE
| defined as 1024. /usr/include/gnu/types.h has it defined as 256.

If I do not touch __FD_SETSIZE, it still gets defined the same as NR_OPEN;
this is done by #ifdef's in some header file (I am too lazy to look it up),
but I am 100% sure. I used this program to test it:

>echo '#include <time.h>
#include <stdio.h>
main(){printf("%d\n",FD_SETSIZE);}' > t.c;gcc t.c -o t;./t;rm t t.c

(cut&paste on the prompt). It gives the #defined FD_SETSIZE, which is
4096 even if you DON'T change __FD_SETSIZE...

| > I think that the kernel crash is still a bug. I also saw once a patch
| > for 2.0.29 that made it possible to do this (this huge amount of fd's), but
| > this patch didn't apply to 2.0.30 : all hunks but one failed. I hoped it
| > was solved in another way.
| >
|
| Increasing # of fd's without the malloc patch creates a stack size problem.
| limited_fd_set is 32 bytes at 256 fd's.. 512 at 4096. Sys_select allocates
| 3 of these on the stack.. 1.5k of stack usage for 1 function is pretty much
| a guaranteed overflow. That's where I think your crash comes from. Also,
| I can say that because I did that once too :) with the same result.. kaboom.

There are reports that select() fails (does not return even though sockets are
readable with new data) when more than about 512 fd's are involved.
We would very much like to know what causes this, because it still makes it
impossible to use Linux for a production ircd on a large net.

| Your change to struct cmsghdr is not needed. The cmsg_data[0] should be a
| zero sized array. What gcc does with this kind of construct is.. when you
| dereference this pointer, it points to the next available memory cell of the
| type you allocated *at the time you allocated it*. This works fine as long
| as the data is aligned (true in this case) or if you use __attribute__((aligned)).
| I thoroughly tested this the first time it came up out of curiosity/
| needless_worry.

I changed it because it caused a warning that zero sized arrays are not ANSI C++
(or ANSI C?). And I don't accept warnings in my programs; I use
gcc -ansi -pedantic -Werror, so I *have* to patch 'broken' headers.

| > Please let me know what to do if I need 4096 fd's per process (and a LOT
| > more for the total system, for which I thought I can use /proc/kernel).
| >
| > Carlo
| >
|
| The O'Reilly patch goes in easily though you must do it manually. I have
| it in my 2.0.kinda_sorta.31 tree, and have been using it since it came out.
|
| Bottom line is that this breaks one *hell* of a lot of stuff.. better really
| need it if you start because you'll end up recompiling nearly every library/
| binary in your system to get things straight again. BTDT: reverted to 256 :)

Indeed, I needed to recompile every program that uses select() or FD_ZERO,
and also strace, to use it. But I did not (yet) recompile libc; is that needed
too?

| Ciao,
|
| -Mike

Note that everything now seems to work: it compiles and I can use 4096 fd's
per process. But select() still 'hangs' and gets very slow; it doesn't look
good. For instance, I can NOT connect more than about 670 clients to an
irc server; the rest drop off after hanging in select() for a while and then
erroring on write() with ETIMEDOUT (after select() returned for another
reason).

Carlo

-- 
 carlo@runaway.xs4all.nl, Run @ IRC.

ircd development: http://www.xs4all.nl/~carlo17/ircd-dev

==================Used patch===============================================

---------- Forwarded message ----------
From: Jon Lewis <jlewis@INORGANIC5.FDT.NET>

Attached is my copy of Michael O'Reilly's patch to kmalloc fd sets so you can have lots of open files per process. Along with this patch, you'll want to do something like:

echo 8192 >/proc/sys/kernel/file-max
echo 24576 >/proc/sys/kernel/inode-max

in your rc scripts. If you have multiple processes needing lots of files, you might want even bigger numbers above.

I've been running this patch in 2.0.29 for a few weeks in my IRC server, and have seen no serious problems. Some fine tuning in /proc/sys/vm/* would probably be helpful though.

------------------------------------------------------------------
 Jon Lewis <jlewis@fdt.net>  |  Unsolicited commercial e-mail will
 Network Administrator       |  be proof-read for $199/hr.
________Finger jlewis@inorganic5.fdt.net for PGP public key_______

Modified by Carlo Wood (carlo@runaway.xs4all.nl) for linux-2.0.30

diff -rc linux-2.0.30/fs/select.c linux/fs/select.c
*** linux-2.0.30/fs/select.c	Thu May 15 15:42:52 1997
--- linux/fs/select.c	Thu May 15 18:43:05 1997
***************
*** 21,26 ****
--- 21,27 ----
  #include <linux/errno.h>
  #include <linux/personality.h>
  #include <linux/mm.h>
+ #include <linux/malloc.h>
  
  #include <asm/segment.h>
  #include <asm/system.h>
***************
*** 237,258 ****
   * Update: ERESTARTSYS breaks at least the xview clock binary, so
   * I'm trying ERESTARTNOHAND which restart only when you want to.
   */
  asmlinkage int sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp)
  {
  	int error;
! 	limited_fd_set res_in, in;
! 	limited_fd_set res_out, out;
! 	limited_fd_set res_ex, ex;
  	unsigned long timeout;
  
  	error = -EINVAL;
  	if (n < 0)
  		goto out;
  	if (n > NR_OPEN)
  		n = NR_OPEN;
! 	if ((error = get_fd_set(n, inp, &in)) ||
! 	    (error = get_fd_set(n, outp, &out)) ||
! 	    (error = get_fd_set(n, exp, &ex))) goto out;
  	timeout = ~0UL;
  	if (tvp) {
  		error = verify_area(VERIFY_WRITE, tvp, sizeof(*tvp));
--- 238,288 ----
   * Update: ERESTARTSYS breaks at least the xview clock binary, so
   * I'm trying ERESTARTNOHAND which restart only when you want to.
   */
+ 
+ #define roundbit(n, type) (((n) + sizeof(type)*8 - 1) & ~(sizeof(type)*8-1))
+ 
+ static unsigned long * save_fds[100] = {NULL, };
+ static int fds_index = 0;
+ 
  asmlinkage int sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp)
  {
+ 	unsigned long * fds = 0;
  	int error;
! 	limited_fd_set *res_in, *in;
! 	limited_fd_set *res_out, *out;
! 	limited_fd_set *res_ex, *ex;
  	unsigned long timeout;
+ 	int size;
  
  	error = -EINVAL;
  	if (n < 0)
  		goto out;
  	if (n > NR_OPEN)
  		n = NR_OPEN;
! 
! 	size = roundbit(NR_OPEN, unsigned long) / 8;
! 	if (save_fds[fds_index]) {
! 		fds = save_fds[fds_index];
! 		save_fds[fds_index] = NULL;
! 		if (fds_index > 0)
! 			--fds_index;
! 	} else {
! 		fds = kmalloc(6 * size, GFP_KERNEL);
! 	}
! 	if (!fds) {
! 		error = -ENOMEM;
! 		goto out;
! 	}
! 	in = (limited_fd_set *) fds;
! 	out = (limited_fd_set *) (((char*)fds) + size);
! 	ex = (limited_fd_set *) (((char*)fds) + size*2);
! 	res_in = (limited_fd_set *) (((char*)fds) + size*3);
! 	res_out = (limited_fd_set *) (((char*)fds) + size*4);
! 	res_ex = (limited_fd_set *) (((char*)fds) + size*5);
! 
! 	if ((error = get_fd_set(n, inp, in)) ||
! 	    (error = get_fd_set(n, outp, out)) ||
! 	    (error = get_fd_set(n, exp, ex))) goto out;
  	timeout = ~0UL;
  	if (tvp) {
  		error = verify_area(VERIFY_WRITE, tvp, sizeof(*tvp));
***************
*** 263,279 ****
  		if (timeout)
  			timeout += jiffies + 1;
  	}
! 	zero_fd_set(n, &res_in);
! 	zero_fd_set(n, &res_out);
! 	zero_fd_set(n, &res_ex);
  	current->timeout = timeout;
  	error = do_select(n,
! 		(fd_set *) &in,
! 		(fd_set *) &out,
! 		(fd_set *) &ex,
! 		(fd_set *) &res_in,
! 		(fd_set *) &res_out,
! 		(fd_set *) &res_ex);
  	timeout = current->timeout - jiffies - 1;
  	current->timeout = 0;
  	if ((long) timeout < 0)
--- 293,309 ----
  		if (timeout)
  			timeout += jiffies + 1;
  	}
! 	zero_fd_set(n, res_in);
! 	zero_fd_set(n, res_out);
! 	zero_fd_set(n, res_ex);
  	current->timeout = timeout;
  	error = do_select(n,
! 		(fd_set *) in,
! 		(fd_set *) out,
! 		(fd_set *) ex,
! 		(fd_set *) res_in,
! 		(fd_set *) res_out,
! 		(fd_set *) res_ex);
  	timeout = current->timeout - jiffies - 1;
  	current->timeout = 0;
  	if ((long) timeout < 0)
***************
*** 292,300 ****
  			goto out;
  		error = 0;
  	}
! 	set_fd_set(n, inp, &res_in);
! 	set_fd_set(n, outp, &res_out);
! 	set_fd_set(n, exp, &res_ex);
  out:
  	return error;
  }
--- 322,339 ----
  			goto out;
  		error = 0;
  	}
! 	set_fd_set(n, inp, res_in);
! 	set_fd_set(n, outp, res_out);
! 	set_fd_set(n, exp, res_ex);
  out:
+ 	if (fds) {
+ 		if (fds_index < 95) {
+ 			if (save_fds[fds_index])
+ 				++fds_index;
+ 			save_fds[fds_index] = fds;
+ 		} else {
+ 			kfree(fds);
+ 		}
+ 	}
  	return error;
  }
diff -rc linux-2.0.30/include/linux/posix_types.h linux/include/linux/posix_types.h
*** linux-2.0.30/include/linux/posix_types.h	Mon Aug 5 09:13:54 1996
--- linux/include/linux/posix_types.h	Thu May 15 18:43:38 1997
***************
*** 30,36 ****
  #define __NFDBITS	(8 * sizeof(unsigned long))
  
  #undef __FD_SETSIZE
! #define __FD_SETSIZE	1024
  
  #undef __FDSET_LONGS
  #define __FDSET_LONGS	(__FD_SETSIZE/__NFDBITS)
--- 30,36 ----
  #define __NFDBITS	(8 * sizeof(unsigned long))
  
  #undef __FD_SETSIZE
! #define __FD_SETSIZE	4096
  
  #undef __FDSET_LONGS
  #define __FDSET_LONGS	(__FD_SETSIZE/__NFDBITS)
diff -rc linux/include/linux/fs.h linux.fail/include/linux/fs.h
*** linux-2.0.30/include/linux/fs.h	Thu May 15 17:50:25 1997
--- linux/include/linux/fs.h	Thu May 15 13:00:12 1997
***************
*** 27,33 ****
  
  /* Fixed constants first: */
  #undef NR_OPEN
! #define NR_OPEN 256
  
  #define NR_SUPER 64
  #define BLOCK_SIZE 1024
--- 27,33 ----
  
  /* Fixed constants first: */
  #undef NR_OPEN
! #define NR_OPEN 4096
  
  #define NR_SUPER 64
  #define BLOCK_SIZE 1024
***************
*** 36,43 ****
  /* And dynamically-tunable limits and defaults: */
  extern int max_inodes, nr_inodes;
  extern int max_files, nr_files;
! #define NR_INODE 3072	/* this should be bigger than NR_FILE */
! #define NR_FILE 1024	/* this can well be larger on a larger system */
  
  #define MAY_EXEC 1
  #define MAY_WRITE 2
--- 36,43 ----
  /* And dynamically-tunable limits and defaults: */
  extern int max_inodes, nr_inodes;
  extern int max_files, nr_files;
! #define NR_INODE 24576	/* this should be bigger than NR_FILE */
! #define NR_FILE 8192	/* this can well be larger on a larger system */
  
  #define MAY_EXEC 1
  #define MAY_WRITE 2
diff -rc linux/include/linux/limits.h linux.fail/include/linux/limits.h
*** linux-2.0.30/include/linux/limits.h	Wed Jul 17 14:10:03 1996
--- linux/include/linux/limits.h	Thu May 15 12:26:50 1997
***************
*** 1,7 ****
  #ifndef _LINUX_LIMITS_H
  #define _LINUX_LIMITS_H
  
! #define NR_OPEN 256
  
  #define NGROUPS_MAX 32	/* supplemental group IDs are available */
  #define ARG_MAX 131072	/* # bytes of args + environ for exec() */
--- 1,7 ----
  #ifndef _LINUX_LIMITS_H
  #define _LINUX_LIMITS_H
  
! #define NR_OPEN 4096
  
  #define NGROUPS_MAX 32	/* supplemental group IDs are available */
  #define ARG_MAX 131072	/* # bytes of args + environ for exec() */