Re: threaded apps and default file descriptor flags

Dean Gaudet (dgaudet-list-linux-kernel@arctic.org)
Sun, 24 May 1998 20:27:52 -0700 (PDT)


I wrote the below message a while back while the list was broken... or
at least I wasn't receiving messages. So I'm including it for those who
don't have the context. You may want to skip down and read it first.

A larger problem, which I didn't even attack in my previous message, is that
threaded servers can easily run out of file descriptors. There are the
following demands on file descriptors in a server such as Apache:

- log files
- listening network sockets
- client sockets
- outgoing proxy sockets
- read-only open file handles

I'll skip the demands that other libraries may impose, such as database
connections.

Consider for a moment also that Linux may need sendfile() (aka
TransmitFile in WIN32 lingo) to keep up at the top end. Why? Well you
may recall my post mentioning the read() vs. mmap() performance difference
when the working set gets larger than RAM (read() is ~4x faster).
DaveM gave me a test patch that narrowed the margin, but didn't close it.
Something like sendfile() could close it completely without complicated
gymnastics. sendfile() requires the server to cache open filehandles
(as opposed to mmaps, for which it can close the filehandle immediately
after the mmap call). And having hundreds to thousands of busy files is
not unreasonable for a high end server.

All these can easily add up to 1024 descriptors in a busy threaded server.
I honestly don't believe the right answer is to increase OPEN_MAX; I
believe that no matter what number we pick, we can blow it away with a
server design/configuration that is reasonable (and expected or known to
exist "in the field"). Furthermore, it is painful to copy these large
arrays of open files across fork() -- when in fact almost none of the
files are needed across fork(), usually only 3 handles are required.

So the question is, why have a static limit? To fuel the fire: WIN32
doesn't have a static limit. Here's my proposal, which supersedes some
of what I previously proposed.

Define an "extended file descriptor" or XFD to be an int which is greater
than 0, and also greater than the RLIMIT_NOFILE hard max for the process.

XFDs cannot be inherited across exec(). (optional: XFDs default to
close-on-exec, but I really don't think it's necessary to even support
inherit across exec.)

XFDs can be passed to read(), write(), send()/recv() (and relatives),
poll(), lseek(), close(), bind(), listen(), accept() ... I'm missing
many ... and behave the same way as regular FDs do.

XFDs are incompatible with select(), and cannot be used with it at all.
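
The reason select() is singled out: an fd_set is a fixed-size bitmap
indexed by descriptor number, so a descriptor at or above FD_SETSIZE
cannot be represented in it, while poll() takes an array of plain ints.
A minimal sketch of a poll()-based wait follows. The helper name
wait_readable is made up for illustration, and it is shown with ordinary
descriptors since XFDs are only a proposal.

```c
#include <poll.h>

/*
 * Hypothetical helper: wait until fd is readable.
 * Returns 1 if readable, 0 on timeout, -1 on error
 * (including a hangup or error condition with no data to read).
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = fd;          /* any int works here, no FD_SETSIZE ceiling */
    pfd.events = POLLIN;
    pfd.revents = 0;

    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A server that sticks to a wrapper like this never touches fd_set, which
is the property the proposal depends on.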

Extend open() with O_USE_XFD to cause it to return an XFD instead of a
regular FD.

Add an F_DUPFDX operation to fcntl() which will duplicate a file descriptor,
creating an XFD. The F_DUPFD operation will *only* create regular file
descriptors, as will dup() and dup2().

Create socket2(domain, type, protocol, flags), which is the same as
socket() except flags can take the values O_NONBLOCK, and O_USE_XFD.
Reason: proxy HTTP servers have to create sockets on the fly, without
this new syscall the application has to call socket() and fcntl(F_DUPFDX)
(and depending on the threading implementation -- fcntl(F_GETFL), and
fcntl(F_SETFL) to set O_NONBLOCK).
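
For comparison, here is a sketch of the multi-call sequence a proxy has
to go through today just to get a non-blocking socket. The function name
socket_nonblock is invented for illustration. socket2(), O_USE_XFD and
F_DUPFDX are proposals only, so the sketch marks where the F_DUPFDX step
would go rather than calling it.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a socket and set O_NONBLOCK on it.
 * Costs three syscalls; the proposed socket2() would do it in one.
 * Returns the descriptor, or -1 on failure.
 */
int socket_nonblock(int domain, int type, int protocol)
{
    int fd = socket(domain, type, protocol);
    if (fd < 0)
        return -1;

    /* Under the proposal, an fcntl(fd, F_DUPFDX) + close(fd) pair
     * would go here to move the socket into XFD space. */

    int flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```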

Create accept2(s, addr, addrlen, flags), similar to socket2(). (At this
point I'd argue for the creation of something like WIN32's AcceptRead()
... but that's getting a bit zealous.)

socketpair() and pipe() should also be similarly extended with a
flags parameter which applies to both halves created. An application
will almost always have to dup() one of these to regular descriptors
after fork()ing.

A clone() and fork() will share the same XFDs as the parent, although a
flag CLONE_CLOSE_XFDS could be added which would cause the child to have
no XFDs. It's expected that the usual fork(); dup a few descriptors;
exec() sequence will dup any needed XFDs to regular FDs, and then on
the exec() all the shared XFDs will be released.
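
The fork(); dup; exec() sequence referred to above looks roughly like
this with today's descriptors. This is a sketch only: spawn_with_stdout
is a made-up name, and out_fd is an ordinary descriptor standing in for
what would be an XFD under the proposal.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the program at `path` with its stdout
 * connected to out_fd, and wait for it.
 * Returns the child's exit status, or -1 on failure.
 */
int spawn_with_stdout(const char *path, char *const argv[], int out_fd)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: dup the one descriptor the new program needs onto
         * a regular, well-known descriptor number... */
        if (dup2(out_fd, STDOUT_FILENO) < 0)
            _exit(127);
        /* ...then exec. Under the proposal every XFD would be
         * released at this point without any per-descriptor work. */
        execv(path, argv);
        _exit(127); /* only reached if execv() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```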

That's pretty much the entire proposal.

As far as implementation goes, I think one pretty obvious way to do
this is to dedicate pages to XFDs and fill them with "struct file"s.
Add "struct xfds_struct { int count; }", and a pointer "struct xfds_struct
*xfds" to struct task_struct. Add "struct xfds_struct *xfds_owner;" to
struct file -- a NULL owner means that the file is either not an XFD,
or it is a closed XFD. Then when the user passes a supposed XFD to the
kernel, the kernel first maps the handle to a page number, and checks
if the offset is a multiple of sizeof(struct file). Then it checks if
the owner pointer is the same as the task's xfds pointer.
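
The validation steps above can be modelled in user space as follows.
This is a sketch under stated assumptions, not kernel code: file_model
and task_model are stand-ins for struct file and struct task_struct,
the page size and page count are arbitrary, and xfd_base stands in for
the RLIMIT_NOFILE hard max that XFD values must exceed.

```c
#include <stddef.h>

#define XFD_PAGE_SIZE 4096   /* assumed page size */
#define XFD_NPAGES    4      /* assumed number of dedicated pages */

struct xfds_struct { int count; };

/* Stand-in for struct file, with the proposed owner pointer. */
struct file_model {
    struct xfds_struct *xfds_owner; /* NULL: not an XFD, or a closed XFD */
    long pos;
};

/* Stand-in for struct task_struct, with the proposed xfds pointer. */
struct task_model {
    struct xfds_struct *xfds;
};

#define XFD_SLOTS (XFD_PAGE_SIZE / sizeof(struct file_model))

/* The dedicated pages, each filled with file structures. */
static struct file_model xfd_pages[XFD_NPAGES][XFD_SLOTS];

/* Map a user-supplied handle to a file, or NULL if it is not a live
 * XFD belonging to this task. */
struct file_model *xfd_lookup(const struct task_model *task,
                              long xfd_base, long handle)
{
    if (handle < xfd_base)
        return NULL;                      /* a regular FD, not an XFD */

    long rel  = handle - xfd_base;
    long page = rel / XFD_PAGE_SIZE;      /* handle -> page number */
    long off  = rel % XFD_PAGE_SIZE;      /* handle -> offset in page */

    if (page >= XFD_NPAGES)
        return NULL;
    if (off % (long)sizeof(struct file_model) != 0)
        return NULL;                      /* not on a struct boundary */

    size_t slot = (size_t)off / sizeof(struct file_model);
    if (slot >= XFD_SLOTS)
        return NULL;

    struct file_model *f = &xfd_pages[page][slot];
    if (f->xfds_owner == NULL || f->xfds_owner != task->xfds)
        return NULL;                      /* closed, or someone else's */
    return f;
}
```

Note that the lookup is a couple of divisions and one pointer compare,
with no table to search or resize.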

You'll notice the proposed semantics give no way for a task to iterate
over its XFDs... I don't see a need for it: the process almost always
needs its own listing of all the open file handles it has, so let it
keep track if it wants to. That lets the "struct xfds_struct" be
so simple.

And this implementation can be extended easily to avoid lock contention
between processors... just keep separate XFD pages per processor...

XFDs in this form would work well for applications such as Apache,
Squid, and pretty much every network server that I can think of right
now. They have the advantage that the fixed-size set of regular
file descriptors can be kept reasonably small, which reduces the need
for special kernel code to support variable sized files_structs.
It requires the application to be written very carefully to avoid
select() everywhere -- but, for example, I've already got Apache
ported to the NSPR library which hides all of the details. I would
be able to put the code that understands/knows about XFDs into
NSPR and it would be completely hidden from applications which use
NSPR.

Dean

On Sun, 10 May 1998, Dean Gaudet wrote:

> As I work more on the threaded port of Apache I'm starting to really curse
> the default unix semantics on file descriptors. In particular I want all
> my descriptors (sockets, files, whatever) to be opened O_NONBLOCK, and to
> be *non-inheritable across exec* (both are the opposite of the unix
> defaults). This is a long description of the problems and solutions,
> fodder for folks thinking about linux 2.3.
>
> Consider a busy threaded webserver, it could easily be pushing the 1024
> limit of open file descriptors, and trying to fork CGIs left and right.
> AFAIK, my current choices for the exec() problem are:
>
> - do fcntl(F_SETFD) on every file descriptor as it is created
>
> - remember all the open filehandles, and after fork() do close() on
> all the necessary ones
>
> Apache currently implements the second option, because in a multiprocess
> setting it's not very expensive -- there's typically a dozen or so
> filehandles to close. But in a threaded server it sucks because there
> could be hundreds of filehandles to close, and these are allocated by
> multiple threads so keeping a list of them is painful -- my threads don't
> have to synchronize for very many things at all, and I don't want to add
> this to the list.
>
> The first option totally sucks, enough said.
>
> I took a brief look at Solaris man pages, because I know Sun has been
> doing heavy threading work for some time, but I couldn't see any solution
> to these problems. Maybe someone else knows of something like this in
> other unixes. The default on WIN32 is for opened files/sockets to be
> "close-on-exec" (excuse the abuse of terminology).
>
> I essentially want to be able to set two flags for the process. Maybe as
> parameters to personality(), doesn't matter to me where, I just care about
> the semantics. Like this:
>
> NONBLOCK_DEFAULT
> If set, all descriptors created in this process will be created
> with O_NONBLOCK set. This includes descriptors created by open(),
> dup() (and F_DUPFD), accept(), socket(), socketpair(), and pipe().
> This flag is inherited across clone()/fork(), but is cleared on
> exec().
>
> CLOEXEC_DEFAULT
> If set, all descriptors created in this process will be created
> with FD_CLOEXEC set. This includes descriptors created by open(),
> dup() (and F_DUPFD), accept(), socket(), socketpair(), and pipe().
> This flag is inherited across clone()/fork(), but is cleared on
> exec().
>
> Note that socketpair() and pipe() are generally only of use when creating
> children to read/write them. It's almost always the case that you'll have
> to remove O_NONBLOCK or FD_CLOEXEC on one of the two filedescriptors
> created. But that's the same amount of work you'd have to do if you had
> to add those flags to the other descriptor... and so for consistency I'd
> rather have to remove the flags.
>
> dup() is almost always used in a similar setting and could easily clear
> O_NONBLOCK or FD_CLOEXEC. I don't care either way -- I just suggested it
> for consistency.
>
> F_DUPFD, on the other hand, has some very specific uses that dup() doesn't
> get used for. We use it in Apache to guarantee that there are a few
> descriptors in the 3..15 range available for use by 3rd party libraries
> (the ones that generally have pre-compiled FD_SETSIZE limitations or other
> such nonsense). In multithreaded apache this isn't even feasible, and I'm
> just not worrying about it. But again... I'm just trying to make the
> above flags as consistent as possible.
>
> The last paragraph gives us a hint as to why this interface is completely
> broken: 3rd party libraries (remember that the Apache which we "ship"
> can be linked against anything by apache module developers, I have to
> worry about how things will behave in that situation). It's unlikely,
> but definitely not impossible, for a 3rd party library to care about the
> FD_CLOEXEC flag. On the other hand, it's very likely to be confused
> by O_NONBLOCK.
>
> So perhaps this draconian process-wide option isn't the best.
> Let's consider the other end of the scale. I really only need this
> functionality on sockets created by accept(), and files opened by open().
> open() already has O_NONBLOCK, it only needs O_CLOEXEC, and I'd be happy.
> Then for accept() if there were two flags I could set via fcntl() --
> O_INHERIT_NONBLOCK, and O_INHERIT_CLOEXEC, which affected the new socket
> created, that would satisfy my needs. Like I said, for pipe()s I already
> have to modify one half of the pipe.
>
> This solution allows 3rd party libraries to get the default semantics
> on stuff they open() or accept(), while apache itself gets the semantics
> it needs. But it's unlikely that a 3rd party library is going to mark a
> descriptor as close-on-exec... and I just realised that apache probably
> already has problems with CGIs getting 3rd party library descriptors to
> play with. I bet there are some DoS attacks and even worse possible right
> now because a socket to a database remains open across an exec() of a CGI.
>
> There are two cases to consider:
>
> - apache itself does the exec(), in which case it knows that everything
> it opened will be closed because it used O_INHERIT_CLOEXEC or
> O_CLOEXEC. But it doesn't know if a 3rd party library opened
> something -- if it did then it almost certainly shouldn't be
> inherited.
>
> - a 3rd party library does the exec(). In which case it knows only about
> stuff it opened -- and will likely do the right thing with those.
> If apache used O_INHERIT_CLOEXEC or O_CLOEXEC then those, too,
> will be handled properly.
>
> I'm willing to argue that 3rd party libraries (or the apache modules which
> invoke them) should handle their own descriptors' FD_CLOEXEC flags.
>
> So I think this second interface is the most useable.
>
> Dean
>
>
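
For reference, the first workaround listed in the quoted message (an
fcntl(F_SETFD) on every descriptor as it is created) amounts to calling
something like the following after each open(), accept(), socket() and
so on. The helper name set_cloexec is made up for illustration.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: mark fd close-on-exec, preserving any other
 * descriptor flags. Returns 0 on success, -1 on failure.
 */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}
```

That is two extra syscalls per descriptor, and in a threaded server
there is a window between creating the descriptor and setting the flag
during which another thread may fork() and exec().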
