O_CLOEXEC (was Re: thread rant)

From: dean gaudet (dean-list-linux-kernel@arctic.org)
Date: Tue Sep 05 2000 - 15:06:57 EST


On Sat, 2 Sep 2000, Alexander Viro wrote:

> On Sat, 2 Sep 2000, Jamie Lokier wrote:
>
> > dean gaudet wrote:
> > > an example of brokenness in the traditional fd API is close-on-exec --
> > > there's a race between open()/socket()/pipe() and fcntl(FD_CLOEXEC) during
> > > which if another thread does a fork() it's possible the child will inherit
> > > an fd it shouldn't... working around it is painful. the model which
>
> Really? Like, say it, close() before exec()?
>
> > > NT/OS2 use for creating a new process scales better in the 99.99% case of
> > > stdin/out/err -- you only specify those fds you want to keep in the new
> > > process.
> >
> > An obvious solution presents itself. O_CLOEXEC.

yeah i proposed this and non-blocking inheritance from accept() when i was
doing the initial modelling for threaded apache-2.0 about 3 years ago. you
need O_CLOEXEC to be inherited through accept() as well. oh, and you also
want it on pipe(), socket(), and socketpair() to be complete. (if you
search linux-kernel archives you can find that this topic has come up
maybe 4 times in 3 years.)

> Even more obvious solution: close what you need to close if you have
> sensitive descriptors around. Close-on-exec is a kludge. If you have
> sensitive pieces of descriptor table you want to do some other things too
> - e.g. make sure that it gets unshared before exec(). Because new
> descriptors of that kind are very likely to follow...

if you'll forgive me for using webservers as an example again i can
maybe give some motivation.

suppose you're building a threaded server to scale to 20k connections,
and you want to support fork()/exec() of CGIs dirctly (rather than
through a separate daemon process).

if you don't use close-on-exec, then after fork before exec you have
to iterate over the 20k descriptors to close them all -- or you have
to walk your own data structures (a list of open sockets) to find what
descriptors are open and close them.

if you want to walk your own data structures then these structures need
to be coherent at the time of the fork -- which means semaphores to
add/delete from the list. this would mean an extra synchronization after
accept(). i suspect this would suck worse than iterating over 3..20000
and calling close -- think of all the cache line activity on an SMP box.

if you use O_CLOEXEC then it's not possible for an fd to be open which
isn't marked to be closed on the exec, and so you've no extra work to
do before exec.

but if you try to use FD_CLOEXEC there's a critical section between
the open/socket/accept/pipe/... and the fcntl(). you pretty much have
to put it under a semaphore (although you can get away with a rwlock),
or iterate over all the descriptors.

opendir() has a similar race, except it's hidden inside libc.

the most portable, and arguably cleaner solution is to use a separate
unthreaded process for forking and fd passing... as long as this meets
your needs.

fork/exec and threads just don't mix so well with lots of fds.
(a spawnve() API with an additional argument to indicate which fds to
dup would be better... similar to NT's CreateProcess)

i've a hairbrained idea for replacing fd integers with pointers which
might actually scale better if anyone wants to hear it again.

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Sep 07 2000 - 21:00:22 EST