clone() stack, mmap() questions and IPC idea

From: Michael Shell (mikes1987@yahoo.com)
Date: Fri Jul 21 2000 - 03:40:48 EST


 Hello,

I have a few questions about clone() and mmap(). I understand that
clone() is not recommended for user programs (they should normally use
the LinuxThreads interface). However, for my application, it seems that
clone() and System V semaphores may be all I need. I very much like the
concept of letting tasks share resources while the kernel still
handles the scheduling. To this day, I think this concept is
misunderstood and lost in the "threads vs processes" war.

After these questions, I have a suggestion for Inter-Task Sync calls.

My first question concerns the prototype for clone(). clone() _used_
to have the form: int clone(int flags); as described in Johnson and
Troan's Linux Application Development (LAD) book. However, clone()
now seems to have a more complex form involving pointers to a
function and allocated stack space - even a nice little ASM call
via Linus to make sure the new stack is transferred properly (I think
this is now already done for us by glibc2's __clone()). With
the old clone(), when used with the CLONE_VM flag, the tasks would share
even the stack space! Ouch. What was the thinking behind the
original clone()? How was it intended to be used if the two tasks
shared a stack when they shared their virtual address space? Was it to
be the first task [pun] of the cloned task to make a new stack for
itself? When did the change to the new clone() take place?
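
For reference, here is a minimal sketch of the function-plus-stack form
of clone() described above. It assumes Linux with the glibc clone()
wrapper. The names clone_demo, child_fn and shared_value, and the 64 KB
stack size, are made up for illustration; the stack is a plain malloc()
block, which is only one of several ways to provide it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for clone() and CLONE_VM in <sched.h> */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/* Visible to both tasks because CLONE_VM makes them share one address space. */
static volatile int shared_value;

/* Entry point of the cloned task; it runs on the stack handed to clone(). */
static int child_fn(void *arg)
{
    shared_value = *(int *)arg;
    return 0;
}

/* Hypothetical helper: returns the value stored by the child, or -1 on error. */
int clone_demo(int value)
{
    /* Positive values only, so a result cannot be confused with the -1 error. */
    assert(value > 0);

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    shared_value = 0;

    /* The stack grows down on most architectures, so pass the TOP of the block. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, &value);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    /* SIGCHLD as the exit signal lets the parent reap the child with waitpid(). */
    if (waitpid(pid, NULL, 0) == -1)
        return -1;             /* child may still be running: do not free its stack */

    free(stack);               /* safe only after the child has exited */
    return shared_value;
}
```

Called from main(), clone_demo(42) should return 42 if the child ran and
really did share the parent's memory. Note that the parent, not the
kernel, owns the stack block and must release it after the child exits.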

One thing that I did not like was the fact that I would have to manage
stack space. I thought that this should be done for me by the kernel, as
is the case with scheduling. I even began to wonder how the kernel
handles the stack with fork() - how does one guess the size of that
which cannot be guessed? However, something caught my eye in mmap() -
MAP_GROWSDOWN. Ahhh, so we set up a barrier page, catch segmentation
faults, then add a page at the end! Neato.

Question (group ;)) #2:
Ok, so can I use mmap() with MAP_GROWSDOWN to make myself a stack for
my clone? Will the kernel auto-grow this stack and not bug any of my
tasks with SIGSEGV as it relates to the growth in stack size (within
rlimits)? What initial size does the kernel use for a fork()?
LinuxThreads uses a single page for the initial stack size with a 2MB
gap between stacks - is this the result of the way MAP_GROWSDOWN
automatically works, or did Xavier Leroy build his own stack manager?
(I know, I know - read the source, but if one knows off the top of
their head... - he does seem to use malloc(), at least for the thread
manager.)

#3: If MAP_GROWSDOWN auto-grows things, how on earth do I free the stack
space when my clone dies? Does its death automatically free the pages
used for its stack? If the parent (who mapped it to begin with) has to
free it, how do I know the size to munmap(), as the kernel has increased
the size of the stack as needed? On a related note, if a clone mmaps
MAP_PRIVATE, will that memory be freed on the clone's death? How about
with MAP_SHARED? (Both of these should be shared with the parent in any
case due to CLONE_VM - right?) Assume anonymous shared mappings (the
latest kernel supports this).

I would also like to humbly suggest that "there oughta be" a better,
faster, easier way to sync tasks and provide consistent, faster IPC
data than the tools currently available. One idea is a kernel call that
gives atomic updates to structures in memory, kinda like an atomic
memcpy (perhaps with reasonable limits on the length of the structure
so that malicious users can't do an atomic 100MB update).

Another is a pause() that updates a given shared memory flag (int *)
with a given value when the task enters the "special" pause(), and sets
another value when it awakes, without the need for another system call
to check sleeping status. Sleeps may occur for other reasons, and there
is a race if you set the flag just before the pause: the parent reads
the flag and it says the child is sleeping, but the child has not yet
reached the pause; the parent sends a signal, which arrives before the
child reaches the pause, and the child then blocks in the pause().

I think, based on my limited knowledge, that these would be very
valuable for applications like Apache, which maintain a shared memory
"scoreboard". They would also allow the parent process to set up and
monitor a "pause gate" for the children, such as just before an
accept() on the socket. (See something incoming via select(), choose
from the paused children, fire a signal to one child to allow it to
handle the accept(), and monitor the child's progress via the
scoreboard - no need to worry about inconsistent fields or semaphores
or other SysV stuff, as all scoreboard updates are atomic. If an
unexpected event should occur, such as the abnormal death of a child,
no problem: the SIGCHLD handler will update the scoreboard atomically
to report this as well.)

Mutex locks are fine, but they do seem like overkill just to
update a couple of shared variables, and they block many children,
which must result in a relatively large overhead as children are
blocked and restarted just because of the locks. Contrast this
with a mere dozen or so asm instructions to set a few vars and no
changes going on in the scheduler as a result.

The idea is that the shared memory scoreboard does not have to reflect
what is going on to the nearest clock cycle (IPC will always have
nonzero transient times) but it must _always_ have consistent data -
like transactions in an SQL database.
The other concept here is the ability of a parent, or other shared
task(s), to know, except for major faults, where a child task is "from
this chosen point on" - i.e. a pause is combined atomically with a way
to inform the parent, or other related task(s), that the pause has been
reached, without resorting to heavyweight sledgehammers like pipes.
Perhaps the child can be freed via another system call, bypassing the
conventional use of signals altogether.

Like: int atomcpy(void *dest, void *src, size_t n);

       returns 0 on success, -1 on error, in which case the
       "transaction" copy did not occur. The memory regions must
       not overlap.
and
       int pausegate(int *flag, int invalue, int outvalue);

       flag is a pointer within shared memory.
       returns 0 if woken by a signal, -1 if some abnormal condition
       has caused a return (could not set flag, etc).
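
As a rough user-space illustration of the "always consistent
scoreboard" goal, the sketch below uses a sequence counter (a seqlock
style scheme). This is a swapped-in technique, not the proposed
atomcpy() kernel call. It assumes C11 atomics, a single writer, and a
small fixed-size record; struct board, board_write and board_read are
hypothetical names. Readers never block the writer; they retry if a
write overlapped their copy. Strictly speaking, the plain memcpy() of
data that may be changing is a data race under the C11 model, so treat
this as a sketch of the idea rather than portable production code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define BOARD_BYTES 64                 /* arbitrary upper bound on record size */

struct board {
    atomic_uint seq;                   /* odd while a write is in progress */
    unsigned char data[BOARD_BYTES];
};

/* Writer side (single writer assumed). Returns 0, or -1 if n is too large. */
int board_write(struct board *b, const void *src, size_t n)
{
    assert(b != NULL && src != NULL);
    if (n > BOARD_BYTES)
        return -1;

    unsigned s = atomic_load_explicit(&b->seq, memory_order_relaxed);
    atomic_store_explicit(&b->seq, s + 1, memory_order_relaxed);  /* mark busy */
    atomic_thread_fence(memory_order_release);
    memcpy(b->data, src, n);
    atomic_store_explicit(&b->seq, s + 2, memory_order_release);  /* publish */
    return 0;
}

/* Reader side: copies a snapshot that no write overlapped. Returns 0 or -1. */
int board_read(struct board *b, void *dest, size_t n)
{
    assert(b != NULL && dest != NULL);
    if (n > BOARD_BYTES)
        return -1;

    unsigned before, after;
    do {
        before = atomic_load_explicit(&b->seq, memory_order_acquire);
        memcpy(dest, b->data, n);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&b->seq, memory_order_relaxed);
        /* Retry if a write was in progress or completed during the copy. */
    } while ((before & 1u) != 0 || before != after);
    return 0;
}
```

The size limit plays the role of the "reasonable limits on the length"
mentioned earlier: an oversized request is refused and nothing is
copied.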

It may be a good idea to extend this to sigsuspend() as well, so that
the child has a chance to block all other signals and no other nonfatal
signal can wake the child until the parent fires its chosen "go"
signal.

This would also be a way around the "Thundering Herd" usage spike that
happens when lots of children are blocked in an accept() and all must be
woken even though only one gets to service the connection. At least we
_can_ have several children blocking on an accept(), if we have a single
listen() socket, with Linux just as under BSD.
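
The arrangement referred to here, several children blocked in accept()
on one listening socket, looks roughly like the sketch below. It only
shows the pattern and that each connection is serviced by exactly one
child; it does not measure how many children the kernel wakes. The
loopback address, the one-byte reply, the 5 second safety alarm and the
names prefork_accept_demo and serve_one are assumptions for
illustration.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child: block in accept() on the inherited socket and answer one client. */
static void serve_one(int listen_fd)
{
    int status = 1;

    alarm(5);                  /* safety net so a stuck child cannot hang forever */
    int conn = accept(listen_fd, NULL, NULL);
    if (conn >= 0) {
        if (write(conn, "k", 1) == 1)
            status = 0;
        close(conn);
    }
    _exit(status);
}

/* Hypothetical demo: returns the number of connections answered, or -1. */
int prefork_accept_demo(int nchildren)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int started = 0, answered = 0;

    assert(nchildren > 0);

    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;         /* let the kernel pick a free port */
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(listen_fd, nchildren) < 0 ||
        getsockname(listen_fd, (struct sockaddr *)&addr, &len) < 0) {
        close(listen_fd);
        return -1;
    }

    /* All children inherit the same listening socket and block in accept(). */
    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid == 0)
            serve_one(listen_fd);      /* never returns */
        if (pid > 0)
            started++;
    }

    /* One connection per child; each should be picked up by exactly one. */
    for (int i = 0; i < started; i++) {
        char c = 0;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            continue;
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
            read(fd, &c, 1) == 1 && c == 'k')
            answered++;
        close(fd);
    }

    close(listen_fd);
    for (int i = 0; i < started; i++)
        wait(NULL);
    return answered;
}
```

With a working loopback interface, prefork_accept_demo(3) should return
3: three children, three connections, one reply each.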

 Sorry for the length of this post! ;)

 In any event, Thank you all so much for Linux!

  Michael Shell

  mikes1987@yahoo.com

  mikes1987 AT yahoo DOT COM (email address filter defeater version)



