Re: RFC [patch 13/34] PID Virtualization Define new task_pid api

From: Eric W. Biederman
Date: Wed Feb 01 2006 - 02:14:25 EST


Linus Torvalds <torvalds@xxxxxxxx> writes:

> On Tue, 31 Jan 2006, Eric W. Biederman wrote:
>>
>> Yes. Although there are a few container lifetime problems with that
>> approach. Do you want your container alive for a long time after every
>> process using it has exited, just because someone has squirrelled away a
>> pid? While container lifetime issues crop up elsewhere as well, PIDs are
>> by far the worst, because it is currently safe to store a PID indefinitely
>> with nothing worse than PID wraparound to worry about.
>
> Are people really expecting to have a huge turn-over on containers? It
> sounds like this shouldn't be a problem in any normal circumstance:
> especially if you don't even do the "big hash-table per container"
> approach, who really cares if a container lives on after the last process
> exited?

The turnover rate is a good argument for not worrying about things too much.
I guess it only really becomes a problem if you have large amounts
of resources locked up.

> I'd have expected that the major user for this would end up being ISP's
> and the like, and I would not expect the virtual machines to be brought up
> all the time.

People doing server consolidation are one of the big user bases. The other,
and possibly the bigger driving force right now, is people dealing with large
high-performance clusters. There the interest is in encapsulating applications
so that you can checkpoint or migrate them.

One container per batch job might not be too high a rate, but if containers
wound up being used for short jobs as well as long ones, you could see as
many as one new container every couple of minutes.

The scary part of the lifetime issues is that if you aren't careful you can
have lots of system resources tied up with no obvious owner.

> If it's a problem, you can do the same thing that the "struct mm_struct"
> does: it has life-time issues because a mm_struct actually has to live for
> potentially a _long_ time (zombies) but at the same time we want to free
> the data structures allocated to the mm_struct as soon as possible,
> notably the VMA's and the page tables.
>
> So a mm_struct uses a two-level counter, with the "real" users (who need
> the page tables etc) incrementing one ("mm_users"), and the "secondary"
> ones (who just need to have an mm_struct pinned, but are ok with an empty
> VM being attached) incrementing the other ("mm_count").

Neat. I had not realized that was what was going on. Having cleaned up a
bunch of cases there ages ago I was about to feel silly, but then I realized
mmdrop is more recent than my comment explaining the difference between mmput
and mm_release.
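
Sketching it out for my own benefit (hypothetical names, not the actual mm
ones; free_expensive_state stands in for tearing down the VMAs and page
tables):

struct thing {
	atomic_t users;		/* primary: callers who need the expensive state */
	atomic_t count;		/* secondary: callers who just need *t pinned */
	/* ... the expensive state itself ... */
};

/* last secondary reference gone: free the structure itself */
void thing_drop(struct thing *t)
{
	if (atomic_dec_and_test(&t->count))
		kfree(t);
}

/* last primary reference gone: free the expensive state, keep the shell */
void thing_put(struct thing *t)
{
	if (atomic_dec_and_test(&t->users)) {
		free_expensive_state(t);
		thing_drop(t);	/* the primary users collectively held one count */
	}
}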

One of the suggestions that has been floating around was to replace the
saved pids in places like fown_struct with references to task structures.
If we took that approach we would have nasty lifetime issues, because we
would continue to pin processes even after they were no longer zombies,
and we can potentially accumulate a lot of fown_structs.

So I am considering introducing an intermediary (along very similar lines
to what you were suggesting), a struct task_ref that is just:

struct task_ref {
	atomic_t count;			/* references to this task_ref */
	enum pid_type type;		/* PIDTYPE_PID, PIDTYPE_PGID, ... */
	struct task_struct *task;	/* the task being tracked */
};

That can be used to track tasks and process groups. I posted fairly
complete patches for review a few days ago. The interesting thing in this
case is that it can solve the pid wraparound issues as well as the container
reference issues, by completely removing the need to store pids at all.
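
To be concrete, the helpers around it would be something like this
(get_task_ref/put_task_ref are just my working names):

static inline struct task_ref *get_task_ref(struct task_ref *ref)
{
	atomic_inc(&ref->count);
	return ref;
}

static inline void put_task_ref(struct task_ref *ref)
{
	if (atomic_dec_and_test(&ref->count))
		kfree(ref);
}

So fown_struct stores a struct task_ref * instead of a pid; at exit time we
clear ref->task and drop the task's own reference, and a stale fown_struct
then pins only the small task_ref rather than a whole task_struct.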

The other technique that has served me well in my network virtualization
work was to set up a notifier: have everyone who cares register a notifier
and drop their references when it is called. For a low number of interested
parties this works very well.
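
The shape is roughly this (struct container, the event, and the chain are
all made up for illustration):

static int my_container_notify(struct notifier_block *nb,
			       unsigned long event, void *data)
{
	struct container *c = data;

	if (event == CONTAINER_DESTROY)		/* hypothetical event */
		drop_my_container_refs(c);	/* subsystem drops its refs */
	return NOTIFY_OK;
}

static struct notifier_block my_container_nb = {
	.notifier_call = my_container_notify,
};

/* at init: notifier_chain_register(&container_chain, &my_container_nb); */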

> (for "mm_struct", the primary is dropped "mmput()" and the secondary is
> dropped with "mmdrop()", which is absolutely horrid naming. Please name
> things better than I did ;)

Well, it is a challenge; there aren't that many good names around, and
it is hard work to find them. :)

Eric