Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().

From: Eric W. Biederman
Date: Mon Aug 04 2008 - 16:43:27 EST


Stephen Champion <schamp@xxxxxxx> writes:

> Despite more recent changes in proc_pid_readdir, my results should apply to
> current source. It looks like both the old 2.6.16 implementation and the
> current version will call find_pid (or equivalent) once for each successive
> getdents() call on /proc, excepting when the cursor is on the first entry. A
> quick look, and we have 88 getdents64() calls both 'ps' and 'ls /proc' with 29k
> processes running, which appears to be the primary source of calls.

Yes. proc_pid_readdir->next_tgid->find_ge_pid does a hash table lookup.

> It's not giganormous, although I probably could come up with a pointless
> microbenchmark to show it's 300% better. Importantly, it does noticeably
> improve normal interactive tools like 'ps' and 'top', a performance
> visualization tool developed by a customer (nodemon) refreshes faster. For a
> 512k init allocation, that seems like a very good deal.

Alright. The fact that it really is tools like ps and top that are proc intensive
shapes my opinion somewhat. At it is doing lots and lots of hash tables lookups
so we can see each process that is expensive. Rather then simple one at
a time lookups.

Hash table sizing gets us into a guessing game of how many pids do we expect
to deal with on this work load. And we guess by either memory (how many hash
table entries can we afford) or cpu how many hash table entries can we expect.

There is an alternative to growing the hash table that I was playing
with at one time. Use a radix tree that has essentially the same
properties for insert and delete as today's hash table. rcu for lookup
and a lock (spin lock?) for insert (memory allocation for the intermediate
nodes is the tricky bit).

Because pids are naturally dense we can accelerate the /proc accesses
by walking the radix tree and with a good implementation we only spend
as much memory as we have pids to worry about.

This is particular appealing to me in the context of pid namespaces.

I did not pursue it at the time I thought of it because I could not think
of a real workload that had enough pids to make the performance of the
current pid hash be noticable. For our normal workload today my rough numbers
said the pid hash is optimal.

The interesting twist to this tale is that you have 29k processes, and
20k of them are kernel threads. Which seems to say that it isn't user
space that is putting a heavy workloads on the pid infrastructure but
all of the per cpu kernel threads.

At only 9k processes we will have an average hash chain length of 2.25
which should perform well, as we have a good hash function and hash chain
lengths cluster well. This is as opposed to the average hash chain length
of 7.25 which you are seeing now.

It looks to me like a cap of 64k is excessive we could place it down
at 16k (pid_hash_shift = 14) and your machines should perform well in
ps and top related activities, and it would only cost 128k. If
you killed the excess kernel processes you could live very happily
with a 8k (pid_hash_shift = 13).

If we want something that tunes to the work load I expect a radix tree
would be best. If the goal after 4k cpus is 64k cpus which I heard someone
mention I think that is what you really want. A design that scales to
the workload on the system.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/