Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().

From: Stephen Champion
Date: Tue Aug 05 2008 - 23:22:55 EST


Eric W. Biederman wrote:
> Robin Holt <holt@xxxxxxx> writes:

>> But if we simply scale based upon num_possible_cpus(), we get a relatively
>> representative scaling function. Usually, customers buy machines with 1,
>> 2, or 4GB per cpu. I would expect a waste of 256k, 512k, or even 1m to
>> be acceptable at this size of machine.

> For your customers, and your kernel thread workload, you get a
> reasonable representation. For other different people and different
> workloads you don't. I happen to know of a completely different
> class of workload that can do better.

Although Robin probably has broader experience, I think we have both had the opportunity to examine the workloads and configuration of a reasonable sample of the active (and historical) large (>=512c) shared memory systems.

Some workloads and configurations are specialized, and perhaps less stressing than the mixed, volatile loads and array of services most of these systems are expected to handle, but the specialized loads have been the exceptions in my experience. That may change as the price per core continues to go down and pseudo-shared-memory systems based on cluster interconnects become more common and possibly even viable, but don't hold your breath.

>> For 2.6.27, would you accept an upper cap based on the memory size
>> algorithm you have now and adjusted for num_possible_cpus()? Essentially
>> the first patch I posted.

> I want to throw a screaming hissy fit.

If those get my users more cycles, I'll start reading the list religiously!

> The merge window has closed. This is not a bug. This is not a
> regression. I don't see a single compelling reason to consider this
> for 2.6.27. I asked for clarification so I could be certain you were
> solving the right problem.

Early in 2.6.28 might work for us. 2.6.27 would be nice. Yes, we'd like the distribution vendors to pull it. If we ask nicely, the one which matters to me (and my users) is quite likely to take it if it has been accepted early in the next cycle. They've been very good about that sort of thing (for which I'm very thankful). So while it's extra administrivia, I'm not the one who has to fill out the forms and write up the justification ;-)

But the opposite question: does the proposed patch have significant risks or drawbacks? We know it offers a minor but noticeable performance improvement for at least some of the small set of systems it affects. Is it an unreasonable risk for other systems - or is there a known group of systems it would have an effect on which would not benefit, or might even be harmed? Would a revision of it be acceptable, and if so (based on the answers to the prior questions), what criteria should a revision meet, and what time frame should we target?

> Why didn't these patches show up 3 months ago when the last merge
> window closed? Why not even earlier?

It was not a high priority, and I didn't push on it until after the trouble with proc_pid_readdir was resolved (and the fix floated downstream to me). Sorry, but it was lost in higher priority work, and it was not something nagging at me, as I had already made the change on the systems I build for.

> I totally agree that what we are doing could be done better, however
> at this point we should be looking at 2.6.28. In which case looking
> at the general long term non-hack solution is the right way to go. Can
> we scale to different workloads?

> For everyone with fewer than 4K cpus the current behavior is fine, and
> with 4K cpus it results in a modest slowdown. This sounds usable.

I'd say the breakpoint - where increasing the size of the pid hash starts having a useful return - is more like 512 or 1024 cpus. On NUMA boxes (which I think covers most, if not all, of the large processor count systems), walking the list in a bucket (which more often than not will be remote) can be expensive, so we'd like to be closer to 1 process per bucket.
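
To put rough numbers on that (my own back-of-envelope figures, not anything from the patch): if I'm reading kernel/pid.c right, the stock clamp of pidhash_shift at 12 tops the table out at 4096 buckets, so a machine running a couple hundred thousand tasks averages dozens of entries per chain, and on NUMA nearly every link in such a chain is likely a remote reference. A little userspace toy shows the shape of it (the task counts and the raised shift of 16 are hypothetical examples):

/*
 * Back-of-envelope illustration, not kernel code: average pid hash
 * chain length at the stock cap (shift 12) vs. one possible raised
 * cap (shift 16), for a few hypothetical task counts.
 */
#include <stdio.h>

int main(void)
{
	unsigned long tasks[] = { 4096, 50000, 200000 };
	int shifts[] = { 12, 16 };

	for (int s = 0; s < 2; s++) {
		unsigned long buckets = 1UL << shifts[s];

		printf("pidhash_shift=%d (%lu buckets):\n",
		       shifts[s], buckets);
		for (int t = 0; t < 3; t++)
			printf("  %6lu tasks -> avg chain length %.1f\n",
			       tasks[t], (double)tasks[t] / buckets);
	}
	return 0;
}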

> You have hit an extremely sore spot with me. Anytime someone makes an
> argument that I hear as RHEL is going to ship 2.6.27 so we _need_ this
> patch in 2.6.27 I want to stop listening. I just don't care. Unfortunately
> I have heard that argument almost once a day for the last week, and I am
> tired of it.

Only once a day? That's an easy silly season, given two major distributions taking a snapshot at 2.6.27. I can see that getting annoying, and it's an unfortunate follow-on effect of how Linux gets delivered to users who require commercial support and/or third party application certifications for whatever reason (which unfortunately includes my users). Developers and users both need to push the major distributions to offer something reasonably current - we're both stuck with this silliness until users can count on new development being delivered in something a bit shorter than two years.

Caught in the middle, I ask both sides to push on the distributions at every opportunity! <push push>.

> Why hasn't someone complained that waitpid is still slow?

Is it? I hadn't noticed, but I usually only go for the things users are in my cubicle complaining about, and I'm way downstream, so if it's not a problem there, I won't notice until I can get some time on a system to play with something current (within the next week or two, I hope). I can look then, if you'd like.

> Why haven't we seen patches to reduce the number of kernel threads since
> last time you had problems with the pid infrastructure?
>
> A very frustrated code reviewer.

> So yes. If you are not interested in 2.6.28 and in the general problem,
> I'm not interested in this problem.

Is there a general problem?

The last time we had trouble with the pid infrastructure, I believe it was the result of a patch leaking through which, frankly, was quite poor. I believe its deficiencies have been addressed, and it looks like we now have a respectable implementation which should serve us well for a while.

There certainly is room for major architectural improvements. Your ideas for moving from a hash to a radix tree are a good direction to take, and are something we should work on as processor counts continue to grow. It is likely that we stand to gain in both raw cycles consumed and memory consumption - but we're not going to see that tomorrow.

I would think reducing process counts is also a longer term project. I wouldn't be looking at 2.6.28 for that, but rather 2.6.30 or so. Most (possibly all) of the worst offenders appear to be using create_workqueue, which I don't expect will be trivial to change. If someone picked up the task today, it might be ready for 2.6.29, but we may want more soak time, as it looks to me like an intrusive change with a high potential for unexpected consequences.

From where I'm sitting, the current mechanism seems to do reasonably well, even with very large numbers of processes (hundreds of thousands), provided that the hash table is large enough to account for increased use. The immediate barrier to adequate performance on large systems (that is, not unnecessarily wasting a significant portion of cycles) is the unreasonably low cap on the size of the hash table: it's an artificial limit, based on an outdated set of expectations about the sizes of systems. As such, it's easy to extend the useful life of the current implementation with very little cost or effort.

A major rework with more efficient resource usage may be a higher priority for someone looking at higher processor counts with (relatively) tiny memory sizes. If such people exist, it should not be difficult to take them into account when sizing the existing pid hash.

That's a short term (tomorrow-ish), very low risk project with immediate benefit: a small patch with no effect on systems <512c, which grows the pid hash when it is likely to be beneficial and there is plenty of memory to spare.
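
To make that concrete, here's roughly the shape of the change - a sketch under my own assumptions, not the patch as posted. The scaling rule and bounds below are placeholders; the stock pidhash_init() in kernel/pid.c sizes the table from nr_kernel_pages and clamps pidhash_shift at 12, and the only idea here is to let the clamp grow with num_possible_cpus():

/*
 * Sketch only -- not the submitted patch.  The stock pidhash_init()
 * derives pidhash_shift from memory size and clamps it at 12; the
 * idea is to let machines with large cpu counts raise that clamp,
 * while the memory-based estimate still bounds smaller boxes.
 */
void __init pidhash_init(void)
{
	int i, pidhash_size;
	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
	int max_shift = 12;

	/*
	 * Hypothetical rule: one extra doubling of the table per
	 * doubling of possible cpus beyond 256, capped at 2^20.
	 */
	if (num_possible_cpus() > 256)
		max_shift = min(20, 12 + ilog2(num_possible_cpus() / 256));

	pidhash_shift = max(4, fls(megabytes * 4));
	pidhash_shift = min(max_shift, pidhash_shift);
	pidhash_size = 1 << pidhash_shift;

	pid_hash = alloc_bootmem(pidhash_size * sizeof(*pid_hash));
	if (!pid_hash)
		panic("Could not alloc pidhash!\n");
	for (i = 0; i < pidhash_size; i++)
		INIT_HLIST_HEAD(&pid_hash[i]);
}

With something like that, a 64c box never sees a change (the memory-based estimate and the old clamp still apply), while a 4K cpu machine with memory to spare gets a few more doublings of the table.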

I'd really like to see an increased limit on the size of the pid hash in the near term. If we can reduce process counts, we might revisit the sizing. Better would be to start work on a more resource efficient implementation to eliminate it before we have to revisit it. Ideal would be to move ahead with all three. I don't see any (sensible) reason for any of these steps to be mutually exclusive.

--
Stephen Champion                      Silicon Graphics Site Team
schamp@(sgi.com|nas.nasa.gov)         NASA Advanced Supercomputing