I don't think that achieving a constant error bound is always a good thing. We all know that fairness has overhead. If I have 3 threads and 2 processors, and I have a choice between fairly giving each thread 1.0 billion cycles during the next second, or unfairly giving two of them 1.1 billion cycles and giving the other 0.9 billion cycles, then we can have a useful discussion about where we want to draw the line on the fairness/performance tradeoff. On the other hand, if we can give two of them 1.1 billion cycles and still give the other one 1.0 billion cycles, it's madness to waste those 0.2 billion cycles just to avoid user jealousy. The more complex the memory topology of a system, the more "free" cycles you'll get by tolerating short-term unfairness. As a crude heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but eventually we should take the boot-time computed migration costs into consideration.
Adding system calls, while great for research, is not something which is done lightly in the published kernel. If we're going to implement a user interface beyond simply interpreting existing priorities more precisely, it would be nice if this was part of a framework with a broader vision, such as a scheduler economy.
Scheduling Algorithm:
The scheduler keeps a set of data structures, called Trio groups, to maintain the weight or reservation of each thread group (containing one or more threads) and the local weight of each member thread. When scheduling a thread, it consults these data structures and computes (in constant time) a system-wide weight for the thread that represents an equivalent CPU share. Consequently, the scheduling algorithm, DWRR, operates solely based on the system-wide weight (or weight for short, hereafter) of each thread. Having a flat space of system-wide weights for individual threads avoids performing separate scheduling at each level of the group hierarchy and thus greatly simplifies the implementation of group scheduling.
Implementing a flat weight space efficiently is nontrivial. I'm curious to see how you reworked the original patch without global locking.
I had a feeling this patch was originally designed for the O(1) scheduler, and this is why. The old scheduler had expired arrays, so adding a round-expired array wasn't a radical departure from the design. CFS does not have an expired rbtree, so adding one *is* a radical departure from the design. I think we can implement DWRR or something very similar without using this implementation method. Since we've already got a tree of queued tasks, it might be easiest to basically break off one subtree (usually just one task, but not necessarily) and migrate it to a less loaded tree whenever we can reduce the difference between the load on the two trees by at least half. This would prevent both overcorrection and undercorrection.
The idea of rounds was another implementation detail that bothered me. In the old scheduler, quantizing CPU time was a necessary evil. Now that we can account for CPU time with nanosecond resolution, doing things on an as-needed basis seems more appropriate, and should reduce the need for global synchronization.
In summary, I think the accounting is sound, but the enforcement is sub-optimal for the new scheduler. A revision of the algorithm more cognizant of the capabilities and design of the current scheduler would seem to be in order.
I've referenced many times my desire to account for CPU/memory hierarchy in these patches. At present, I'm not sure we have sufficient infrastructure in the kernel to automatically optimize for system topology, but I think whatever design we pursue should have some concept of this hierarchy, even if we end up using a depth-1 tree in the short term while we figure out how to optimize this.