Re: [RFC 00/60] Coscheduling for Linux
From: Jan H. Schönherr
Date: Mon Sep 24 2018 - 11:24:09 EST
On 09/18/2018 04:40 PM, Rik van Riel wrote:
> On Fri, 2018-09-14 at 18:25 +0200, Jan H. Schönherr wrote:
>> On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
>>> On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. SchÃnherr wrote:
>>>>
>>>> B) Why would I want this?
>>>> [one quoted use case from the original e-mail]
>
> What are the other use cases, and what kind of performance
> numbers do you have to show examples of workloads where
> coscheduling provides a performance benefit?
For further use cases (still an incomplete list) let me redirect you to the
unabridged Section B of the original e-mail:
https://lkml.org/lkml/2018/9/7/1521
If you want me to, I can go into more detail and make the list from that
e-mail more complete.
Note that many coscheduling use cases are not primarily about performance.
Sure, there are the resource contention use cases, which are about hardly
anything else. See, e.g., [1] for a survey with further pointers to the
potential performance gains. Realizing those use cases would require either
a user space component driving this, or another kernel component performing
a function similar to the current auto-grouping, with more complexity
depending on the desired level of sophistication. That extra component is
out of my scope, but I see a coscheduler like this as an enabler for
practical applications of these kinds of use cases.
If you use coscheduling as part of a solution that closes a side-channel,
performance is a secondary aspect, and hopefully we don't lose much of it.
Then there's the large fraction of use cases where coscheduling is
primarily about design flexibility: it enables different (old and new)
application designs that usually cannot be executed efficiently without
coscheduling. For these use cases performance is important, but there is
also a trade-off against the development costs of alternative solutions
to consider. These are also the use cases where we can do measurements
today, i.e., without some yet-to-be-written extra component.
For example, with coscheduling it is possible to use active waiting
instead of passive waiting/spin-blocking on non-dedicated systems,
because lock holder preemption is no longer an issue. It also allows
running applications that were developed for dedicated scenarios in
non-dedicated settings without loss in performance -- like an
(unmodified) operating system within a VM, or HPC code. Another example
is cache optimization of parallel algorithms, where you don't have to
resort to cache-oblivious algorithms for efficiency, but can stay with
manually tuned or auto-tuned algorithms, even on non-dedicated systems.
(You're even able to do the tuning itself on a system that has other
load.)
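To make "active waiting" concrete: below is a minimal userspace sketch
(my own illustration, not code from this patch set) of a
test-and-test-and-set lock built on C11 atomics. Without coscheduling, a
waiter can burn its entire timeslice in the spin loop while the preempted
lock holder makes no progress; with coscheduling, the holder is guaranteed
to run at the same time, so the spin is bounded by the actual critical
section and no blocking fallback is needed.

#include <stdatomic.h>
#include <stdbool.h>

struct tts_lock {
	atomic_bool locked;
};

static void tts_acquire(struct tts_lock *l)
{
	for (;;) {
		/* Attempt to take the lock. */
		if (!atomic_exchange_explicit(&l->locked, true,
					      memory_order_acquire))
			return;
		/* Active waiting: spin on plain loads to limit cache traffic. */
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			; /* a cpu_relax()/pause hint would go here */
	}
}

static void tts_release(struct tts_lock *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}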
Now, you asked about the performance numbers that *I* have.
If a workload has issues with lock-holder preemption, I've seen
improvements of 5x to 20x with coscheduling. (This includes parallel
programs [2] and VMs with unmodified guests without PLE [3].) That is of
course highly dependent on the workload. I currently don't have any
numbers comparing coscheduling to other solutions used to reduce/avoid
lock holder preemption that don't mix in any other aspect like resource
contention. These would have to be micro-benchmarked.
If you're happy to compare across some more moving variables: more or
less blind coscheduling of parallel applications, with an automatic
workload-driven (but application-agnostic) width adjustment of
coscheduled sets, yielded an overall performance benefit of roughly 10%
to 20% compared to approaches with passive waiting [2]. It was roughly
on par with pure space-partitioning approaches (a slight minus on
performance, a slight plus on flexibility/fairness).
I never went much into the resource contention use cases myself. Though,
I did use coscheduling to extend the concept of "nice" to sockets by
putting all niced programs into a coscheduled task group with
appropriately reduced shares. This way, niced programs don't just get
any and all idle CPU capacity -- taking away parts of the energy budget
of more important tasks all the time -- which leads to important tasks
running at turbo frequencies more often. Depending on the parallelism of
the niced workload and the parallelism of the normal workload, this
translates to a performance improvement of the normal workload that
corresponds roughly to the increase in frequency (for CPU-bound tasks)
[4]. Depending on the processor, that can be anything from just a few
percent to about a factor of 2.
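For illustration, the setup looked roughly like the sketch below. It
assumes a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu and the
cpu.scheduled knob from this RFC; the group name, the share value, and
the coscheduling level are illustrative and depend on the machine's
topology.

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Write a single value into a cgroup control file. */
static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(EXIT_FAILURE);
	}
}

int main(void)
{
	/* Task group that will hold all niced programs. */
	mkdir("/sys/fs/cgroup/cpu/niced", 0755);

	/* Appropriately reduced shares, well below the default of 1024. */
	write_file("/sys/fs/cgroup/cpu/niced/cpu.shares", "64");

	/*
	 * Coschedule the group. The value selects the scheduling domain
	 * level; "2" is a placeholder for whatever level corresponds to
	 * a whole socket on the given machine.
	 */
	write_file("/sys/fs/cgroup/cpu/niced/cpu.scheduled", "2");

	return 0;
}

Niced tasks are then moved into the group as usual, e.g., by writing
their PIDs to /sys/fs/cgroup/cpu/niced/tasks.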
Regards
Jan
References:
[1] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto,
"Survey of scheduling techniques for addressing shared resources in
multicore processors," ACM Computing Surveys, vol. 45, no. 1, pp.
4:1-4:28, Dec. 2012.
[2] J. H. Schönherr, B. Juurlink, and J. Richling, "TACO: A scheduling
scheme for parallel applications on multicore architectures,"
Scientific Programming, vol. 22, no. 3, pp. 223-237, 2014.
[3] J. H. Schönherr, B. Lutz, and J. Richling, "Non-intrusive coscheduling
for general purpose operating systems," in Proceedings of the
International Conference on Multicore Software Engineering,
Performance, and Tools (MSEPT '12), ser. Lecture Notes in Computer
Science, vol. 7303. Berlin/Heidelberg, Germany: Springer, May 2012,
pp. 66-77.
[4] J. H. Schönherr, J. Richling, M. Werner, and G. Mühl, "A scheduling
approach for efficient utilization of hardware-driven frequency
scaling," in Workshop Proceedings of the 23rd International Conference
on Architecture of Computing Systems (ARCS 2010 Workshops), M. Beigl
and F. J. Cazorla-Almeida, Eds. Berlin, Germany: VDE Verlag, Feb.
2010, pp. 367-376.