Re: [RFC] The Linux Scheduler: a Decade of Wasted Cores Report

From: Peter Zijlstra
Date: Mon Apr 25 2016 - 05:34:31 EST


On Sat, Apr 23, 2016 at 06:38:25PM -0700, Brendan Gregg wrote:
> On Sat, Apr 23, 2016 at 11:20 AM, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
> >
> > Interesting read.
> >
> > http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
> >
> > "... The Linux kernel scheduler has deficiencies that prevent a
> > multicore system from making proper use of all cores for heavily
> > multithreaded loads, according to a lecture and paper delivered
> > earlier this month at the EuroSys '16 conference in London, ..."
> >
> > Any plans to incorporate these fixes?

No; their patches are completely butchering things. Also, I don't think
I agree with some of their analysis.

Sadly the paper doesn't provide enough detail to fully reproduce their
results. Nor have I had time to really look into it yet. I was only made
aware of this paper last week -- it was so good of these here folks to
contact me... oh wait.

> While this paper analyzes and proposes fixes for four bugs, it has
> been getting a lot of attention for broader claims about Linux being
> fundamentally broken:
>
> "As a central part of resource management, the OS thread scheduler
> must maintain the following, simple, invariant: make sure that ready
> threads are scheduled on available cores.

This is actually debatable. It is a global invariant, and enforcing it
globally is expensive: finding a runnable task on another runqueue can
cost more than the time the core would otherwise have sat idle.
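
To make that trade-off concrete, this is roughly what the newly-idle
balance path in kernel/sched/fair.c does (a heavily simplified sketch of
the 4.x-era code, with locking and bookkeeping stripped, so not
buildable as-is):

/*
 * Sketch of the cut-off logic in idle_balance(): if this CPU's
 * average idle period is shorter than the cost of a migration, or
 * shorter than what balancing a given domain has historically cost,
 * going to look for work is a net loss and we just stay idle.
 */
static int idle_balance(struct rq *this_rq)
{
	int this_cpu = this_rq->cpu, pulled_task = 0;
	int continue_balancing = 1;
	struct sched_domain *sd;
	u64 curr_cost = 0;

	/* Expected idle time too short to amortize a migration. */
	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return 0;

	for_each_domain(this_cpu, sd) {
		/* Searching this domain costs more than we'd idle for. */
		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
			break;

		if (sd->flags & SD_BALANCE_NEWIDLE) {
			pulled_task = load_balance(this_cpu, this_rq, sd,
						   CPU_NEWLY_IDLE,
						   &continue_balancing);
			curr_cost += sd->max_newidle_lb_cost;
		}

		if (pulled_task)
			break;
	}
	return pulled_task;
}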

> As simple as it may seem, we
> found that this invariant is often broken in Linux. Cores may stay
> idle for seconds while ready threads are waiting in runqueues."

Right, obviously seconds is undesirable.

> Then states that the problems in the Linux scheduler that they found
> cause degradations of "13-24% for typical Linux workloads".
>
> Their proof of concept patches are online[1]. I tested them and saw 0%
> improvements on the systems I tested, for some simple workloads[2]. I
> tested 1 and 2 node NUMA, as that is typical for my employer (Netflix,
> and our tens of thousands of Linux instances in the AWS/EC2 cloud),
> even though I wasn't expecting any difference on 1 node. I've used
> synthetic workloads so far.
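
(For reference, the simplest workload of that flavour -- a sketch of
the general idea, not the actual test harness from [2] -- is just N
spinning threads timed end to end; cores left idle while threads sit
on runqueues show up directly as longer wall-clock time:)

/* Trivial CPU-bound synthetic workload: spawn N busy-loop threads
 * and time them.  With N <= nr_cpus and a scheduler honouring the
 * invariant above, elapsed time should track per-thread work. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LOOPS (1UL << 28)

static void *spin(void *arg)
{
	volatile unsigned long x = 0;
	for (unsigned long i = 0; i < LOOPS; i++)
		x += i;
	return NULL;
}

int main(int argc, char **argv)
{
	int n = argc > 1 ? atoi(argv[1]) : 4;
	pthread_t tid[n];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < n; i++)
		pthread_create(&tid[i], NULL, spin, NULL);
	for (int i = 0; i < n; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%d threads: %.3fs\n", n,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}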

So their setup uses a bigger (not fully connected) NUMA topology, and
I'm not entirely sure how many of their problems are due to that, but at
least one of them is.

Such boxes are fairly rare.
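
Whether a box is fully connected shows up directly in the NUMA node
distance table: on a fully connected machine every off-diagonal entry
is equal. A quick sketch that dumps it from sysfs (assumes contiguous
node numbering; numactl --hardware prints the same information):

/* Print the NUMA node distance table.  Equal off-diagonal entries
 * mean every node is one hop from every other; larger entries mean
 * some remote nodes are multiple hops away. */
#include <stdio.h>

int main(void)
{
	char path[64], buf[256];

	for (int node = 0; ; node++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		FILE *f = fopen(path, "r");
		if (!f)
			break;	/* no more nodes */
		if (fgets(buf, sizeof(buf), f))
			printf("node%d: %s", node, buf);
		fclose(f);
	}
	return 0;
}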

In any case, I'll get to it at some point...