Re: [PATCH] sched: properly account IRQ and RT load in SCHED_OTHER load balancing

From: Gregory Haskins
Date: Thu Aug 21 2008 - 08:28:35 EST


Ingo Molnar wrote:
* Gregory Haskins <ghaskins@xxxxxxxxxx> wrote:

I haven't had a chance to review the code thoroughly yet, but I had been working on a similar fix and know that this is sorely needed. So...

btw., why exactly does this patch speed up certain workloads? I'm not quite sure about the exact reasons of that.

Ingo

I used to have a great demo for the prototype I was working on, but I'd have to dig it up. The gist of it is that the pre-patched scheduler basically gets thrown for a complete loop in the presence of a mixed CFS/RT environment. This isn't a PREEMPT_RT-specific problem per se, though PREEMPT_RT does bring the problem to the forefront since it has so many active RT tasks by default (for the IRQs, etc.), which makes it more evident.

Since the "load" that an RT task previously declared did not actually express the true nature of the RQ load, CFS tasks would have a few really nasty things happen to them while trying to run on the system simultaneously. One of them was that you could starve out CFS tasks on certain cores (even though there was plenty of CPU bandwidth available elsewhere) and the load-balancer would think everything was fine and thus fail to make adjustments.

Say you have a 4 core system. You could, for instance, get into a situation where the softirq-net-rx thread was consuming 80% of core 0, yet the load balancer would still spread, say, a 40 thread CFS load evenly across all cores (approximately 10 per core, though you would account for the "load" that the softirq thread contributed too). The threads on the other cores would of course enjoy 100% bandwidth, while the ~10 threads on core 0 would only see 1/5th of that bandwidth.

What it comes down to is that the CFS load should have been evenly distributed across the available bandwidth of 3*100% + 1*20%, not 4*100% as it is today. The net result is that the application performs in a very lopsided manner, with some threads getting significantly less (or sometimes zero!) cpu time compared to their peers. You can make this more obvious by nice'ing the CFS load up as high as it will go, which will approximate 1/2 of the load of the softirq (since RT tasks previously enjoyed a 2*MAX_SCHED_OTHER_LOAD rating).
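
To put rough numbers on the 4-core example above, here is a toy model. It is not the kernel's balancing code: the function names are made up for illustration, and it assumes that each core's leftover bandwidth is simply divided equally among the CFS threads placed on it.

```python
# Toy model of the 4-core example. NOT kernel code; names are hypothetical.
# Capacities are the percentage of each core left over for CFS tasks.

def even_split(n_threads, capacities):
    """Spread threads equally across cores, ignoring leftover capacity."""
    base, extra = divmod(n_threads, len(capacities))
    return [base + (1 if i < extra else 0) for i in range(len(capacities))]

def capacity_weighted_split(n_threads, capacities):
    """Spread threads in proportion to leftover capacity (largest remainder)."""
    total = sum(capacities)
    counts = [n_threads * c // total for c in capacities]
    remainders = [n_threads * c % total for c in capacities]
    leftover = n_threads - sum(counts)
    # Hand the remaining threads to the cores with the largest remainders.
    order = sorted(range(len(capacities)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

def per_thread_share(counts, capacities):
    """Percent of one core that each thread on a given core receives."""
    return [c / n if n else 0.0 for n, c in zip(counts, capacities)]

# Core 0 loses 80% to the softirq thread; cores 1-3 are fully available.
capacities = [20, 100, 100, 100]

even = even_split(40, capacities)
weighted = capacity_weighted_split(40, capacities)

print("even split:    ", even, per_thread_share(even, capacities))
print("weighted split:", weighted, per_thread_share(weighted, capacities))
```

With the even split, threads on core 0 get 2% of a core each while their peers get 10%, which is the 1/5th ratio described above. With the capacity-weighted split, the per-thread shares across cores end up much closer together.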

I have observed this phenomenon (and its fix) while looking at things like network intensive workloads. I'm sure there are plenty of others that could cause similar ripples.

The fact is, the scheduler treats "load" to mean certain things which simply did not apply to RT tasks. As you know very well, I'm sure ;), "load" is a metric which expresses the share of the cpu that will be consumed, and this is used by the load balancer to make its decisions. However, you can put whatever rating you want on an RT task and it would always be irrelevant. RT tasks run as frequently and as long as they want (w.r.t. SCHED_OTHER) independent of what their load rating implies to the balancer, so you cannot make an accurate assessment of the true "available shares". This is why the load-balancer would become confused and fail to see true imbalance in a mixed environment. Fixing this, as Peter has attempted to do, will result in a much better distribution of SCHED_OTHER tasks across the true available bandwidth, and thus improve overall performance.
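
The mismatch between what a load rating implies and what an RT task actually takes can be sketched as follows. This is only an illustration under assumed numbers: the weights below are arbitrary placeholders rather than the kernel's real values, and both helper functions are hypothetical.

```python
# Illustration only: placeholder weights, hypothetical helper names.

def share_implied_by_weights(task_weight, all_weights):
    """Share of the cpu a task 'should' get if load ratings told the whole story."""
    return task_weight / sum(all_weights)

def actual_cfs_share(task_weight, cfs_weights, rt_utilization):
    """Share a CFS task really gets: RT takes what it wants, CFS splits the rest."""
    remaining = 1.0 - rt_utilization
    return remaining * task_weight / sum(cfs_weights)

CFS_WEIGHT = 1024            # placeholder rating for one CFS task
RT_WEIGHT = 2 * CFS_WEIGHT   # placeholder fixed rating for the RT task
cfs_weights = [CFS_WEIGHT] * 10

# What the ratings imply for one CFS task sharing a core with the RT task.
implied = share_implied_by_weights(CFS_WEIGHT, cfs_weights + [RT_WEIGHT])

# What that CFS task actually receives if the RT task consumes 80% of the core.
actual = actual_cfs_share(CFS_WEIGHT, cfs_weights, rt_utilization=0.8)

print(f"implied share: {implied:.3f}, actual share: {actual:.3f}")
```

Note that changing RT_WEIGHT moves the implied share but leaves the actual share untouched, since only the RT task's real utilization decides how much bandwidth is left for SCHED_OTHER.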

In previous discussions with people, I had always used a metaphor of a stream. A system running SCHED_OTHER tasks is like a smooth running stream, but dispatching an RT task (or an IRQ, even) is like throwing a boulder into the water. It makes a big disruptive splash and causes turbulent white water behind it. And the stream has no influence over the size of the boulder, its placement in the stream, nor how long it will be staying.

This fix (at least in concept) allows it to become more like gently slipping a streamlined aerodynamic object into the water. The stream still cannot do anything about the size or placement of the object, but it can at least flow around it and smoothly adapt to the reduced volume of water that the stream can carry. :)

HTH
-Greg
