Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu

From: Peter Zijlstra
Date: Tue Jul 15 2014 - 12:37:37 EST

In reply to: Tejun Heo: "Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu"

On Tue, Jul 15, 2014 at 11:21:49AM -0400, Tejun Heo wrote:
> On Tue, Jul 15, 2014 at 03:36:27PM +0200, Peter Zijlstra wrote:
> > So, just to expand on this, we're already getting 'bug' reports because
> > worker threads are not cgroup aware. If work gets generated inside some
> > cgroup, the worker doesn't care and runs the worker thread wherever
> > (typically the root cgroup).
> >
> > This means that the 'work' escapes the cgroup confines and creates
> > resource inversion etc. The same is of course true for nice and RT
> > priorities.
> >
> > TJ, are you aware of this and/or given it any thought?
>
> Yeap, I'm aware of the issue but haven't read any actual bug reports
> yet. Can you point me to the reports?

lkml.kernel.org/r/53A8EC1E.1060504@xxxxxxxxxxxxxxxxxx

The root level workqueue thingies disturb the cgroup level scheduling to
'some' extent.

That whole thread is somewhat confusing and I think there's more than
just this going on, but they're really seeing this as a pain point.

> Given that worker pool management is dynamic, spawning separate pools
> for individual cgroups on-demand should be doable. Haven't been able
> to decide how much we should be willing to pay in terms of complexity
> yet.

Yah, I figured. Back before you ripped up the workqueue I had a
worklet-PI patch in -rt, which basically sorted and ran works in a
RR/FIFO priority order, including boosting the current work when a
higher prio one was pending etc.
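
[Illustrative sketch, not the -rt patch itself: a minimal userspace model
of "sort works by priority, boost the running one when a higher prio work
is pending". All names here (struct work, struct worker, worker_queue,
worker_next, worker_effective_prio) are hypothetical, and "higher number
means more urgent" is an assumption made for the example.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types; none of this is taken from the kernel tree. */
struct work {
	int prio;		/* assumption: larger value = more urgent */
	struct work *next;
};

struct worker {
	struct work *pending;	/* sorted list, highest prio first */
	struct work *running;	/* work currently being executed, or NULL */
};

/* Insert in descending prio order; equal prios stay FIFO. */
static void worker_queue(struct worker *wk, struct work *w)
{
	struct work **pp = &wk->pending;

	assert(w != NULL);
	while (*pp && (*pp)->prio >= w->prio)
		pp = &(*pp)->next;
	w->next = *pp;
	*pp = w;
}

/* Pop the highest prio pending work, or NULL if nothing is queued. */
static struct work *worker_next(struct worker *wk)
{
	struct work *w = wk->pending;

	if (w) {
		wk->pending = w->next;
		w->next = NULL;
	}
	return w;
}

/*
 * Priority the worker thread should run at: that of the running work,
 * boosted to the head of the pending list when that one is higher.
 */
static int worker_effective_prio(const struct worker *wk)
{
	int prio = wk->running ? wk->running->prio : 0;

	if (wk->pending && wk->pending->prio > prio)
		prio = wk->pending->prio;
	return prio;
}
```

[The boost is computed rather than stored, so it drops away by itself
once the higher prio work has been dequeued.]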

I never really figured out a way to make the new concurrent stuff do
something like that, and this 'problem' here is harder still, because
they're not static prios etc.

Ideally we'd run the works _in_ the same task-context (from a scheduler
POV) as the task creating the work. There's some very obvious problems
of implementation there, and some less obvious others, so bleh.

Also, there's the whole softirq trainwreck, which has many of the same
problems. Much of the network stack isn't necessarily aware for whom
they're doing work, so no way to propagate.

Case in point: the crypto stuff I suppose, that's a combination of
the two, god only knows who we should be accounting it to and in what
context things should run.

Ideally a socket has a 'single' (ha! if only) owner, and we'd know
throughout the entire rx/tx paths, but I doubt we actually have that.

(Note that there's people really suffering because of this..)

Same for the 'shiny' block-mq stuff I suppose :-(

