Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets

From: Juri Lelli
Date: Thu Feb 04 2016 - 07:03:45 EST


On 04/02/16 09:54, Juri Lelli wrote:
> Hi Steve,
>
> first of all, thanks a lot for your detailed report; if only all bug
> reports were like this... :)
>
> On 03/02/16 13:55, Steven Rostedt wrote:

[...]

>
> Right. I think this is the same thing that happens after hotplug. IIRC
> the code paths are actually the same. The problem is that hotplug and
> cpuset reconfiguration operations are destructive w.r.t. root_domains,
> so we lose bandwidth information when they happen: we only store
> cumulative bandwidth information in the root_domain, while information
> about which task belongs to which cpuset is stored in cpuset data
> structures.
>
> I tried to fix this a while back, but my attempt was broken: I failed
> to get the locking right and, even though it seemed to fix the issue
> for me, it was prone to race conditions. You might still want to have
> a look at it for reference: https://lkml.org/lkml/2015/9/2/162
>

[...]

>
> It's good that we can recover, but that's still a bug yes :/.
>
> I'll try to see if my broken patch makes what you are seeing
> apparently disappear, so that we can at least confirm that we are
> seeing the same problem; you could do the same if you want, I pushed
> that here
>

No, it doesn't solve this :/. I placed the restoring code in the hotplug
workfn, so updates generated by toggling sched_load_balance aren't
caught, of course. But this at least tells us that we need to solve the
problem somewhere else.

Best,

- Juri