Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
From: Valentin Schneider
Date: Fri Sep 06 2019 - 18:51:06 EST
On 06/09/2019 18:10, Parth Shah wrote:
> Right, CPU capacity can solve the problem of indicating the thermal throttle to the scheduler.
> AFAIU, the patchset from Thara changes CPU capacity to reflect Thermal headroom of the CPU.
> This is a nice mitigation but,
> 1. Sometimes a single task is responsible for the Thermal heatup of the core, reducing the
> CPU capacity of all the CPUs in the core is not optimal when just moving such single
> task to other core can allow us to remain within thermal headroom. This is important
> for the servers especially where there are upto 8 threads.> 2. Given the implementation in the patches and its integration with EAS, it seems difficult
> to adapt to servers, where CPU capacity itself is in doubt.
I'd nuance this to *SMT* capacity (which isn't just servers). The thing is
that it's difficult to come up with a sensible scheme to describe the base
capacity of a single logical CPU. But yeah, valid point.
>> For active balance, we actually already have a condition that moves a task
>> to a less capacity-pressured CPU (although it is somewhat specific). So if
>> thermal pressure follows that task (e.g. it's doing tons of vector/float),
>> it will be rotated around.
> Agree. But this should break in certain conditions like when we have multiple tasks
> in a core with almost equal utilization among which one is just doing vector operations.
> LB can pick and move any task with equal probability if the capacity is reduced here.
Right, if/when we get things like per-unit signals (wasn't there something
about tracking AVX a few months back?) then we'll be able to make
more informed decisions, for now we'll need some handholding (read: task
>> However there should be a point made on latency vs throughput. If you
>> care about latency you probably do not want to active balance your task. If
> Can you please elaborate on why not to consider active balance for latency sensitive tasks?
> Because, sometimes finding a thermally cool core is beneficial when Turbo frequency
> range is around 20% above rated ones.
This goes back to my reply to Patrick further up the thread.
Right now active balance can happen just because we've been imbalanced for
some time and repeatedly failed to migrate anything. After 3 (IIRC) successive
failed attempts, we'll active balance the running task of the remote rq we
decided was busiest.
If that happens to be a latency sensitive task, that's not great - active
balancing means stopping that task's execution, so we're going to add some
latency to this latency-sensitive task. My proposal was to further ratelimit
active balance (e.g. require more failed attempts) when the task that would be
preempted is latency-sensitive.
My point is: if that task is doing fine where it is, why preempt it? That's
just introducing latency IMO (keeping in mind that those balance attempts
could happen despite not having any thermal pressure).
If you care about performance (e.g. a minimum level of throughput), to me
that is a separate (though perhaps not entirely distinct) property.
>> you care about throughput, it should be specified in some way (util-clamp
>> says hello!).
> yes I do care for latency and throughput both. :-)
Don't we all!
> but I'm wondering how uclamp can solve the problem for throughput.
> If I make the thermally hot tasks to appear bigger than other tasks then reducing
> CPU capacity can allow such tasks to move around the chip.
> But this will require the utilization value to be relatively large compared to the other
> tasks in the core. Or other task's uclamp.max can be lowered to make such task rotate.
> If I got it right, then this will be a difficult UCLAMP usecase from user perspective, right?
> I feel like I'm missing something here.
Hmm perhaps I was jumping the gun here. What I was getting to is if you have
something like misfit that migrates tasks to CPUs of higher capacity than the
one they are on, you could use uclamp to flag them.
You could translate your throughput requirement as a uclamp.min of e.g. 80%,
and if the CPU capacity goes below that (or close within a margin) then you'd
try to migrate the task to a CPU of higher capacity (i.e. not or less
This doesn't have to involve your less throughput-sensitive tasks, since you
would only tag and take action for your throughput-sensitive tasks.