Re: [Discussion v2] Usecases for the per-task latency-nice attribute

From: Tim Chen
Date: Mon Oct 07 2019 - 13:06:38 EST


On 10/2/19 9:11 AM, David Laight wrote:
> From: Parth Shah
>> Sent: 30 September 2019 11:44
> ...
>> 5> Separating AVX512 tasks and latency sensitive tasks on separate cores
>> ( -Tim Chen )
>> ===========================================================================
>> Another usecase we are considering is to segregate workloads that pull
>> down the core CPU frequency (e.g. AVX512) from workloads that are latency
>> sensitive. Certain tasks need to provide a fast response time (latency
>> sensitive), and they are best scheduled on a CPU that has a lighter load
>> and has no other tasks running on the sibling CPU that could pull down
>> the core frequency.
>>
>> Some users are running machine learning batch tasks with AVX512, and have
>> observed that these tasks affect the tasks needing a fast response. They
>> have to rely on manual CPU affinity to separate these tasks. With an
>> appropriate latency hint on the task, the scheduler can be taught to separate them.
>
> Has this been diagnosed properly?
> I can't really see how the frequency drop from AVX512 significantly affects latency.
> Most tasks that require low latency probably don't do a lot of work.
> It is much more likely that the latency issues happen because the AVX512 tasks
> are doing very few system calls so can't be pre-empted even by a high priority task.

Several customers have reported this problem to us. The issue is not
that preempting an AVX512 task on the same logical CPU is slow. It is
that an AVX512 task on the sibling CPU thread drops the core frequency,
which lowers the performance and response time of the latency-sensitive
task. Suppose you make the latency-sensitive task a real-time task with
high priority, so that it runs on a CPU immediately after being woken.
It will still run slower with an AVX512 task on the sibling thread than
with any other kind of task there.

This is the noisy-neighbor effect, so it is better to isolate the
latency-sensitive tasks on cores where no AVX512 tasks run.
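
For reference, the manual affinity workaround those users rely on today
can be sketched roughly as below. This is only an illustration in Python
and not part of any proposal: the helper names (parse_cpu_list,
split_by_core, read_sibling_lists, pin) and the policy of reserving the
lowest-numbered cores are assumptions made for the example. The only
system facilities it relies on are the sysfs thread_siblings_list files
and sched_setaffinity().

```python
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def split_by_core(sibling_lists, latency_cores):
    """Split CPUs into (latency, batch) sets along physical-core boundaries.

    sibling_lists maps a CPU number to the text of its thread_siblings_list.
    The first `latency_cores` cores (ordered by lowest CPU number) go to the
    latency-sensitive set (an arbitrary choice for this sketch); all other
    cores go to the batch (e.g. AVX512) set. Siblings are never split.
    """
    cores = sorted(
        {frozenset(parse_cpu_list(s)) for s in sibling_lists.values()},
        key=min,
    )
    latency = set().union(*cores[:latency_cores])
    batch = set().union(*cores[latency_cores:])
    return latency, batch


def read_sibling_lists():
    """Read thread_siblings_list for every CPU this process may run on."""
    lists = {}
    for cpu in os.sched_getaffinity(0):
        path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
        with open(path) as f:
            lists[cpu] = f.read()
    return lists


def pin(pid, cpus):
    """Restrict pid (0 = calling process) to cpus; cpus must not be empty."""
    os.sched_setaffinity(pid, cpus)
```

For example, latency, batch = split_by_core(read_sibling_lists(), 2)
followed by pin(pid, latency) for each latency-sensitive task and
pin(pid, batch) for each AVX512 task keeps the two classes off each
other's sibling threads. With a latency hint the scheduler could make
this separation itself, without the per-task bookkeeping.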

Tim

> This 'feature' is hinted by this:
>> 2> TurboSched
>> ( -Parth Shah )
>> ====================
>> TurboSched [2] tries to minimize the number of active cores in a socket by
>> packing an un-important and low-utilization (named jitter) task on an
>> already active core and thus refrains from waking up of a new core if
>> possible.
>
> Consider this example of a process that requires low latency (sub 1ms would be good):
> - A hardware interrupt (or timer interrupt) wakes up one thread.
> - When that thread wakes it wakes up other threads that are sleeping.
> - All the threads 'beaver away' for a few ms (processing RTP and other audio).
> - They all sleep for the rest of a 10ms period.
>
> The affinities are set so each thread runs on a separate cpu, and all are SCHED_RR.
> Now loop all the cpus in userspace (run: while :; do :; done) and see what happens to the latencies.
> You really want the SCHED_RR threads to immediately pre-empt the running processes.
> But I suspect nothing happens until a timer interrupt to the target cpu.
>
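
The periodic pattern in the example above can be approximated with a
small measurement loop. This is a rough sketch in Python under stated
assumptions: the helper names (measure_wakeup_overshoot,
try_set_sched_rr) and the priority value are made up for illustration,
and it measures only how late a sleeping thread wakes relative to its
deadline, not the full RTP processing path. Switching to SCHED_RR
normally needs privileges, so that step is optional.

```python
import os
import time


def measure_wakeup_overshoot(period_s=0.010, iterations=100):
    """Sleep until each periodic deadline and record how late we woke up.

    Returns a list of overshoots in seconds, one per period.
    """
    overshoots = []
    deadline = time.monotonic() + period_s
    for _ in range(iterations):
        delay = deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        # Time elapsed past the deadline when we actually resumed running.
        overshoots.append(time.monotonic() - deadline)
        deadline += period_s
    return overshoots


def try_set_sched_rr(priority=10):
    """Try to make the calling process SCHED_RR; return False if refused."""
    try:
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(priority))
        return True
    except OSError:
        return False
```

Comparing max(measure_wakeup_overshoot()) on an idle machine against a
run with "while :; do :; done" loops on the other CPUs would indicate
whether the wake-ups are delayed until the next timer interrupt on the
target CPU.
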
> David