Re: [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent

From: Reinette Chatre
Date: Sun Apr 07 2024 - 15:13:23 EST


Hi Peter,

On 4/5/2024 2:30 PM, Peter Newman wrote:
> Hi Reinette,
>
> On Thu, Apr 4, 2024 at 4:09 PM Reinette Chatre
> <reinette.chatre@xxxxxxxxx> wrote:
>> On 3/25/2024 10:27 AM, Peter Newman wrote:
>>> I've been working with users of the recently-added mongroup rename
>>> operation[1] who have observed the impact of back-to-back operations on
>>> latency-sensitive, thread pool-based services. Because changing a
>>> resctrl group's CLOSID (or RMID) requires all task_structs in the system
>>> to be inspected with the tasklist_lock read-locked, a series of move
>>> operations can block out thread creation for long periods of time, as
>>> creating threads needs to write-lock the tasklist_lock.
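
For context, the walk in question looks roughly like the following (a
simplified sketch in the spirit of rdt_move_group_tasks(), not the exact
upstream code; the IDs are passed directly to keep it self-contained):

#include <linux/sched.h>
#include <linux/sched/signal.h>	/* for_each_process_thread() */
#include <linux/sched/task.h>	/* tasklist_lock */

/*
 * Every task_struct in the system is inspected while holding
 * tasklist_lock for read; pthread_create() -> clone() -> copy_process()
 * must take the same lock for write, so thread creation stalls for the
 * duration of the walk.
 */
static void move_group_tasks_sketch(u32 from_closid, u32 to_closid, u32 to_rmid)
{
	struct task_struct *p, *t;

	read_lock(&tasklist_lock);
	for_each_process_thread(p, t) {
		if (t->closid == from_closid) {
			WRITE_ONCE(t->closid, to_closid);
			WRITE_ONCE(t->rmid, to_rmid);
			/* the task picks up the new IDs on its next switch-in */
		}
	}
	read_unlock(&tasklist_lock);
}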
>>
>> Could you please give some insight into the delays experienced? "long
>> periods of time" means different things to different people and this
>> series seems to get more ominous as it progresses: the cover letter
>> starts with "long periods of time" and by the time the final patch
>> appears it has become "disastrous".
>
> There was an incident where the 99.999th-percentile tail latency of a
> service increased from 100 milliseconds to over 2 seconds when the
> container manager switched from our legacy downstream CLOSID-reuse
> technique[1] to mongrp_rename().
>
> A more focused study that benchmarked creating 128 threads with
> pthread_create() on a production host found that moving mongroups
> unrelated to any of the benchmark threads increased the completion CPU
> time from 30ms to 183ms. Profiling showed that the average contention
> time on the tasklist_lock was about 70ms while mongroup move operations
> were taking place.
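
As a point of reference, a minimal sketch of that kind of microbenchmark
(assumed details; the production benchmark surely differs) could look
like the following, built with "gcc -O2 -pthread":

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 128

static void *idle_thread(void *arg)
{
	(void)arg;
	return NULL;
}

static double ms_since(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e3 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	struct timespec start, end;

	/* Process CPU time, to roughly match the "completion CPU time"
	 * figure quoted above. */
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, idle_thread, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

	printf("created and joined %d threads in %.1f ms of CPU time\n",
	       NTHREADS, ms_since(&start, &end));
	return 0;
}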
>
> It's difficult for me to access real production workloads, but I
> estimated a crude figure by measuring the task time of "wc -l
> /sys/fs/resctrl" with perf stat on a relatively idle Intel(R) Xeon(R)
> Platinum 8273CL CPU @ 2.20GHz. As I increased the thread count, the
> measured time converged to a line where every additional 1000 threads
> added about 1 millisecond.

Thank you very much for capturing this. Could you please include this
data in the next posting? It motivates the work significantly better
than terms that cannot be measured.

> Incorporating kernfs_rename() into the solution for changing a group's
> class of service also contributes a lot of overhead (about 90% of a
> mongroup rename seems to be spent here), but the global impact is far
> less than that of the tasklist_lock contention.
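
For anyone trying to picture where that time goes, the rename path has
roughly the following shape (a simplified sketch only; the exact
upstream rdtgroup_rename()/mongrp_reparent() call order may differ):

/*
 * Rough shape of a mon-group rename: the two expensive steps are the
 * system-wide task walk and the kernfs_rename() of the directory.
 */
static int mongrp_rename_sketch(struct rdtgroup *rdtgrp,
				struct rdtgroup *new_prdtgrp)
{
	/* Adopt the new parent's CLOSID. */
	rdtgrp->closid = new_prdtgrp->closid;

	/* Walk every task in the system under tasklist_lock (read). */
	rdt_move_group_tasks(rdtgrp, rdtgrp, NULL);

	/* Move the directory in kernfs -- reportedly ~90% of the cost. */
	return kernfs_rename(rdtgrp->kn, new_prdtgrp->kn, rdtgrp->kn->name);
}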

Is the kernfs_rename() overhead in an acceptable range?

>>> context switch. Updating a group's ID then only requires the current
>>> task to be switched back in on all CPUs. On server hosts with very large
>>> thread counts, this is much less disruptive than making thread creation
>>> globally unavailable. However, this is less desirable on CPU-isolated,
>>> realtime workloads, so I am interested in suggestions on how to reach a
>>> compromise for the two use cases.
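
The idea, if I follow, is roughly the sketch below (the back-pointer
field and the helper name are assumptions for illustration, not the
actual patches):

/*
 * Resolve the CLOSID/RMID through the group at context-switch time
 * (names are illustrative). Changing a group's IDs then only needs
 * each CPU to re-run this path for its current task, e.g. via a
 * rescheduling IPI, instead of a tasklist-wide walk.
 */
static void resctrl_sched_in_sketch(struct task_struct *tsk)
{
	/* hypothetical back-pointer from the task to its group */
	struct rdtgroup *grp = READ_ONCE(tsk->rdtgroup);
	u32 closid = grp ? READ_ONCE(grp->closid) : 0;	/* 0 == default group */
	u32 rmid = grp ? READ_ONCE(grp->mon.rmid) : 0;

	/* IA32_PQR_ASSOC: RMID in the low word, CLOSID in the high word. */
	wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
}

With that in place, the group-ID update itself is cheap; the remaining
cost is getting every CPU to pass through this path once.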
>>
>> As I understand it, this only impacts moving a monitor group? To me this
>> sounds like a user-space-triggered event associated with a particular use
>> case that may not be relevant to the workloads that you refer to. I think
>> this could be something that can be documented for users with this type
>> of workload. (please see patch #6)
>
> All of the existing rmdir cases seem to have the same problem, but
> they must not be used frequently enough for any concerns to be raised.
>
> It seems to be routine for a host's workload to be increased until
> memory bandwidth is saturated, so allocation restrictions are applied
> and removed rather frequently.

Thank you.

Reinette