Re: Unbounded priority inversion while assigning tasks into cgroups.

From: Sebastian Andrzej Siewior
Date: Wed Oct 27 2021 - 12:58:10 EST


On 2021-10-25 11:43:52 [+0200], Ronny Meeus wrote:
> Hello
Hi,

> an unbounded priority inversion is observed when moving tasks into cgroups.
> In my case I'm using the cpu and cpuacct cgroups but the issue is
> independent of this.
>
> Kernel version: 4.9.79
> CPU: Dual core Cavium Octeon (MIPS)
> Kernel configured with CONFIG_PREEMPT=y
>
> I have a small application running at RT priority 92.
> Its job is to move high CPU consuming applications into a cgroup when
> the system is under high load.
> Under extreme load conditions (meaning a lot of script processing
> (process clone / exec / exit) and high application load), sometimes
> the application hangs for a long time (can be a couple of seconds but
> also hangs of 2 minutes are observed already).
>
> Extending the kernel with traces (see below) showed that the
> root-cause of the blocking is the global rwsem
> "cgroup_threadgroup_rwsem".
> While adding a task into the cgroup (__cgroup_procs_write), the write
> lock is taken which will have to wait until all writers and readers
> have completed their critical section which can take very long.
> Especially since there are many of them running at a much lower
> priority and we have also applications running at medium priority
> running with a very high load.
>
> As an initial attempt I tried applying the RT patch but this does not
> resolve the issue.
>
> The second attempt was to replace the cgroup_threadgroup_rwsem by a
> rt_mutex (which offers priority inheritance).
> After this change the issue seems to be resolved.
> A disadvantage of this approach is that all accesses to the critical
> section are serialized on all cores (writes to assign tasks to cgroups
> and reads to create/exec/exit processes).
>
> For the moment I do not see any other alternative to resolve this problem.
> Any advice on the right way forward would be appreciated.

From looking at the percpu_rw_semaphore implementation, no new readers are
allowed as long as there is a writer pending. The writer has
(unfortunately) to wait until all readers are out. But then I doubt that
it takes up to two minutes for all existing readers to leave the
critical section.
Looking at v4.9.84, at least the RT implementation of rw_semaphore
allows new readers if a writer is pending. So this could be the culprit, as
you would have to wait until all readers are gone and the writer needs to
grab the lock before another reader shows up. But then this shouldn't be
the case for the generic implementation, where new readers should wait
until the writer got its chance.
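
To illustrate the difference between the two admission policies, here is
a minimal single-threaded toy model in userspace C. It is not kernel
code and does not reflect how percpu_rw_semaphore or the RT rw_semaphore
are actually implemented; all names (toy_rwsem, readers_bypass_writer,
toy_writer_retries, ...) are made up for this sketch. It only models the
rule "may a new reader enter while a writer is pending?" and counts how
often the writer fails to get the lock while readers keep overlapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model of reader/writer admission. Not kernel code. */
struct toy_rwsem {
	int  active_readers;		/* readers inside the critical section */
	bool writer_pending;		/* a writer has announced itself */
	bool writer_active;		/* a writer holds the lock */
	bool readers_bypass_writer;	/* hypothetical knob: admit readers
					 * even though a writer is pending */
};

/* A reader is refused while a writer holds the lock, and also while a
 * writer is pending unless the bypass knob is set. */
static bool toy_read_trylock(struct toy_rwsem *s)
{
	if (s->writer_active)
		return false;
	if (s->writer_pending && !s->readers_bypass_writer)
		return false;
	s->active_readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwsem *s)
{
	assert(s->active_readers > 0);
	s->active_readers--;
}

/* The writer announces itself; from now on it counts as pending. */
static void toy_write_announce(struct toy_rwsem *s)
{
	s->writer_pending = true;
}

/* The writer can only proceed once every reader has left. */
static bool toy_write_trylock(struct toy_rwsem *s)
{
	if (!s->writer_pending || s->writer_active || s->active_readers > 0)
		return false;
	s->writer_pending = false;
	s->writer_active = true;
	return true;
}

static void toy_write_unlock(struct toy_rwsem *s)
{
	assert(s->writer_active);
	s->writer_active = false;
}

/* One reader is inside when the writer announces itself. In every round
 * a new reader tries to enter just before an older one leaves, then the
 * writer retries. Returns how many writer attempts failed. */
static int toy_writer_retries(bool bypass, int rounds)
{
	struct toy_rwsem s = { .readers_bypass_writer = bypass };
	int failed = 0;

	toy_read_trylock(&s);		/* reader already inside */
	toy_write_announce(&s);

	for (int i = 0; i < rounds; i++) {
		toy_read_trylock(&s);	/* a new reader shows up */
		toy_read_unlock(&s);	/* an older reader leaves */
		if (toy_write_trylock(&s))
			break;
		failed++;
	}
	return failed;
}
```

With the bypass knob off, toy_writer_retries(false, n) returns 0: the
new reader is refused, the last old reader leaves and the writer gets
in. With the knob on, the writer fails in every round for as long as
readers keep overlapping, so the wait is bounded only by the reader
load and not by the writer's priority.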

> Best regards,
> Ronny

Sebastian