Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

From: Josh Don
Date: Wed Feb 24 2021 - 00:17:16 EST


On Tue, Feb 23, 2021 at 11:26 AM Chris Hyser <chris.hyser@xxxxxxxxxx> wrote:
>
> On 2/23/21 4:05 AM, Peter Zijlstra wrote:
> > On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:
> >> On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
> >> While trying to test the new prctl() code I'm working on, I ran into a bug I
> >> chased back into this v10 code. Under a fair amount of stress, when the
> >> function __sched_core_update_cookie() is ultimately called from
> >> sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
> >> I've not had much success figuring out why/what. I'm running with LOCKDEP on
> >> and seeing no complaints. Duplicating it only requires setting a cookie on a
> >> task and forking a bunch of threads ... all of which then want to update
> >> their cookie.
> >
> > Can you share the code and reproducer?
>
> Attached is a tarball with c code (source) and scripts. Just run ./setup_bug which will compile the source and start a
> bash with a cs cookie. Then run ./show_bug which dumps the cookie and then fires off some processes and threads. Note
> the cs_clone command is not doing any core sched prctls for this test (not needed and currently coded for a diff prctl
> interface). It just creates processes and threads. I see this hang almost instantly.
>
> Josh, I did verify that this occurs on Joel's coresched tree both with and w/o the kprot patch and that should exactly
> correspond to these patches.
>
> -chrish
>

I think I've gotten to the root of this. In the fork code, our cases
for inheriting task_cookie are inverted for CLONE_THREAD vs
!CLONE_THREAD. As a result, we are creating a new cookie per-thread,
rather than inheriting from the parent. Now this is actually ok; I'm
not observing a scalability problem with creating this many cookies.
However, it means that the overall throughput of your binary is roughly
halved, since none of the threads can share a core. Note that I never
saw an indefinite deadlock, just ~2x runtime for your binary vs the
control. I've verified that either a) manually hardcoding all threads to
be able to share regardless of cookie, or b) using a machine with 6
cores instead of 2, allows your binary to complete in the same
amount of time as without the new API.
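
To make the inversion concrete, here is a small userspace sketch. It is
not the kernel code: struct demo_task, demo_new_cookie(),
demo_fork_intended(), demo_fork_inverted() and demo_can_share_core() are
all hypothetical names, and I'm assuming the intended policy is "threads
inherit the parent's task cookie, non-thread forks get a fresh one",
which is just the mirror image of the behavior described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Same bit value as Linux's CLONE_THREAD; redefined here (hypothetical
 * name) so the sketch builds without kernel or glibc-specific headers. */
#define DEMO_CLONE_THREAD 0x00010000UL

/* Hypothetical stand-in for a task; NOT the kernel's task_struct. */
struct demo_task {
	unsigned long task_cookie;
};

/* Hypothetical cookie allocator: hands out unique, non-zero values. */
static unsigned long demo_next_cookie = 1;

static unsigned long demo_new_cookie(void)
{
	return demo_next_cookie++;
}

/* Assumed intended policy: threads share the parent's cookie,
 * non-thread forks get a cookie of their own. */
static void demo_fork_intended(const struct demo_task *parent,
			       struct demo_task *child,
			       unsigned long clone_flags)
{
	if (clone_flags & DEMO_CLONE_THREAD)
		child->task_cookie = parent->task_cookie;
	else
		child->task_cookie = demo_new_cookie();
}

/* Inverted policy, modelling the behavior described above: the test on
 * the flag is flipped, so every thread ends up with a fresh cookie. */
static void demo_fork_inverted(const struct demo_task *parent,
			       struct demo_task *child,
			       unsigned long clone_flags)
{
	if (!(clone_flags & DEMO_CLONE_THREAD))
		child->task_cookie = parent->task_cookie;
	else
		child->task_cookie = demo_new_cookie();
}

/* Two tasks may run on sibling hyperthreads only if their cookies match. */
static bool demo_can_share_core(const struct demo_task *a,
				const struct demo_task *b)
{
	return a->task_cookie == b->task_cookie;
}

#ifdef DEMO_COOKIE_MAIN
int main(void)
{
	struct demo_task parent = { .task_cookie = demo_new_cookie() };
	struct demo_task t1, t2;

	demo_fork_intended(&parent, &t1, DEMO_CLONE_THREAD);
	demo_fork_intended(&parent, &t2, DEMO_CLONE_THREAD);
	assert(demo_can_share_core(&t1, &t2));
	printf("intended: threads can share a core: %d\n",
	       demo_can_share_core(&t1, &t2));

	demo_fork_inverted(&parent, &t1, DEMO_CLONE_THREAD);
	demo_fork_inverted(&parent, &t2, DEMO_CLONE_THREAD);
	assert(!demo_can_share_core(&t1, &t2));
	printf("inverted: threads can share a core: %d\n",
	       demo_can_share_core(&t1, &t2));
	return 0;
}
#endif
```

Build with -DDEMO_COOKIE_MAIN to run the small demo. With the inverted
check, no two sibling threads ever have matching cookies, so each one
needs a core to itself; that matches the ~2x runtime on a small machine
and the recovery when more cores are available.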