Re: [PATCH] sched: Better document ttwu()
From: Valentin Schneider
Date: Sat Jul 04 2020 - 20:22:02 EST
On 03/07/20 14:32, Peter Zijlstra wrote:
> Dave hit the problem fixed by commit:
>
> b6e13e85829f ("sched/core: Fix ttwu() race")
>
> and failed to understand much of the code involved. Per his request a
> few comments to (hopefully) clarify things.
>
> Requested-by: Dave Chinner <david@xxxxxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
A few extra comments below, otherwise FWIW:
Reviewed-by: Valentin Schneider <valentin.schneider@xxxxxxx>
> +++ b/kernel/sched/core.c
> @@ -79,6 +79,100 @@ __read_mostly int scheduler_running;
> */
> int sysctl_sched_rt_runtime = 950000;
>
> +
> +/*
> + * Serialization rules:
> + *
> + * Lock order:
> + *
> + * p->pi_lock
> + * rq->lock
> + * hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
> + *
> + * rq1->lock
> + * rq2->lock where: rq1 < rq2
> + *
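For the casual reader: this nesting is exactly what task_rq_lock()
implements - roughly the below, going off my reading of
kernel/sched/core.c (retry loop and lockdep pinning elided):

	/* Outer lock first, then the lock of the task's current rq */
	raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
	rq = task_rq(p);
	raw_spin_lock(&rq->lock);

and the rq1 < rq2 rule is the pointer comparison double_rq_lock() uses
to take two runqueue locks without deadlocking against a concurrent
double_rq_lock(rq2, rq1).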
> + * Regular state:
> + *
> + * Normal scheduling state is serialized by rq->lock. __schedule() takes the
> + * local CPU's rq->lock; it optionally removes the task from the runqueue and
> + * always looks at the local rq data structures to find the most eligible task
> + * to run next.
> + *
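Perhaps also worth pointing out that this is visible in the very shape
of __schedule(); a heavily elided sketch of it (not the actual code):

	rq = cpu_rq(smp_processor_id());
	rq_lock(rq, &rf);
	if (!preempt && prev->state)	/* optionally dequeue prev */
		deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
	next = pick_next_task(rq, prev, &rf); /* only looks at local rq */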
> + * Task enqueue is also under rq->lock, possibly taken from another CPU.
> + * Wakeups from another LLC domain might use an IPI to transfer the enqueue to
> + * the local CPU to avoid bouncing the runqueue state around [ see
> + * ttwu_queue_wakelist() ].
> + *
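(For reference, IIRC the cache-sharing check lives in ttwu_queue_cond()
and is roughly:

	/*
	 * Wakee's CPU doesn't share cache with this CPU: hand the
	 * enqueue over via an IPI rather than touch its rq from here.
	 */
	if (!cpus_share_cache(smp_processor_id(), cpu))
		return true;
)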
> + * Task wakeup, specifically wakeups that involve migration, is horribly
> + * complicated to avoid having to take two rq->locks.
> + *
> + * Special state:
> + *
> + * System-calls and anything external will use task_rq_lock() which acquires
> + * both p->pi_lock and rq->lock. As a consequence the state they change is
> + * stable while holding either lock:
> + *
> + * - sched_setaffinity()/
> + * set_cpus_allowed_ptr(): p->cpus_ptr, p->nr_cpus_allowed
> + * - set_user_nice(): p->se.load, p->*prio
> + * - __sched_setscheduler(): p->sched_class, p->policy, p->*prio,
> + * p->se.load, p->rt_priority,
> + * p->dl.dl_{runtime, deadline, period, flags, bw, density}
Uclamp stuff can also get updated in __sched_setscheduler(); see
__setscheduler_uclamp(). It's only p->uclamp_req AFAICT, but I don't think
there's harm in just saying p->uclamp*.
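That is, roughly (simplified from my reading of __setscheduler_uclamp(),
which like the rest of __sched_setscheduler() runs with both p->pi_lock
and rq->lock held):

	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
			      attr->sched_util_min, true);
	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
			      attr->sched_util_max, true);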
> + * - sched_setnuma(): p->numa_preferred_nid
> + * - sched_move_task()/
> + * cpu_cgroup_fork(): p->sched_task_group
> + * - uclamp_update_active(): p->uclamp*
> + *