Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr

From: Paul E. McKenney
Date: Sun Sep 15 2019 - 10:42:02 EST


On Sat, Sep 14, 2019 at 07:35:02AM -0500, Eric W. Biederman wrote:
>
> The current task on the runqueue is currently read with rcu_dereference().
>
> To obtain ordinary rcu semantics for an rcu_dereference of rq->curr it needs
> to be paired with rcu_assign_pointer of rq->curr, which provides the
> memory barrier necessary to order assignments to the task_struct
> and the assignment to rq->curr.
>
> Unfortunately the assignment of rq->curr in __schedule is a hot path,
> and it has already been shown that additional barriers in that code
> will reduce the performance of the scheduler. So I will attempt to
> describe below why you can effectively have ordinary rcu semantics
> without any additional barriers.
>
> The assignment of rq->curr in init_idle is a slow path called once
> per cpu and that can use rcu_assign_pointer() without any concerns.
>
> As I write this there are effectively two users of rcu_dereference on
> rq->curr. There is the membarrier code in kernel/sched/membarrier.c
> that only looks at "->mm" after the rcu_dereference. Then there is
> task_numa_compare() in kernel/sched/fair.c. My best reading of the
> code shows that task_numa_compare only accesses: "->flags",
> "->cpus_ptr", "->numa_group", "->numa_faults[]",
> "->total_numa_faults", and "->se.cfs_rq".
>
> The code in __schedule() essentially does:
> 	rq_lock(...);
> 	smp_mb__after_spinlock();
>
> 	next = pick_next_task(...);
> 	rq->curr = next;
>
> 	context_switch(prev, next);
>
> At the start of the function the rq_lock/smp_mb__after_spinlock
> pair provides a full memory barrier. Further there is a full memory barrier
> in context_switch().
>
> This means that any task that has already run and modified itself (the
> common case) has already seen two memory barriers before __schedule()
> runs and begins executing. A task that modifies itself then sees a
> third full memory barrier pair with the rq_lock();
>
> For a brand new task that is enqueued with wake_up_new_task() there
> are the memory barriers present from taking and releasing the
> pi_lock and the rq_lock as the process is enqueued, as well as the
> full memory barrier at the start of __schedule(), assuming __schedule()
> happens on the same cpu.
>
> This means that by the time we reach the assignment of rq->curr
> except for values on the task struct modified in pick_next_task
> the code has the same guarantees as if it used rcu_assign_pointer.
>
> Reading through all of the implementations of pick_next_task it
> appears pick_next_task is limited to modifying the task_struct fields
> "->se", "->rt", "->dl". These fields are the sched_entity structures
> of the varies schedulers.

s/varies/various/ for whatever that is worth.

> Further "->se.cfs_rq" is only changed in cgroup attach/move operations
> initiated by userspace.
>
> Unless I have missed something this means that in practice the
> users of "rcu_dereference(rq->curr)" get normal rcu semantics of
> rcu_dereference() for the fields they care about, despite the
> assignment of rq->curr in __schedule() not using rcu_assign_pointer.

The reasoning makes sense. I have not double-checked all the code.
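
For anyone following along, here is a rough user-space sketch of the
ordering argument above. It is not kernel code: C11 atomics stand in for
the kernel primitives, and "struct task", "curr", and all three function
names are hypothetical. A release store plays the role of
rcu_assign_pointer(), a relaxed store the role of RCU_INIT_POINTER(), and
a seq_cst fence the role of the rq_lock()/smp_mb__after_spinlock() pair.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct. */
struct task {
	int mm_id;	/* plays the role of a field such as ->mm */
};

/* Hypothetical stand-in for rq->curr. */
static _Atomic(struct task *) curr;

/*
 * Analogue of rcu_assign_pointer(): the release store itself orders
 * all earlier writes to *t before the pointer becomes visible.
 */
static void publish_release(struct task *t)
{
	atomic_store_explicit(&curr, t, memory_order_release);
}

/*
 * Analogue of the __schedule() pattern: a full barrier early on,
 * then a store with no ordering of its own (like RCU_INIT_POINTER()).
 * Writes to *t made before the fence are ordered before the store;
 * writes made between the fence and the store are not.
 */
static void publish_after_fence(struct task *t)
{
	atomic_thread_fence(memory_order_seq_cst);
	/* A pick_next_task()-style update to *t here would be unordered. */
	atomic_store_explicit(&curr, t, memory_order_relaxed);
}

/*
 * Analogue of rcu_dereference(). Acquire is stronger than the
 * dependency ordering the kernel relies on, but it is portable C11.
 */
static struct task *read_curr(void)
{
	return atomic_load_explicit(&curr, memory_order_acquire);
}
```

In this model, a reader that obtains a pointer from read_curr() sees
every field written before the fence in publish_after_fence(), which is
the same guarantee publish_release() gives for fields written before the
store. Only fields touched between the fence and the store fall outside
it, matching the caveat about pick_next_task().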

> Link: https://lore.kernel.org/r/20190903200603.GW2349@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> ---
> kernel/sched/core.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 69015b7c28da..668262806942 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
>
>  	if (likely(prev != next)) {
>  		rq->nr_switches++;
> -		rq->curr = next;
> +		/*
> +		 * RCU users of rcu_dereference(rq->curr) may not see
> +		 * changes to task_struct made by pick_next_task().
> +		 */
> +		RCU_INIT_POINTER(rq->curr, next);
>  		/*
>  		 * The membarrier system call requires each architecture
>  		 * to have a full memory barrier after updating
> @@ -5863,7 +5867,8 @@ void init_idle(struct task_struct *idle, int cpu)
>  	__set_task_cpu(idle, cpu);
>  	rcu_read_unlock();
>
> -	rq->curr = rq->idle = idle;
> +	rq->idle = idle;
> +	rcu_assign_pointer(rq->curr, idle);
>  	idle->on_rq = TASK_ON_RQ_QUEUED;
>  #ifdef CONFIG_SMP
>  	idle->on_cpu = 1;
> --
> 2.21.0.dirty

So this looks good in and of itself, but I still do not see what prevents
the unfortunate sequence of events called out in my previous email.
On the other hand, if ->rcu and ->rcu_users were not allocated on top
of each other by a union, I would be happy to provide a Reviewed-by.

And I am fundamentally distrustful of a refcount_dec_and_test() that
is immediately followed by code that clobbers the now-zero value.
Yes, this does have valid use cases, but it has a lot more invalid
use cases. The valid use cases have excluded all increments by some
other means, so that the refcount_dec_and_test() call's only job is to
synchronize between concurrent calls to put_task_struct_rcu_user().
But I am not seeing what excludes all increments here.
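
To make the general hazard concrete, here is a small single-threaded
sketch. It is only one reading of the concern and not the kernel's
layout: "struct obj", "struct cb_head", and every function name are
hypothetical, and the counter is widened to uintptr_t so that it overlays
the first pointer of the callback head exactly. It assumes the usual
case where a non-NULL pointer has a nonzero bit pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *);
};

/* Hypothetical object whose counter and callback head share storage. */
struct obj {
	union {
		uintptr_t users;	/* plays the role of ->rcu_users */
		struct cb_head cb;	/* plays the role of ->rcu */
	};
};

static struct cb_head pending;	/* hypothetical existing callback in a list */

static void noop_cb(struct cb_head *head)
{
	(void)head;
}

/* Drop one reference; true when the count reaches zero. */
static bool put_and_test(struct obj *o)
{
	return --o->users == 0;
}

/* Queue the callback; this overwrites the storage that held the zero. */
static void queue_cb(struct obj *o)
{
	o->cb.next = &pending;
	o->cb.func = noop_cb;
}

/* What a late "is the count nonzero?" check would observe. */
static bool users_look_live(const struct obj *o)
{
	return o->users != 0;
}
```

After put_and_test() returns true, users_look_live() reports false, but
once queue_cb() has run it reports true again because the counter now
reads back pointer bits. That is harmless only if nothing can examine or
increment the counter after the final decrement.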

So, what am I missing here?

Thanx, Paul