[PATCH 2/3] task: RCU protect tasks on the runqueue

From: Eric W. Biederman
Date: Tue Sep 03 2019 - 00:52:28 EST

In the ordinary case today the RCU grace period of a task comes when
the task is reaped, well after the task has left the runqueue. This
change guarantees that the RCU grace period always happens after a
task has left the runqueue. As that is the common case today I do
not expect any code correctness problems from this change. At most
I anticipate timing challenges.
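
To illustrate what the stronger guarantee enables, a lock-free reader
of rq->curr can then look roughly like the sketch below. The function
name and the pr_debug consumer are made up for illustration and are
not part of this patch:

	static void peek_rq_curr(struct rq *rq)
	{
		struct task_struct *curr;

		rcu_read_lock();
		/*
		 * rq->curr is published with rcu_assign_pointer() and the
		 * task_struct now stays around for a full grace period
		 * after leaving the runqueue, so a plain rcu_dereference()
		 * with no extra reference counting is sufficient here.
		 */
		curr = rcu_dereference(rq->curr);
		if (curr)
			pr_debug("curr pid on this rq: %d\n", curr->pid);
		rcu_read_unlock();
	}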

The only code that will run later is in the functions
perf_event_delayed_put and trace_sched_process_free. The function
perf_event_delayed_put in the final analysis is just a WARN_ON for
cases that I assume should never happen, so I don't see any problem
with delaying it.
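
For reference, as of this series perf_event_delayed_put is essentially
a sanity check of the following shape (paraphrased from memory; see
kernel/events/core.c for the exact body):

	void perf_event_delayed_put(struct task_struct *task)
	{
		int ctxn;

		/* All perf contexts should already be gone by now. */
		for_each_task_context_nr(ctxn)
			WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
	}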

The function trace_sched_process_free is a tracepoint and thus user
space visible. The strangest dependencies can happen, but short of
the bizarre it appears to me that trace_sched_process_free now gets
a slightly more accurate picture of when a task_struct is freed, as
it is now guaranteed that the task is no longer on the runqueue.

Resources for a process are freed in release_task or in
__put_task_struct when the reference count drops to 0. Both of these
happen at effectively the same time as before; the RCU grace period
just potentially happens a little later in the timeline.
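
For reference, the rcu_users side of this, introduced in the previous
patch of the series, is roughly:

	void put_task_struct_rcu_user(struct task_struct *task)
	{
		if (refcount_dec_and_test(&task->rcu_users))
			call_rcu(&task->rcu, delayed_put_task_struct);
	}

where delayed_put_task_struct drops the remaining task->usage
reference once a grace period has elapsed.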

In the common case of a process being reaped after it leaves the
runqueue, everything will happen exactly as before.

In the case where a task self-reaps we are pretty much guaranteed
that the RCU grace period is delayed. A normal threaded workload
therefore gives quite a bit of coverage of this worst case for the
change, so I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Inspired-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
---
kernel/fork.c | 11 +++++++----
kernel/sched/core.c | 7 ++++---
2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f04741d5c70..7a74ade4e7d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
if (orig->cpus_ptr == &orig->cpus_mask)
tsk->cpus_ptr = &tsk->cpus_mask;

- /* One for the user space visible state that goes away when reaped. */
- refcount_set(&tsk->rcu_users, 1);
- /* One for the rcu users, and one for the scheduler */
- refcount_set(&tsk->usage, 2);
+ /*
+ * One for the user space visible state that goes away when reaped.
+ * One for the scheduler.
+ */
+ refcount_set(&tsk->rcu_users, 2);
+ /* One for the rcu users */
+ refcount_set(&tsk->usage, 1);
#ifdef CONFIG_BLK_DEV_IO_TRACE
tsk->btrace_seq = 0;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..802958407369 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
/* Task is done with its stack. */
put_task_stack(prev);

- put_task_struct(prev);
+ put_task_struct_rcu_user(prev);
}

tick_nohz_task_switch();
@@ -3857,7 +3857,7 @@ static void __sched notrace __schedule(bool preempt)

if (likely(prev != next)) {
rq->nr_switches++;
- rq->curr = next;
+ rcu_assign_pointer(rq->curr, next);
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
@@ -5863,7 +5863,8 @@ void init_idle(struct task_struct *idle, int cpu)
__set_task_cpu(idle, cpu);
rcu_read_unlock();

- rq->curr = rq->idle = idle;
+ rq->idle = idle;
+ rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP
idle->on_cpu = 1;
--
2.21.0.dirty