Re: [PATCH 3/3] introduce task_rcu_dereference()

From: Peter Zijlstra
Date: Wed May 18 2016 - 15:10:56 EST


On Wed, May 18, 2016 at 08:23:18PM +0200, Oleg Nesterov wrote:
> IOW, we can never know whether we have garbage in "sighand" or the real value,
> this task_struct can be freed/reallocated when we do probe_slab_address().
>
> And this is fine. We re-check that "task == *ptask" after that. Now we have
> two different cases:
>
> 1. This is actually the same task/task_struct. In this case
> sighand != NULL tells us it is still alive.
>
> 2. This is another task which got the same memory for task_struct.
> We can't know this of course, and we can not trust sighand != NULL.
>
> In this case we actually return a random value, but this is correct.
>
> If we return NULL - we can pretend that we actually noticed that
> *ptask was updated when the previous task has exited. Or pretend
> that probe_slab_address(&sighand) reads NULL.
>
> If we return the new task (because sighand is not NULL for any
> reason) - this is fine too. This (new) task can't go away before
> another gp pass.
>
> And please note again the "We could even eliminate the false positive"
> comment above (hmm, it should probably say false negative). We could
> re-read task->sighand once again to avoid the falsely NULL.
>
> But this case is very unlikely so I think we do not really care.
>

Ah right, let's stick that in. :-)
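
For anyone following along, the read / probe / re-check sequence
described above can be modelled in user space roughly like this. It is
only a sketch under loud assumptions: struct fake_task and
fake_task_dereference() are made-up names, C11 atomics stand in for
rcu_dereference(), READ_ONCE() and smp_rmb(), and a plain load stands in
for the fault-tolerant probe, since nothing is ever freed here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; only the field the check needs. */
struct fake_task {
	void *_Atomic sighand;	/* NULL once the task has been "released" */
};

/*
 * User-space model of the read / probe / re-check sequence.
 * Returns NULL if the slot is empty or the task looks released.
 */
static struct fake_task *fake_task_dereference(struct fake_task *_Atomic *ptask)
{
	struct fake_task *task;
	void *sighand;

	assert(ptask != NULL);
retry:
	/* Stand-in for rcu_dereference(*ptask). */
	task = atomic_load_explicit(ptask, memory_order_acquire);
	if (!task)
		return NULL;

	/* In the kernel this read may hit freed memory, hence the probe. */
	sighand = atomic_load_explicit(&task->sighand, memory_order_relaxed);

	/* Stand-in for smp_rmb(): order the sighand read before the re-read. */
	atomic_thread_fence(memory_order_acquire);
	if (task != atomic_load_explicit(ptask, memory_order_relaxed))
		goto retry;

	/* Same pointer before and after: trust (or pretend to trust) sighand. */
	return sighand ? task : NULL;
}
```

The model only shows the control flow; the "same memory, different
task" case cannot occur in it because the sketch never frees or reuses
a fake_task.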

OK, something like so then?

---
include/linux/sched.h | 3 ++
kernel/exit.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 29 +++++---------------
3 files changed, 86 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b43b45a22b9..7f90002e9344 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2134,6 +2134,9 @@ static inline void put_task_struct(struct task_struct *t)
__put_task_struct(t);
}

+struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
extern void task_cputime(struct task_struct *t,
cputime_t *utime, cputime_t *stime);
diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195667e1..58d7e05821ae 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -211,6 +211,82 @@ void release_task(struct task_struct *p)
}

/*
+ * Note that if this function returns a valid task_struct pointer (!NULL)
+ * task->usage must remain >0 for the duration of the RCU critical section.
+ */
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+ struct sighand_struct *sighand;
+ struct task_struct *task;
+
+ /*
+ * We need to verify that release_task() was not called and thus
+ * delayed_put_task_struct() can't run and drop the last reference
+ * before rcu_read_unlock(). We check task->sighand != NULL,
+ * but we can read the already freed and reused memory.
+ */
+retry:
+ task = rcu_dereference(*ptask);
+ if (!task)
+ return NULL;
+
+ probe_kernel_address(&task->sighand, sighand);
+
+ /*
+ * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
+ * was already freed we can not miss the preceding update of this
+ * pointer.
+ */
+ smp_rmb();
+ if (unlikely(task != READ_ONCE(*ptask)))
+ goto retry;
+
+ /*
+ * We've re-checked that "task == *ptask", now we have two different
+ * cases:
+ *
+ * 1. This is actually the same task/task_struct. In this case
+ * sighand != NULL tells us it is still alive.
+ *
+ * 2. This is another task which got the same memory for task_struct.
+ * We can't know this of course, and we can not trust
+ * sighand != NULL.
+ *
+ * In this case we actually return a random value, but this is
+ * correct.
+ *
+ * If we return NULL - we can pretend that we actually noticed that
+ * *ptask was updated when the previous task has exited. Or pretend
+ * that probe_kernel_address(&task->sighand) reads NULL.
+ *
+ * If we return the new task (because sighand is not NULL for any
+ * reason) - this is fine too. This (new) task can't go away before
+ * another gp pass.
+ *
+ * And note: we could even eliminate the false negative by re-reading
+ * task->sighand once more to avoid the falsely NULL result. But this
+ * case is very unlikely so we don't care.
+ */
+ if (!sighand)
+ return NULL;
+
+ return task;
+}
+
+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+ struct task_struct *task;
+
+ rcu_read_lock();
+ task = task_rcu_dereference(ptask);
+ if (task)
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ return task;
+}
+
+/*
* Determine if a process group is "orphaned", according to the POSIX
* definition in 2.2.2.52. Orphaned process groups are not to be affected
* by terminal-generated stop signals. Newly orphaned process groups are
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e83db73..1d3a410c481b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1374,30 +1374,15 @@ static void task_numa_compare(struct task_numa_env *env,
int dist = env->dist;
bool assigned = false;

- rcu_read_lock();
-
- raw_spin_lock_irq(&dst_rq->lock);
- cur = dst_rq->curr;
- /*
- * No need to move the exiting task or idle task.
- */
- if ((cur->flags & PF_EXITING) || is_idle_task(cur))
- cur = NULL;
- else {
- /*
- * The task_struct must be protected here to protect the
- * p->numa_faults access in the task_weight since the
- * numa_faults could already be freed in the following path:
- * finish_task_switch()
- * --> put_task_struct()
- * --> __put_task_struct()
- * --> task_numa_free()
- */
- get_task_struct(cur);
+ cur = try_get_task_struct(&dst_rq->curr);
+ if (cur) {
+ if ((cur->flags & PF_EXITING) || is_idle_task(cur)) {
+ put_task_struct(cur);
+ cur = NULL;
+ }
}

- raw_spin_unlock_irq(&dst_rq->lock);
-
+ rcu_read_lock();
/*
* Because we have preemption enabled we can get migrated around and
* end try selecting ourselves (current == env->p) as a swap candidate.