Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

From: Eric W. Biederman
Date: Fri Aug 30 2019 - 15:36:33 EST


Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Fri, Aug 30, 2019 at 9:10 AM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>>
>>
>> Yes, please see
>>
>> [PATCH 2/3] introduce probe_slab_address()
>> https://lore.kernel.org/lkml/20141027195425.GC11736@xxxxxxxxxx/
>>
>> I sent 5 years ago ;) Do you think
>>
>> /*
>> * Same as probe_kernel_address(), but @addr must be the valid pointer
>> * to a slab object, potentially freed/reused/unmapped.
>> */
>> #ifdef CONFIG_DEBUG_PAGEALLOC
>> #define probe_slab_address(addr, retval) \
>> probe_kernel_address(addr, retval)
>> #else
>> #define probe_slab_address(addr, retval) \
>> ({ \
>> (retval) = *(typeof(retval) *)(addr); \
>> 0; \
>> })
>> #endif
>>
>> can work?
>
> Ugh. I would much rather handle the general case, because honestly,
> tracing has had a lot of issues with our hacky "probe_kernel_read()"
> stuff that bases itself on user addresses.
>
> It's also one of the few remaining users of "set_fs()" in core code,
> and we really should try to get rid of those.
>
> So your code would work for this particular case, but not for other
> cases that can trap simply because the pointer isn't reliable (tracing
> being the main case for that - but if the source of the pointer itself
> might have been free'd, you would also have that situation).
>
> So I'd really prefer to have something a bit fancier. On most
> architectures, doing a good exception fixup for kernel addresses is
> really not that hard.
>
> On x86, for example, we actually have *exactly* that. The
> "__get_user_asm()" macro is basically it. It purely does a load
> instruction from an unchecked address.
>
> (It's a really odd syntax, but you could remove the __chk_user_ptr()
> from the __get_user_size() macro, and now you'd have basically a "any
> regular size kernel access with exception handlng").
>
> But yes, your hack is I guess optimal for this particular case where
> you simply can depend on "we know the pointer was valid, we just don't
> know if it was freed".
>
> Hmm. Don't we RCU-free the task struct? Because then we don't even
> need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
> the pointer as long as we have the RCU read lock. We do that in other
> cases.

Sort of. The rcu delay happens when release_task calls
delayed_put_task_struct. Which unfortunately means that anytime after
exit_notify, release_task can operate on a task. So it is possible
that by the time do_dead_task is called the rcu grace period is up.


Which is the problem the users of task_rcu_dereference are fighting.
They are performing an rcu walk on the set of cups in task_numa_migrate
and in the userspace membarrier system calls.

For a short while we the rcu delay in put_task_struct but that required
changes all of the place and was just a pain to work with.

Then I did:
> commit 8c7904a00b06d2ee51149794b619e07369fcf9d4
> Author: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
> Date: Fri Mar 31 02:31:37 2006 -0800
>
> [PATCH] task: RCU protect task->usage
>
> A big problem with rcu protected data structures that are also reference
> counted is that you must jump through several hoops to increase the reference
> count. I think someone finally implemented atomic_inc_not_zero(&count) to
> automate the common case. Unfortunately this means you must special case the
> rcu access case.
>
> When data structures are only visible via rcu in a manner that is not
> determined by the reference count on the object (i.e. tasks are visible until
> their zombies are reaped) there is a much simpler technique we can employ.
> Simply delaying the decrement of the reference count until the rcu interval is
> over.
>
> What that means is that the proc code that looks up a task and later
> wants to sleep can now do:
>
> rcu_read_lock();
> task = find_task_by_pid(some_pid);
> if (task) {
> get_task_struct(task);
> }
> rcu_read_unlock();
>
> The effect on the rest of the kernel is that put_task_struct becomes cheaper
> and immediate, and in the case where the task has been reaped it frees the
> task immediate instead of unnecessarily waiting an until the rcu interval is
> over.
>
> Cleanup of task_struct does not happen when its reference count drops to
> zero, instead cleanup happens when release_task is called. Tasks can only
> be looked up via rcu before release_task is called. All rcu protected
> members of task_struct are freed by release_task.
>
> Therefore we can move call_rcu from put_task_struct into release_task. And
> we can modify release_task to not immediately release the reference count
> but instead have it call put_task_struct from the function it gives to
> call_rcu.
>
> The end result:
>
> - get_task_struct is safe in an rcu context where we have just looked
> up the task.
>
> - put_task_struct() simplifies into its old pre rcu self.
>
> This reorganization also makes put_task_struct uncallable from modules as
> it is not exported but it does not appear to be called from any modules so
> this should not be an issue, and is trivially fixed.
>
> Signed-off-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>

About a decade later task_struct grew some new rcu users and Oleg
introduced task_rcu_dereference to deal with them:

> commit 150593bf869393d10a79f6bd3df2585ecc20a9bb
> Author: Oleg Nesterov <oleg@xxxxxxxxxx>
> Date: Wed May 18 19:02:18 2016 +0200
>
> sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
>
> Generally task_struct is only protected by RCU if it was found on a
> RCU protected list (say, for_each_process() or find_task_by_vpid()).
>
> As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
> drops the (potentially) last reference without RCU gp, this means
> that we need to fix the code which uses foreign_rq->curr under
> rcu_read_lock().
>
> Add a new helper which can be used to dereference rq->curr or any
> other pointer to task_struct assuming that it should be cleared or
> updated before the final put_task_struct(). It returns non-NULL
> only if this task can't go away before rcu_read_unlock().
>
> ( Also add try_get_task_struct() to make it easier to use this API
> correctly. )

So I think it makes a lot of sense to change how we do this. Either
moving the rcu delay back into put_task_struct or doing halfway like
creating a put_dead_task_struct that will perform the rcu delay after
a task has been taken off the run queues and has stopped being a zombie.

Something like:
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..bf323418094e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,7 @@ static inline void put_task_struct(struct task_struct *t)
__put_task_struct(t);
}

-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_dead_task_struct(struct task_struct *task);

#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..3a85bc2e8031 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,24 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
put_task_struct(tsk);
}

+void put_dead_task_struct(struct task_struct *task)
+{
+ bool delay = false;
+ unsigned long flags;
+
+ /* Is the task both reaped and no longer being scheduled? */
+ raw_spin_lock_irqsave(&task->pi_lock, flags);
+ if ((task->state == TASK_DEAD) &&
+ (cmpxchg(&task->exit_state, EXIT_DEAD, EXIT_RCU) == EXIT_DEAD))
+ delay = true;
+ raw_spin_lock_irqrestore(&task->pi_lock, flags);
+
+ /* If both are true use rcu delay the put_task_struct */
+ if (delay)
+ call_rcu(&task->rcu, delayed_put_task_struct);
+ else
+ put_task_struct(task);
+}

void release_task(struct task_struct *p)
{
@@ -222,76 +240,13 @@ void release_task(struct task_struct *p)

write_unlock_irq(&tasklist_lock);
release_thread(p);
- call_rcu(&p->rcu, delayed_put_task_struct);
+ put_dead_task_struct(p);

p = leader;
if (unlikely(zap_leader))
goto repeat;
}

-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
- struct sighand_struct *sighand;
- struct task_struct *task;
-
- /*
- * We need to verify that release_task() was not called and thus
- * delayed_put_task_struct() can't run and drop the last reference
- * before rcu_read_unlock(). We check task->sighand != NULL,
- * but we can read the already freed and reused memory.
- */
-retry:
- task = rcu_dereference(*ptask);
- if (!task)
- return NULL;
-
- probe_kernel_address(&task->sighand, sighand);
-
- /*
- * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
- * was already freed we can not miss the preceding update of this
- * pointer.
- */
- smp_rmb();
- if (unlikely(task != READ_ONCE(*ptask)))
- goto retry;
-
- /*
- * We've re-checked that "task == *ptask", now we have two different
- * cases:
- *
- * 1. This is actually the same task/task_struct. In this case
- * sighand != NULL tells us it is still alive.
- *
- * 2. This is another task which got the same memory for task_struct.
- * We can't know this of course, and we can not trust
- * sighand != NULL.
- *
- * In this case we actually return a random value, but this is
- * correct.
- *
- * If we return NULL - we can pretend that we actually noticed that
- * *ptask was updated when the previous task has exited. Or pretend
- * that probe_slab_address(&sighand) reads NULL.
- *
- * If we return the new task (because sighand is not NULL for any
- * reason) - this is fine too. This (new) task can't go away before
- * another gp pass.
- *
- * And note: We could even eliminate the false positive if re-read
- * task->sighand once again to avoid the falsely NULL. But this case
- * is very unlikely so we don't care.
- */
- if (!sighand)
- return NULL;
-
- return task;
-}
-
void rcuwait_wake_up(struct rcuwait *w)
{
struct task_struct *task;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..5b697c0572ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
/* Task is done with its stack. */
put_task_stack(prev);

- put_task_struct(prev);
+ put_dead_task_struct(prev);
}

tick_nohz_task_switch();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..c3e1a302211a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
return;

rcu_read_lock();
- cur = task_rcu_dereference(&dst_rq->curr);
+ cur = rcu_dereference(&dst_rq->curr);
if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
cur = NULL;

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..74df8e0dfc84 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
continue;

rcu_read_lock();
- p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+ p = rcu_dereference(&cpu_rq(cpu)->curr);
if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
if (!fallback)
@@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
if (cpu == raw_smp_processor_id())
continue;
rcu_read_lock();
- p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+ p = rcu_dereference(&cpu_rq(cpu)->curr);
if (p && p->mm == current->mm) {
if (!fallback)
__cpumask_set_cpu(cpu, tmpmask);



Eric