Re: [BUG] racy access to p->mm in membarrier_global_expedited()

From: Paul E. McKenney
Date: Mon Jan 28 2019 - 09:16:06 EST


On Fri, Jan 25, 2019 at 06:26:47PM +0100, Jann Horn wrote:
> membarrier_global_expedited() runs the following code (introduced in
> commit c5f58bd58f43), protected only by an RCU read-side critical
> section and the cpu_hotplug_lock:
>
> p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
> MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
> if (!fallback)
> __cpumask_set_cpu(cpu, tmpmask);
> else
> smp_call_function_single(cpu, ipi_mb, NULL, 1);
> }
>
> p->mm is not protected by either lock. This means that in theory, the
> following races could occur:
>
> 1. If the compiler emitted two separate reads of ->mm, the second read
> of p->mm could return a NULL pointer and crash.
> 2. If the mm is deallocated directly before the atomic_read() occurs,
> the atomic_read() could access a freed pointer (I think?).
>
> Neither of these are particularly likely - looking at the assembly of
> a normal build, the first race doesn't exist because the compiler
> optimizes the second read away, and the second race isn't going to
> cause anything particularly interesting. Still, this should probably
> be fixed...
>
> As far as I can tell, you'll have to either take the task_lock()
> around the "p->mm && (atomic_read(&p->mm->membarrier_state)" or add
> RCU to the lifetime of mm_struct. I'm not entirely sure what the
> better fix is... probably task_lock() makes more sense?

Ouch!!!

Acquiring task_lock() would work, but would be a global lock.
This could be addressed to some extent by batching concurrent
membarrier_global_expedited() invocations, so that one call to
membarrier_global_expedited() does the job for the set of concurrent
calls. The usual approach would use a counter, a pair of wait queues,
and a kthread.

I must defer to the mm guys on adding RCU to the lifetime of mm_struct.

Another approach would be to put the MEMBARRIER_STATE_GLOBAL_EXPEDITED
in the task structure. Yet another approach would be to acquire the
runqueue lock, thus preventing the task from switching away -- except
that it might be in the middle of exit(), so never mind.

Other approaches?

Thanx, Paul

> To test the bug, I patched an extra delay into the code:
>
> ====================
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 3cd8a3a795d2..69cc52039576 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -14,6 +14,7 @@
> * GNU General Public License for more details.
> */
> #include "sched.h"
> +#include <linux/delay.h>
>
> /*
> * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> @@ -81,7 +82,7 @@ static int membarrier_global_expedited(void)
>
> rcu_read_lock();
> p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> - if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
> + if (p && p->mm && (mdelay(100), 1) &&
> (atomic_read(&p->mm->membarrier_state) &
> MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
> if (!fallback)
> __cpumask_set_cpu(cpu, tmpmask);
> ====================
>
> On a kernel with that patch applied, I ran this test code:
>
> ====================
> #define _GNU_SOURCE
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <stdio.h>
> #include <linux/membarrier.h>
> #include <err.h>
>
> int main(void) {
> while (1) {
> printf("executing global expedited barrier...\n");
> int res = syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL_EXPEDITED, 0);
> if (res) err(1, "barrier");
> }
> }
> ====================
>
> That resulted in this splat:
>
> [ 212.697681] ==================================================================
> [ 212.700582] BUG: KASAN: null-ptr-deref in
> membarrier_global_expedited+0x15f/0x220
> [ 212.703346] Read of size 4 at addr 0000000000000378 by task barrier/1177
>
> [ 212.706384] CPU: 1 PID: 1177 Comm: barrier Not tainted 5.0.0-rc3+ #246
> [ 212.708925] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.2-1 04/01/2014
> [ 212.712263] Call Trace:
> [ 212.713177] dump_stack+0x71/0xab
> [ 212.714375] ? membarrier_global_expedited+0x15f/0x220
> [ 212.716236] ? membarrier_global_expedited+0x15f/0x220
> [ 212.718099] kasan_report+0x176/0x192
> [ 212.719445] ? finish_task_switch+0x340/0x3d0
> [ 212.721057] ? membarrier_global_expedited+0x15f/0x220
> [ 212.722921] membarrier_global_expedited+0x15f/0x220
> [ 212.724696] ? ipi_mb+0x10/0x10
> [ 212.725816] ? vfs_write+0x120/0x230
> [ 212.727113] ? __ia32_sys_read+0x50/0x50
> [ 212.728596] __x64_sys_membarrier+0x85/0xf0
> [ 212.730056] do_syscall_64+0x73/0x160
> [ 212.731428] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 212.733236] RIP: 0033:0x7fbe8747e229
> [ 212.734540] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
> 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
> 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3f 4c 2b 00 f7 d8 64 89
> 01 48
> [ 212.741109] RSP: 002b:00007fffcb62a7c8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000144
> [ 212.743831] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbe8747e229
> [ 212.746335] RDX: 00007fbe87475730 RSI: 0000000000000000 RDI: 0000000000000002
> [ 212.748855] RBP: 00007fffcb62a7e0 R08: 00007fffcb62a8c0 R09: 00007fffcb62a8c0
> [ 212.751374] R10: 00007fbe8793c700 R11: 0000000000000202 R12: 0000563ee2ac9610
> [ 212.753842] R13: 00007fffcb62a8c0 R14: 0000000000000000 R15: 0000000000000000
> [ 212.756305] ==================================================================
>