Re: mm: shmem: NULL ptr deref in shmem_fault

From: Hugh Dickins
Date: Tue May 13 2014 - 18:21:57 EST


Adding Peter and Ingo from the perf_even_mmap thread, and Kirill.

On Mon, 12 May 2014, Sasha Levin wrote:
> On 05/12/2014 05:12 PM, Andrew Morton wrote:
> > On Mon, 12 May 2014 10:26:17 -0400 Sasha Levin <sasha.levin@xxxxxxxxxx> wrote:
> >
> >> Hi all,
> >>
> >> While fuzzing with trinity inside a KVM tools guest running the latest -next
> >> kernel I've stumbled on the following spew.
> >>
> >> It seems that in this case, 'inode->i_mapping' was NULL, and the deref happened
> >> when we tried to get it's flags in mapping_gfp_mask().

Not quite, I think it's just before that: it's got a NULL inode out of
vma->vm_file->f_inode, and is now trying to use that to get i_mapping.

> >>
> >> [ 4431.615828] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
> >> [ 4431.617708] IP: shmem_fault (mm/shmem.c:2960 mm/shmem.c:1236)
> >> [ 4431.621711] PGD 1d7fb5067 PUD 1d7fb4067 PMD 0
> >> [ 4431.621945] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> >> [ 4431.621945] Dumping ftrace buffer:
> >> [ 4431.621945] (ftrace buffer empty)
> >> [ 4431.621945] Modules linked in:
> >> [ 4431.621945] CPU: 1 PID: 20571 Comm: trinity-c61 Not tainted 3.15.0-rc5-next-20140512-sasha-00019-ga20bc00-dirty #456
> >> [ 4431.621945] task: ffff8803333e0000 ti: ffff88032b562000 task.ti: ffff88032b562000
> >> [ 4431.621945] RIP: shmem_fault (mm/shmem.c:2960 mm/shmem.c:1236)
> >> [ 4431.621945] RSP: 0018:ffff88032b5639b8 EFLAGS: 00010296
> >> [ 4431.621945] RAX: ffff88005daab100 RBX: ffff88005db19e00 RCX: 0000000000000001
> >> [ 4431.621945] RDX: 0000000000000001 RSI: ffff88032b5639f8 RDI: 0000000000000000
> >> [ 4431.621945] RBP: ffff88032b5639d8 R08: ffff88032b563a70 R09: ffff88032b5639c4
> >> [ 4431.621945] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801caad2ed0
> >> [ 4431.621945] R13: ffff8801ca27c7e8 R14: 0000000000000000 R15: ffff88005db19e00
> >> [ 4431.621945] FS: 00007f8c3b4ef700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
> >> [ 4431.621945] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> [ 4431.621945] CR2: 0000000000000030 CR3: 00000001d7fb6000 CR4: 00000000000006a0
> >> [ 4431.621945] DR0: 00000000006df000 DR1: 0000000000000000 DR2: 0000000000000000
> >> [ 4431.621945] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> >> [ 4431.621945] Stack:
> >> [ 4431.621945] ffffffff8a2bb583 0000020000000286 ffff88032b563a18 ffff88032b563a70
> >> [ 4431.621945] ffff88032b563a38 ffffffff8a2b83d9 ffff8801e71f8338 00000000ffffffff
> >> [ 4431.621945] ffff880300000000 0000000000000001 00007f8c3b4fd000 0000000000000000
> >> [ 4431.621945] Call Trace:
> >> [ 4431.621945] ? do_read_fault.isra.40 (mm/memory.c:2882)
> >> [ 4431.621945] __do_fault (mm/memory.c:2703)
> >> [ 4431.621945] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> >> [ 4431.621945] do_read_fault.isra.40 (mm/memory.c:2883)
> >> [ 4431.621945] ? get_parent_ip (kernel/sched/core.c:2519)
> >> [ 4431.621945] ? get_parent_ip (kernel/sched/core.c:2519)
> >> [ 4431.621945] __handle_mm_fault (mm/memory.c:3021 mm/memory.c:3182 mm/memory.c:3306)
> >> [ 4431.621945] ? __const_udelay (arch/x86/lib/delay.c:126)
> >> [ 4431.621945] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> >> [ 4431.621945] handle_mm_fault (mm/memory.c:3329)
> >> [ 4431.621945] __get_user_pages (mm/gup.c:281 mm/gup.c:466)
> >> [ 4431.621945] ? preempt_count_sub (kernel/sched/core.c:2575)
> >> [ 4431.621945] get_user_pages (mm/gup.c:632)
> >> [ 4431.621945] get_user_pages_fast (arch/x86/mm/gup.c:394)
> >> [ 4431.621945] vmsplice_to_pipe (fs/splice.c:1487 fs/splice.c:1607)
> >> [ 4431.621945] ? page_cache_pipe_buf_release (fs/splice.c:267)
> >> [ 4431.621945] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
> >> [ 4431.621945] ? sched_clock (arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:305)
> >> [ 4431.621945] ? sched_clock_local (kernel/sched/clock.c:214)
> >> [ 4431.621945] ? vtime_account_user (kernel/sched/cputime.c:687)
> >> [ 4431.621945] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> >> [ 4431.621945] ? put_lock_stats.isra.12 (arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> >> [ 4431.621945] ? vtime_account_user (kernel/sched/cputime.c:687)
> >> [ 4431.621945] ? get_parent_ip (kernel/sched/core.c:2519)
> >> [ 4431.621945] ? get_parent_ip (kernel/sched/core.c:2519)
> >> [ 4431.621945] ? __fget_light (include/linux/rcupdate.h:428 include/linux/fdtable.h:80 fs/file.c:684)
> >> [ 4431.621945] SyS_vmsplice (fs/splice.c:1650 fs/splice.c:1636)
> >> [ 4431.621945] tracesys (arch/x86/kernel/entry_64.S:746)
> >> [ 4431.621945] Code: 66 66 66 90 55 b9 01 00 00 00 48 89 e5 53 48 89 fb 4c 8d 4d ec 48 83 ec 18 c7 45 ec 00 02 00 00 48 8b 87 a0 00 00 00 48 8b 78 20 <48> 8b 57 30 4c 8b 82 48 01 00 00 48 8d 56 18 48 8b 76 08 41 81
> >> [ 4431.621945] RIP shmem_fault (mm/shmem.c:2960 mm/shmem.c:1236)
> >> [ 4431.621945] RSP <ffff88032b5639b8>
> >> [ 4431.621945] CR2: 0000000000000030
> >
> > OK, how the heck can we get all the way to shmem_fault and then find an
> > inode with no ->mapping? A race, presumably.
> >
> > viro has been mucking with the splice code, but I don't see how that
> > can affect things down here.
> >
> > Stumped.
> >

And the second dump you sent was kernel paging request at ffffffffffffff48
in mpol_shared_policy_lookup(). I don't have your exact config, but I
expect that if you disassemble your shmem_get_policy(), you'll find it
subtracting 0xb8 from inode pointer to locate &SHMEM_I(inode)->policy.

Not that I'll care if it's not exactly 0xb8: we have no particular
reason to expect NULL f_inode once struct file has been freed.

>
> There seems to be a race issue with files going away unexpectedly (ignore
> me blaming perf):
>
> https://lkml.org/lkml/2014/5/12/514
>
> Can it possibly be the same bug?

I haven't delved into the perf_even_mmap d_path (fs/dcache.c:2947) one,
but the Sys_mremap one on file->f_op->f_unmapped_area sounds like what
we have here: struct file has been freed.

I believe Al is innocent: I point a quivering finger at... Kirill.

Just guessing, but we know how fond trinity is of remap_file_pages(),
and comparing old and new emulations shows that interesting

struct file *file = get_file(vma->vm_file);
addr = mmap_region(...);
fput(file);

in mm/fremap.c's old emulation, but no get_file() and fput() around
the do_mmap_pgoff() in mm/mmap.c's new emulation.

Before it puts in the new, do_mmap_pgoff() might unmap the last reference
to vma->vm_file, so emulation needs to take its own reference. I'm not
sure how that plays out nowadays with Al's deferred fput, but it does
look suspicious to me.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/