Re: 3.14-rc2 XFS backtrace because irqs_disabled.

From: Dave Chinner
Date: Wed Feb 12 2014 - 02:18:41 EST


On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > None of the XFS code disables interrupts in that path, nor does it
> > call outside XFS except to dispatch IO. The stack is pretty deep at
> > this point and I know that the standard (non stacked) IO stack can
> > consume >3kb of stack space when it gets down to having to do memory
> > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > drivers. Stack overruns typically show up with symptoms like we are
> > seeing.
>
> That would also explain why it shows up for do_coredump(), even though
> clearly interrupts are not disabled at that point. It's just because
> do_coredump() opens a filename at a deeper point in the stack than the
> more normal system call paths do.

Right, it's exactly the same problem we have when knfsd is making
the VFS calls.

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstances (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

Yup. But it's when you see this sort of thing that you realise almost
no GFP_KERNEL memory allocation is really safe from stack overruns:

0) 6064 64 update_group_power+0x2c/0x270
1) 6000 384 find_busiest_group+0x10a/0x8b0
2) 5616 288 load_balance+0x165/0x870
3) 5328 96 idle_balance+0x106/0x1b0
4) 5232 112 __schedule+0x7b6/0x7e0
5) 5120 16 schedule+0x29/0x70
6) 5104 176 percpu_ida_alloc+0x1b3/0x3d0
7) 4928 32 blk_mq_wait_for_tags+0x1f/0x40
8) 4896 80 blk_mq_alloc_request_pinned+0x4e/0x110
9) 4816 128 blk_mq_make_request+0x42b/0x510
10) 4688 48 generic_make_request+0xc0/0x110
11) 4640 96 submit_bio+0x71/0x120
12) 4544 192 _xfs_buf_ioapply+0x2cc/0x3b0
13) 4352 48 xfs_buf_iorequest+0x6f/0xc0
14) 4304 32 xlog_bdstrat+0x23/0x60
15) 4272 96 xlog_sync+0x3a4/0x5c0
16) 4176 48 xlog_state_release_iclog+0x121/0x130
17) 4128 192 xlog_write+0x700/0x7c0
18) 3936 192 xlog_cil_push+0x2ae/0x3d0
19) 3744 128 xlog_cil_force_lsn+0x1b8/0x1e0
20) 3616 144 _xfs_log_force_lsn+0x7c/0x300
21) 3472 48 xfs_log_force_lsn+0x3b/0xa0
22) 3424 112 xfs_iunpin_wait+0xd7/0x160
23) 3312 80 xfs_reclaim_inode+0xd0/0x350
24) 3232 432 xfs_reclaim_inodes_ag+0x277/0x3d0
25) 2800 48 xfs_reclaim_inodes_nr+0x33/0x40
26) 2752 16 xfs_fs_free_cached_objects+0x15/0x20
27) 2736 80 super_cache_scan+0x169/0x170
28) 2656 160 shrink_slab_node+0x133/0x290
29) 2496 80 shrink_slab+0x122/0x160
30) 2416 112 do_try_to_free_pages+0x1de/0x360
31) 2304 192 try_to_free_pages+0x110/0x190
32) 2112 256 __alloc_pages_nodemask+0x5a2/0x8e0
33) 1856 80 alloc_pages_current+0xb2/0x170
34) 1776 64 new_slab+0x265/0x2e0
35) 1712 240 __slab_alloc+0x2fb/0x4c4
36) 1472 80 kmem_cache_alloc+0x10b/0x140

That's almost 4700 bytes of stack usage from the
kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
Yes, it's an extreme case, but if you're looking for a smoking
gun.... :/
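
To make the mechanism concrete, here's a purely illustrative sketch
(foo_item/foo_alloc are made-up names, not code from any tree): the
gfp mask is what decides whether an allocation is even allowed to
recurse into filesystem reclaim the way the trace above does.

/*
 * Illustrative only: a GFP_KERNEL allocation may block and enter
 * direct reclaim, which can descend through shrink_slab() and
 * super_cache_scan() into XFS inode reclaim and log IO - the deep
 * call chain in the trace above. GFP_NOFS still reclaims page cache
 * but must not re-enter filesystem code, so its worst case is much
 * shallower.
 */
#include <linux/slab.h>
#include <linux/gfp.h>

struct foo_item {
	int	id;
	char	payload[64];
};

static struct foo_item *foo_alloc(bool in_fs_context)
{
	gfp_t flags = in_fs_context ? GFP_NOFS : GFP_KERNEL;

	return kmalloc(sizeof(struct foo_item), flags);
}

The point being that the caller of kmem_cache_alloc() in frame 36 had
no way of knowing it was signing up for the rest of that call chain.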

I can fix this one easily - we already have a workqueue for doing
async log pushes (it will split the stack between xlog_cil_force_lsn
and xlog_cil_push), but the reason we haven't used it for synchronous
log forces is that it screws up fsync performance on CFQ. We don't
recommend CFQ with XFS anyway, so I think I'll make this change
regardless.
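
For anyone wondering what that looks like, here's a rough sketch of
the stack-split idea using the stock workqueue API - the names
(log_force_ctx, log_force_worker, etc.) are invented for illustration
and this is not the actual XFS patch:

#include <linux/workqueue.h>
#include <linux/completion.h>

struct log_force_ctx {
	struct work_struct	work;
	struct completion	done;
	/* ... whatever state the log push needs ... */
};

static void log_force_worker(struct work_struct *work)
{
	struct log_force_ctx *ctx =
		container_of(work, struct log_force_ctx, work);

	/*
	 * The deep part of the call chain runs here, on the worker's
	 * fresh stack, not on the caller's.
	 */
	/* do_the_deep_log_push(ctx); */

	complete(&ctx->done);
}

static void log_force_sync(struct workqueue_struct *wq)
{
	struct log_force_ctx ctx;

	INIT_WORK_ONSTACK(&ctx.work, log_force_worker);
	init_completion(&ctx.done);

	queue_work(wq, &ctx.work);

	/* Caller sleeps with its stack unwound back to here. */
	wait_for_completion(&ctx.done);
	destroy_work_on_stack(&ctx.work);
}

If something like this sits anywhere under memory reclaim, the
workqueue would need to be allocated with WQ_MEM_RECLAIM so the
rescuer thread guarantees forward progress under memory pressure.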

> And some of the debug options that I'm sure DaveJ has enabled tend to
> make the stack usage worse (simply because they make a lot of data
> structures bigger).

True - CONFIG_XFS_DEBUG tends to add about 5% to the stack usage of
XFS, but realistically 5% is not significant, especially as we've
been hitting stack overflows with XFS on production x86-64 systems
regularly enough over the past 2-3 years that we know what the
"typical symptoms" of such overflows look like...

The problem we have now is that everything outside XFS continues to
grow in stack usage, so the only choice left to us for avoiding
overruns is to add performance-impacting stack switches into the
middle of common XFS paths. We also have to do them unconditionally,
because we can't know ahead of time whether a specific operation will
need all the stack we can give it - for example, we can't predict
when the IO path is going to require memory allocation...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx