Re: 3.10-rc4 stalls during mmap writes

From: Dave Chinner
Date: Mon Jun 10 2013 - 16:53:30 EST


On Mon, Jun 10, 2013 at 01:45:59PM -0500, Shawn Bohrer wrote:
> On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> > On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> > > So I guess my question is does anyone know why I'm now seeing these
> > > stalls with 3.10?
> >
> > Because we made all metadata updates in XFS fully transactional in
> > 3.4:
> >
> > commit 8a9c9980f24f6d86e0ec0150ed35fba45d0c9f88
> > Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> > Date: Wed Feb 29 09:53:52 2012 +0000
> >
> > xfs: log timestamp updates
> >
> > Timestamps on regular files are the last metadata that XFS does not update
> > transactionally. Now that we use the delaylog mode exclusively and made
> > the log scode scale extremly well there is no need to bypass that code for
> > timestamp updates. Logging all updates allows to drop a lot of code, and
> > will allow for further performance improvements later on.
> >
> > Note that this patch drops optimized handling of fdatasync - it will be
> > added back in a separate commit.
> >
> > Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> > Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
> > Signed-off-by: Ben Myers <bpm@xxxxxxx>
> > $ git describe --contains 8a9c998
> > v3.4-rc1~55^2~23
> >
> > IOWs, you're just lucky you haven't noticed it on 3.4....
> >
> > > Are there any suggestions for how to eliminate them?
> >
> > Nope. You're stuck with it - there's far more places in the page
> > fault path where you can get stuck on the same lock for the same
> > reason - e.g. during block mapping for the newly added pagecache
> > page...
> >
> > Hint: mmap() does not provide -deterministic- low latency access to
> > mapped pages - it is only "mostly low latency".
>
> Hi Dave, I appreciate your time and analysis. I am sadly aware that
> doing file I/O and expecting low latency is a bit of a stretch. This
> is also not the first time I've battled these types of stalls and
> realize that the best I can do is search for opportunities to reduce
> the probability of a stall, or find ways to reduce the duration of a
> stall.
>
> In this case since updating timestamps has been transactional and in
> place since 3.4, it is obvious to me that this is not the cause of
> both the increased rate and duration of the stalls on 3.10. I assure
> you on 3.4 we have very few stalls that are greater than 10
> milliseconds in our normal workload and with 3.10 I'm seeing them
> regularly. Since we know I can, and likely do, hit the same code path
> on 3.4 that tells me that me that the xfs_ilock() is likely being held
> for a longer duration on the new kernel. Let's see if we can
> determine why the lock is held for so long.
>
> Here is an attempt at that, as events unfold in time order. I'm in no
> way a filesystem developer so any input or analysis of what we're
> waiting on is appreciated. There is also multiple kworker threads
> involved which certainly complicates things.
>
>
> It starts with kworker/u49:0 which aquires xfs_ilock() inside
> iomap_write_allocate().
>
>
> kworker/u49:0-15748 [004] 256032.180361: funcgraph_entry: | xfs_iomap_write_allocate() {
> kworker/u49:0-15748 [004] 256032.180363: funcgraph_entry: 0.074 us | xfs_ilock();
>
>
> In the next two chunks it appears that kworker/u49:0 calls
> xfs_bmapi_allocate which offloads that work to kworker/4:0 calling
> __xfs_bmapi_allocate(). kworker/4:0 ends up blocking on xfs_buf_lock().
>
>
> kworker/4:0-27520 [004] 256032.180389: sched_switch: prev_comm=kworker/4:0 prev_pid=27520 prev_prio=120 prev_state=D ==> next_comm=kworker/u49:0 next_pid=15748 next_prio=120
> kworker/4:0-27520 [004] 256032.180393: kernel_stack: <stack trace>
> => schedule (ffffffff814ca379)
> => schedule_timeout (ffffffff814c810d)
> => __down_common (ffffffff814c8e5e)
> => __down (ffffffff814c8f26)
> => down (ffffffff810658e1)
> => xfs_buf_lock (ffffffff811c12a4)
> => _xfs_buf_find (ffffffff811c1469)
> => xfs_buf_get_map (ffffffff811c16e4)
> => xfs_buf_read_map (ffffffff811c2691)
> => xfs_trans_read_buf_map (ffffffff81225fa9)
> => xfs_btree_read_buf_block.constprop.6 (ffffffff811f2242)
> => xfs_btree_lookup_get_block (ffffffff811f22fb)
> => xfs_btree_lookup (ffffffff811f6707)
> => xfs_alloc_ag_vextent_near (ffffffff811d9d52)
> => xfs_alloc_ag_vextent (ffffffff811da8b5)
> => xfs_alloc_vextent (ffffffff811db545)
> => xfs_bmap_btalloc (ffffffff811e6951)
> => xfs_bmap_alloc (ffffffff811e6dee)
> => __xfs_bmapi_allocate (ffffffff811ec024)
> => xfs_bmapi_allocate_worker (ffffffff811ec283)
> => process_one_work (ffffffff81059104)
> => worker_thread (ffffffff8105a1bc)
> => kthread (ffffffff810605f0)
> => ret_from_fork (ffffffff814d395c)

It has blocked waiting for freespace btree IO to complete....

> kworker/u49:0-15748 [004] 256032.180403: sched_switch: prev_comm=kworker/u49:0 prev_pid=15748 prev_prio=120 prev_state=D ==> next_comm=kworker/4:1H next_pid=3921 next_prio=100
> kworker/u49:0-15748 [004] 256032.180408: kernel_stack: <stack trace>
> => schedule (ffffffff814ca379)
> => schedule_timeout (ffffffff814c810d)
> => wait_for_completion (ffffffff814ca2e5)
> => xfs_bmapi_allocate (ffffffff811eea57)
> => xfs_bmapi_write (ffffffff811eef33)
> => xfs_iomap_write_allocate (ffffffff811ce49e)
> => return_to_handler (ffffffff814d3845)
> => xfs_map_blocks (ffffffff811bf145)
> => xfs_vm_writepage (ffffffff811bfbc2)
> => __writepage (ffffffff810f14e7)
> => write_cache_pages (ffffffff810f189e)
> => generic_writepages (ffffffff810f1b3a)
> => xfs_vm_writepages (ffffffff811bef8d)
> => do_writepages (ffffffff810f3380)
> => __writeback_single_inode (ffffffff81166ae5)
> => writeback_sb_inodes (ffffffff81167d4d)
> => __writeback_inodes_wb (ffffffff8116800e)
> => wb_writeback (ffffffff811682bb)
> => wb_check_old_data_flush (ffffffff811683ff)
> => wb_do_writeback (ffffffff81169bd1)
> => bdi_writeback_workfn (ffffffff81169cca)
> => process_one_work (ffffffff81059104)
> => worker_thread (ffffffff8105a1bc)
> => kthread (ffffffff810605f0)
> => ret_from_fork (ffffffff814d395c)

which is during allocation triggered by writeback IO.
> Here I have kworker/2:1H which appears to get its work scheduled from
> a softirq.
>
> <idle>-0 [002] 256032.890594: sched_wakeup: comm=kworker/2:1H pid=2176 prio=100 success=1 target_cpu=002
> <idle>-0 [002] 256032.890598: kernel_stack: <stack trace>
> => ttwu_do_activate.constprop.34 (ffffffff8106c556)
> => try_to_wake_up (ffffffff8106e996)
> => wake_up_process (ffffffff8106ea87)
> => wake_up_worker (ffffffff810568c4)
> => insert_work (ffffffff8105821d)
> => __queue_work (ffffffff81058391)
> => queue_work_on (ffffffff81058605)
> => xfs_buf_ioend (ffffffff811c1a37)
> => _xfs_buf_ioend.constprop.1 (ffffffff811c1aaa)
> => xfs_buf_bio_end_io (ffffffff811c1f58)
> => bio_endio (ffffffff8117362d)
> => blk_update_request (ffffffff8124f1b8)
> => blk_update_bidi_request (ffffffff8124f511)
> => blk_end_bidi_request (ffffffff8124f59c)
> => blk_end_request (ffffffff8124f630)
> => scsi_io_completion (ffffffff8136156f)
> => scsi_finish_command (ffffffff81357e11)
> => scsi_softirq_done (ffffffff813613bf)
> => blk_done_softirq (ffffffff81255fe3)
> => __do_softirq (ffffffff81044230)
> => irq_exit (ffffffff81044505)
> => smp_call_function_single_interrupt (ffffffff81024475)
> => call_function_single_interrupt (ffffffff814d477a)
> => arch_cpu_idle (ffffffff8100ad8e)
> => cpu_startup_entry (ffffffff8107b39e)
> => start_secondary (ffffffff81b259f3)
> <idle>-0 [002] 256032.890603: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/2:1H next_pid=2176 next_prio=100
>

That's physical IO completion occurring, up to the point where XFS
queues it to a different workqueue...

>
> kworker/2:1H calls xfs_buf_unlock() which then wakes up kworker/4:0
>
>
> kworker/2:1H-2176 [002] 256032.890607: sched_wakeup: comm=kworker/4:0 pid=27520 prio=120 success=1 target_cpu=004
> kworker/2:1H-2176 [002] 256032.890609: kernel_stack: <stack trace>
> => ttwu_do_activate.constprop.34 (ffffffff8106c556)
> => try_to_wake_up (ffffffff8106e996)
> => wake_up_process (ffffffff8106ea87)
> => __up (ffffffff814c8dce)
> => up (ffffffff81065762)
> => xfs_buf_unlock (ffffffff811c163d)
> => xfs_buf_iodone_work (ffffffff811c1908)
> => xfs_buf_ioend (ffffffff811c198b)
> => xfs_buf_iodone_callbacks (ffffffff812222b6)
> => xfs_buf_iodone_work (ffffffff811c18be)
> => process_one_work (ffffffff81059104)
> => worker_thread (ffffffff8105a1bc)
> => kthread (ffffffff810605f0)
> => ret_from_fork (ffffffff814d395c)
>
> kworker/4:0 wakes up kworker/u49:0 which which releases xfs_iunlock()
> ending the delay.

And that's the IO completed.

>
> kworker/4:0-27520 [004] 256032.890646: sched_wakeup: comm=kworker/u49:0 pid=15748 prio=120 success=1 target_cpu=006
> kworker/u49:0-15748 [006] 256032.890665: funcgraph_entry: 1.143 us | xfs_iunlock();
> kworker/u49:0-15748 [006] 256032.890666: funcgraph_exit: ! 710304.855 us | }
>
>
> So to summarize it appears that most of the time was spent with
> kworker/4:0 blocking in xfs_buf_lock() and kworker/2:1H, which is woken
> up by a softirq, is the one that calls xfs_buf_unlock(). Assuming I'm
> not missing some important intermediate steps does this provide any
> more information about what resource I'm actually waiting for? Does
> this point to any changes that happened after 3.4? Are there any tips
> that could help minimize these contentions?

The only difference between this and 3.4 is the allocation workqueue
thread. That, however, won't be introducing second long delays. What
you are seeing here is simply a the latency of waiting for
background metadata IO to complete during an allocation which has
the ilock held....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/