Re: [PATCH V2] writeback: fix hung_task alarm when sync block

From: Fengguang Wu
Date: Thu Jun 14 2012 - 09:26:51 EST


On Thu, Jun 14, 2012 at 09:35:34AM +0800, Wanpeng Li wrote:
> On Wed, Jun 13, 2012 at 10:48:40PM +0800, Fengguang Wu wrote:
> >Hi Jeff,
> >
> >On Wed, Jun 13, 2012 at 10:27:50AM -0400, Jeff Moyer wrote:
> >> Wanpeng Li <liwp.linux@xxxxxxxxx> writes:
> >>
> >> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >> > index f2d0109..df879ee 100644
> >> > --- a/fs/fs-writeback.c
> >> > +++ b/fs/fs-writeback.c
> >> > @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> >> >
> >> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >> > bdi_queue_work(sb->s_bdi, &work);
> >> > - wait_for_completion(&done);
> >> > + if (sysctl_hung_task_timeout_secs)
> >> > + while (!wait_for_completion_timeout(&done, HZ/2))
> >> > + ;
> >> > + else
> >> > + wait_for_completion(&done);
> >> > }
> >> > EXPORT_SYMBOL(writeback_inodes_sb_nr);
> >>
> >> Is it really expected that writeback_inodes_sb_nr will routinely queue
> >> up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
> >> only entity issuing I/O)?
> >
> >Yes, in the case of syncing the whole superblock.
> >Basically sync() does its job in two steps:
> >
> >for all sb:
> > writeback_inodes_sb_nr() # WB_SYNC_NONE
> > sync_inodes_sb() # WB_SYNC_ALL
> >
> >> For devices that are really slow, it may make
> >> more sense to tune the system so that you don't have too much writeback
> >> I/O submitted at once. Dropping nr_requests for the given queue should
> >> fix this situation, I would think.
> >
> >The worrying case is sync() having to wait
> >
> > (nr_dirty + nr_writeback) / write_bandwidth
> >
> >time, where it is nr_dirty that could grow rather large.
> >
> >For example, if the dirty threshold is 1GB and write_bandwidth is 10MB/s,
> >the sync() will have to wait for 100 seconds. If there are heavy
> >dirtiers running during the sync, it will typically take several
> >hundred seconds (which does not look great, but is still much better
> >than being livelocked, as in some old kernels).
> >
> >> This really feels like we're papering over the problem.
> >
> >That's true. The majority of users probably don't want to cache 100
> >seconds' worth of data in memory. It may be worthwhile to add a new
> >per-bdi limit whose unit is number-of-seconds (of dirty data).
> Hi Fengguang,
>
> Maybe we already have a threshold, "dirty_expire_interval", to ensure
> pages will not stay dirty for more than 30 seconds. Why should we add a
> similar variable? I think the per-bdi flusher will try its best to flush
> dirty pages when woken up; it's just that the backing storage is too slow. :-)

dirty_expire_interval is used to start background writeback; it is not
used for throttling the dirtier task. So a heavy dirtier can still
outrun the flusher thread and push dirty pages all the way up to the
dirty limits.
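
To make that concrete, here is a toy user-space model (not kernel code;
the per-tick rates and the limit are made-up numbers): however early the
flusher is kicked off, it cannot keep up with an unthrottled dirtier, so
nr_dirty keeps climbing until the page-count dirty limit is hit.

#include <stdio.h>

/*
 * Toy model, not kernel code: a dirtier producing pages faster than the
 * flusher can write them out.  Nothing here slows the dirtier down, so
 * nr_dirty grows until the page-count dirty limit is reached.
 */
int main(void)
{
	const unsigned long dirty_rate  = 50;	/* pages dirtied per tick */
	const unsigned long flush_rate  = 10;	/* pages flushed per tick */
	const unsigned long dirty_limit = 1000;	/* page-count limit */
	unsigned long nr_dirty = 0;

	for (int tick = 1; tick <= 100; tick++) {
		nr_dirty += dirty_rate;		/* dirtier runs unthrottled */
		nr_dirty -= (nr_dirty > flush_rate) ? flush_rate : nr_dirty;

		if (nr_dirty >= dirty_limit) {
			printf("tick %d: nr_dirty=%lu, dirty limit hit; "
			       "only now does throttling kick in\n",
			       tick, nr_dirty);
			return 0;
		}
	}
	printf("dirty limit never reached\n");
	return 0;
}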

So to actually keep dirty pages bounded by 30 seconds' worth of data, we
need some time based _limit_. However I'm not sure whether we are going
to do this, because the bandwidth estimation may not be accurate,
especially as it's initialized to an arbitrary 100MB/s for every newly
detected disk, whether it's a slow USB key or a fast enterprise SSD. So
the interface may claim to keep dirty pages under 30s worth of data, yet
the user can still dirty 300s worth of data on a newly plugged USB key.
That makes the interface look silly and fail to live up to user
expectations.
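
To put numbers on that, here is a toy user-space calculation (the 100MB/s
initial estimate and 10MB/s device come from the discussion above; the
rest is illustrative):

#include <stdio.h>

/*
 * Toy model, not kernel code: a time-based dirty limit computed from the
 * estimated write bandwidth.  While the estimate still sits at its
 * arbitrary initial value, a nominal "30s" limit on a slow device admits
 * far more than 30 seconds' worth of dirty data.
 */
int main(void)
{
	const unsigned long long est_bw  = 100ULL << 20; /* initial estimate: 100MB/s */
	const unsigned long long real_bw = 10ULL << 20;	  /* the USB key's real speed: 10MB/s */
	const unsigned int max_dirty_secs = 30;		  /* the advertised limit */

	/* bytes of dirty data the limit would allow */
	unsigned long long limit = est_bw * max_dirty_secs;

	printf("dirty limit: %llu MB, draining in %llu seconds at the real speed\n",
	       limit >> 20, limit / real_bw);
	return 0;
}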

Another worry is that it will break some existing workloads because
the new limit triggers throttling for them earlier than before.

Thanks,
Fengguang