Re: [PATCH 04/12] writeback: switch to per-bdi threads forflushing data

From: Jens Axboe
Date: Mon May 25 2009 - 08:16:57 EST


On Mon, May 25 2009, Jan Kara wrote:
> On Mon 25-05-09 12:34:10, Jens Axboe wrote:
> > On Mon, May 25 2009, Jan Kara wrote:
> > > Hi Jens,
> > >
> > > I've noticed a few more things:
> > >
> > > On Mon 25-05-09 09:34:39, Jens Axboe wrote:
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 1137408..7cb4d02 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -19,6 +19,8 @@
> > > > #include <linux/sched.h>
> > > > #include <linux/fs.h>
> > > > #include <linux/mm.h>
> > > > +#include <linux/kthread.h>
> > > > +#include <linux/freezer.h>
> > > > #include <linux/writeback.h>
> > > > #include <linux/blkdev.h>
> > > > #include <linux/backing-dev.h>
> > > > @@ -61,10 +63,193 @@ int writeback_in_progress(struct backing_dev_info *bdi)
> > > > */
> > > > static void writeback_release(struct backing_dev_info *bdi)
> > > > {
> > > > - BUG_ON(!writeback_in_progress(bdi));
> > > > + WARN_ON_ONCE(!writeback_in_progress(bdi));
> > > > + bdi->wb_arg.nr_pages = 0;
> > > > + bdi->wb_arg.sb = NULL;
> > > > clear_bit(BDI_pdflush, &bdi->state);
> > > > }
> > > >
> > > > +int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
> > > > + long nr_pages, enum writeback_sync_modes sync_mode)
> > > > +{
> > > > + /*
> > > > + * This only happens the first time someone kicks this bdi, so put
> > > > + * it out-of-line.
> > > > + */
> > > > + if (unlikely(!bdi->task)) {
> > > > + bdi_add_default_flusher_task(bdi);
> > > > + return 1;
> > > > + }
> > > > +
> > > > + if (writeback_acquire(bdi)) {
> > > > + bdi->wb_arg.nr_pages = nr_pages;
> > > > + bdi->wb_arg.sb = sb;
> > > > + bdi->wb_arg.sync_mode = sync_mode;
> > > > + /*
> > > > + * make above store seen before the task is woken
> > > > + */
> > > > + smp_mb();
> > > > + wake_up(&bdi->wait);
> > > > + }
> > > Hmm, wouldn't the interface be more useful if we could just pass down the
> > > whole writeback_control to the flusher threads?
> >
> > Yeah, that's what the later patch does too, when support for > 1 thread
> > has been added.
> Really? I cannot see it in the series. I see that later you introduce
> struct bdi_work but it still contains just nr_pages, sb_data and sync_mode.
> What I meant was we could include 'struct writeback_control' into the bdi
> (or bdi_work later) and just pass it down so that caller can have more
> finegrained control of what writeback the thread is going to perform
> (like nonblocking, older_than_this, maybe something else).

Oh, I don't pass wbc down, but I do pass it async. The problem with
passing down the wbc is that it is allocated on the stack. So if we do
that, then we pretty much have to serialize on that. And I want to avoid
that.

> > > > void generic_sync_sb_inodes(struct super_block *sb,
> > > > struct writeback_control *wbc)
> > > > {
> > > > - const int is_blkdev_sb = sb_is_blkdev_sb(sb);
> > > > - struct backing_dev_info *bdi;
> > > > -
> > > > - mutex_lock(&bdi_lock);
> > > > - list_for_each_entry(bdi, &bdi_list, bdi_list)
> > > > - generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
> > > > - mutex_unlock(&bdi_lock);
> > > > + if (wbc->bdi)
> > > > + generic_sync_bdi_inodes(sb, wbc);
> > > > + else
> > > > + bdi_writeback_all(sb, wbc->nr_to_write, wbc->sync_mode);
> > > Hmm, you write in the changelog that bdi_writeback_all() writes inline
> > > now but AFAICS it still happens through the writeback threads. Which, on a
> > > second though probably *is* what we want because we want writeback to go in
> > > parallel on all devices we have.
> >
> > Note that this is patch 4, later it is sync. But yes, I think it should
> Ah, right. I missed that.
>
> > be async as well, but we do need a way to wait on the submitted work to
> > be processed in case of WB_SYNC_ALL.
> Yes, this is a bit complicated. You'd need some notification bdi_work is
> processed.

It was actually not that hard, I already did it. So right now
wb_writeback() does this:

while (work) {
get args;
clear pending;
do work;
};

If we just modify that to:

while (work) {
get args;
if (sync_mode == WB_SYNC_NONE)
clear pending;
do work;
if (sync_mode == WB_SYNC_ALL)
clear_pending;
};

We get notification when the work has actually been issued. Then we can
easily control whether or not to make a piece of work sync or not.

> > > > * Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
> > > > * the whole world. Returns 0 if a pdflush thread was dispatched. Returns
> > > > * -1 if all pdflush threads were busy.
> > > > */
> > > > -int wakeup_pdflush(long nr_pages)
> > > > +void wakeup_flusher_threads(long nr_pages)
> > > > {
> > > > if (nr_pages == 0)
> > > > nr_pages = global_page_state(NR_FILE_DIRTY) +
> > > > global_page_state(NR_UNSTABLE_NFS);
> > > > - return pdflush_operation(background_writeout, nr_pages);
> > > > + bdi_writeback_all(NULL, nr_pages, WB_SYNC_NONE);
> > > > + return;
> > > > }
> > > ...
> > > > -
> > > > -static void laptop_flush(unsigned long unused)
> > > > -{
> > > > - sys_sync();
> > > > -}
> > > > -
> > > > static void laptop_timer_fn(unsigned long unused)
> > > > {
> > > > - pdflush_operation(laptop_flush, 0);
> > > > + wakeup_flusher_threads(0);
> > > > }
> > > >
> > > > /*
> > > Here you significantly change the behavior of laptop mode - previously it
> > > reliably synced all the data to disk once the "laptop writeout timeout"
> > > elapsed. The idea is that we want to write all the dirty data we have so
> > > that the drive can go to sleep for a longer period.
> > Yeah I know, I wrote the original laptop mode implementation :-)
> Sorry, I wasn't aware of that.
>
> > > Now you changed that to asynchronous WB_SYNC_NONE writeback of some
> > > number of pages. In particular if the disk gets congested, we'll just stop
> > > doing writeback which probably isn't what we want for laptop mode...
> >
> > Congestion doesn't matter, we don't do nonblocking writeout from the
> > threads at all. And it is sync later in the series.
> >
> > I'll switch it back to async in general throughout the series, and add
> > some way of making sure we actually have things submitted before doing
> > the filemap_fdatawait() in generic_sync_sb_inodes(). I think that should
> > fix it.
> OK, looking forward to your patches :)

Fiddling with it now, probably a v7 posting coming up tomorrow. And
thanks again for your great and detailed reviews!

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/