Re: [PATCH 3/5] writeback: bdi_writeback iteration must not skip dying ones

From: Jan Kara
Date: Fri Oct 02 2015 - 10:12:20 EST


On Tue 29-09-15 12:47:52, Tejun Heo wrote:
> bdi_for_each_wb() is used in several places to wake up or issue
> writeback work items to all wb's (bdi_writeback's) on a given bdi.
> The iteration is performed by walking bdi->cgwb_tree; however, the
> tree only indexes wb's which are currently active.
>
> For example, when a memcg gets associated with a different blkcg, the
> old wb is removed from the tree so that the new one can be indexed.
> The old wb starts dying from then on but will linger till all its
> inodes are drained. As these dying wb's may still host dirty inodes,
> writeback operations which affect all wb's must include them.
> bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
> failing to sync the inodes belonging to those wb's.
>
> This patch adds a RCU protected @bdi->wb_list which lists all wb's
> beloinging to that bdi. wb's are added on creation and removed on
> release rather than on the start of destruction. bdi_for_each_wb()
> usages are replaced with list_for_each[_continue]_rcu() iterations
> over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Reported-and-tested-by: Artem Bityutskiy <dedekind1@xxxxxxxxx>
> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
> Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@xxxxxxxxx
> ---
> fs/fs-writeback.c | 28 ++++++++++++------
> include/linux/backing-dev-defs.h | 3 ++
> include/linux/backing-dev.h | 63 ----------------------------------------
> mm/backing-dev.c | 17 ++++++++++-
> mm/page-writeback.c | 3 +-
> 5 files changed, 39 insertions(+), 75 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d0da306..afa4848 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -778,19 +778,24 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
> struct wb_writeback_work *base_work,
> bool skip_if_busy)
> {
> - int next_memcg_id = 0;
> - struct bdi_writeback *wb;
> - struct wb_iter iter;
> + struct bdi_writeback *last_wb = NULL;
> + struct bdi_writeback *wb = list_entry_rcu(&bdi->wb_list,
> + struct bdi_writeback, bdi_node);
>
> might_sleep();
> restart:
> rcu_read_lock();
> - bdi_for_each_wb(wb, bdi, &iter, next_memcg_id) {
> + list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
> DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
> struct wb_writeback_work fallback_work;
> struct wb_writeback_work *work;
> long nr_pages;
>
> + if (last_wb) {
> + wb_put(last_wb);
> + last_wb = NULL;
> + }

But you seem to forget to drop last_wb reference in case this was the last
wb in the list, don't you?

> +
> /* SYNC_ALL writes out I_DIRTY_TIME too */
> if (!wb_has_dirty_io(wb) &&
> (base_work->sync_mode == WB_SYNC_NONE ||
> @@ -819,7 +824,14 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>
> wb_queue_work(wb, work);
>
> - next_memcg_id = wb->memcg_css->id + 1;
> + /*
> + * Pin @wb so that it stays on @bdi->wb_list. This allows
> + * continuing iteration from @wb after dropping and
> + * regrabbing rcu read lock.
> + */
> + wb_get(wb);
> + last_wb = wb;
> +
> rcu_read_unlock();
> wb_wait_for_completion(bdi, &fallback_work_done);
> goto restart;
...
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 2df8ddc..c48070b 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -686,6 +691,9 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
> radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
> cgwb_kill(*slot);
>
> + /* wb may get released after @bdi is freed, sever list head */
> + list_del(&bdi->wb_list);
> +

But we wait for bdi->usage_cnt to drop to 0 which means there's no wb,
don't we? What am I missing?

> @@ -764,15 +772,22 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
>
> int bdi_init(struct backing_dev_info *bdi)
> {
> + int ret;
> +
> bdi->dev = NULL;
>
> bdi->min_ratio = 0;
> bdi->max_ratio = 100;
> bdi->max_prop_frac = FPROP_FRAC_BASE;
> INIT_LIST_HEAD(&bdi->bdi_list);
> + INIT_LIST_HEAD(&bdi->wb_list);
> init_waitqueue_head(&bdi->wb_waitq);
>
> - return cgwb_bdi_init(bdi);
> + ret = cgwb_bdi_init(bdi);
> +
> + list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);

Won't this be more logical in cgwb_bdi_init()?

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/