Re: bcache_writeback: bch_writeback_thread...blk_queue_bio IO hang [was BUG: soft lockup]

From: Eric Wheeler
Date: Mon Feb 01 2016 - 14:42:55 EST


[ changed subject, more below ]

On Fri, 29 Jan 2016, Johannes Thumshirn wrote:

> [ +cc Kent ]
>
> On Wed, Jan 27, 2016 at 02:57:25PM +0000, Yannis Aribaud wrote:
> > Hi,
> >
> > After several weeks using the 4.2.6 kernel + patches from Ewheeler we just ran into a crash again.
> > This time the kernel was still running and the server was responsive but not able to do any IO on the bcache devices.
> >
> > [696983.683498] bcache_writebac D ffffffff810643df 0 5741 2 0x00000000
> > [696983.683505] ffff88103d01f180 0000000000000046 ffff88107842d000 ffffffff811a95cd
> > [696983.683510] 0000000000000000 ffff8810388c4000 ffff88103d01f180 0000000000000001
> > [696983.683514] ffff882034ae0c10 0000000000000000 ffff882034ae0000 ffffffff8139601e
> > [696983.683518] Call Trace:
> > [696983.683530] [<ffffffff811a95cd>] ? blk_queue_bio+0x262/0x279
> > [696983.683539] [<ffffffff8139601e>] ? schedule+0x6b/0x78
> > [696983.683553] [<ffffffffa032ce9b>] ? closure_sync+0x66/0x91 [bcache]
> > [696983.683563] [<ffffffffa033c89f>] ? bch_writeback_thread+0x622/0x6b5 [bcache]
> > [696983.683569] [<ffffffff8100265c>] ? __switch_to+0x1de/0x3f7
> > [696983.683578] [<ffffffffa033c89f>] ? bch_writeback_thread+0x622/0x6b5 [bcache]
> > [696983.683586] [<ffffffffa033c27d>] ? write_dirty_finish+0x1bf/0x1bf [bcache]
> > [696983.683594] [<ffffffff810589d6>] ? kthread+0x99/0xa1
> > [696983.683598] [<ffffffff8105893d>] ? kthread_parkme+0x16/0x16
> > [696983.683603] [<ffffffff813986df>] ? ret_from_fork+0x3f/0x70
> > [696983.683607] [<ffffffff8105893d>] ? kthread_parkme+0x16/0x16
> >
> > Don't know if this help.
> > Unfortunately I thing that we will rollback and stop using Bcache unless this is really fixed :/
> >
>
> Hi Yannis,
>
> Do you have a machine with a bcache setup running where you can reproduce the
> error? Or do you know a method to reproduce the error?
>
> What I'd be interested in is which locks are held when it locks up (you can
> acquire this information with SysRq+d or echo d > /proc/sysrq-trigger.
>
> Kent, do you have an idea what's happening here?

Are you using md-based raid5/raid6?

If so, it could be the raid_partial_stripes_expensive bug.

If it *is* the bcache optimizations around partial_stripes_expensive, then
please try the patch below and then do this *before* loading the bcache
module (or at least before registering the bcache backing volume):
echo 0 > /sys/block/BLK/queue/limits/raid_partial_stripes_expensive
where BLK is your backing volume.

I wrote this patch because we use hardware RAID5/6 and wanted to get the
partial_stripes_expensive optimizations on by setting
raid_partial_stripes_expensive=1 and io_opt=our_stride_width.
Unfortunately it caused the backtrace in the LKML thread below, so we
stopped using it.

See also this thread, however, it shows a backtrace prior to
removing the bio splitting code:
https://lkml.org/lkml/2016/1/7/844

Note that commit 749b61dab30736eb95b1ee23738cae90973d4fc3 might not
exactly address this issue, but it might prevent full hangs. Make sure
you cherry-pick commit 749b61dab30736eb95b1ee23738cae90973d4fc3 and
hand-clean-up as necessary. It simplifies the bio code and deletes a bunch
of stuff. It showed up in 4.4 and isn't in my patchset.

After commit the commit above and setting
raid_partial_stripes_expensive=1
we still get errors like this:
bcache: bch_count_io_errors() dm-3: IO error on writing data to cache,
recovering but they don't lock up the system. Ultimately we run with
raid_partial_stripes_expensive=0 because of these related problemsand
haven't had any issues. Also, fwiw, we run 4.1.y as our stable branch
backed by hardware raid.

See my queue_limits patch below which exposes
<struct request_queue>::limits
via sysfs.

If anyone does have a fix for this, I would love to see it---because then
we could use the bcache raid_partial_stripes_expensive/opt_io
optimizations!

-Eric


diff --git a/block/blk-core.c b/block/blk-core.c
index 03b5f8d..54006aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -642,6 +642,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);

kobject_init(&q->kobj, &blk_queue_ktype);
+ kobject_init(&q->ql_kobj, &blk_queue_limits_ktype);

mutex_init(&q->sysfs_lock);
spin_lock_init(&q->__queue_lock);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 2b8fd30..9c29b44 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -430,6 +430,119 @@ static struct attribute *default_attrs[] = {
NULL,
};

+/* First we define a show/store function generator
+ * for `struct queue_limits` */
+#define QL_STORE_FUNC_NAME(name) queue_limits_store_##name
+#define QL_STORE_FUNC(name, type) \
+ static ssize_t \
+ QL_STORE_FUNC_NAME(name) (struct request_queue *q, const char *page, size_t count)\
+ { \
+ unsigned long val; \
+ ssize_t ret = queue_var_store(&val, page, count); \
+ if (ret < 0) \
+ return ret; \
+ q->limits.name = (type) val; \
+ return ret; \
+ }
+
+#define QL_SHOW_FUNC_NAME(name) queue_limits_show_##name
+#define QL_SHOW_FUNC(name, type) \
+ static ssize_t \
+ QL_SHOW_FUNC_NAME(name) (struct request_queue *q, char *page) \
+ { \
+ return queue_var_show((unsigned long)q->limits.name, (page)); \
+ }
+
+
+/* Next, the struct generator that defines the
+ * structs for the queue_limits sysfs interface */
+#define QL_SYSFS_STRUCT_NAME(name) queue_limits_struct_##name
+#define QL_SYSFS_STRUCT(__name, __mode, __show, __store) \
+ static struct queue_sysfs_entry QL_SYSFS_STRUCT_NAME(__name) = {\
+ .attr = {.name = #__name, .mode = __mode }, \
+ .show = __show, \
+ .store = __store, \
+ };
+
+/* Define common read-only/read-write with functions and mode flags: */
+#define QL_SYSFS_RO(name, type) \
+ QL_SHOW_FUNC(name, type); \
+ QL_SYSFS_STRUCT(name, S_IRUGO, QL_SHOW_FUNC_NAME(name), NULL)
+
+#define QL_SYSFS_RW(name, type) \
+ QL_SHOW_FUNC(name, type); \
+ QL_STORE_FUNC(name, type); \
+ QL_SYSFS_STRUCT(name, S_IRUGO | S_IWUSR, \
+ QL_SHOW_FUNC_NAME(name), QL_STORE_FUNC_NAME(name));
+
+
+QL_SYSFS_RO(bounce_pfn, unsigned long);
+QL_SYSFS_RO(seg_boundary_mask, unsigned long);
+
+QL_SYSFS_RO(max_hw_sectors, unsigned int);
+QL_SYSFS_RO(chunk_sectors, unsigned int);
+QL_SYSFS_RO(max_sectors, unsigned int);
+QL_SYSFS_RO(max_segment_size, unsigned int);
+QL_SYSFS_RO(physical_block_size, unsigned int);
+QL_SYSFS_RO(alignment_offset, unsigned int);
+QL_SYSFS_RO(io_min, unsigned int);
+QL_SYSFS_RW(io_opt, unsigned int);
+QL_SYSFS_RO(max_discard_sectors, unsigned int);
+QL_SYSFS_RO(max_write_same_sectors, unsigned int);
+QL_SYSFS_RO(discard_granularity, unsigned int);
+QL_SYSFS_RO(discard_alignment, unsigned int);
+
+QL_SYSFS_RO(logical_block_size, unsigned short);
+QL_SYSFS_RO(max_segments, unsigned short);
+QL_SYSFS_RO(max_integrity_segments, unsigned short);
+
+QL_SYSFS_RO(misaligned, unsigned char);
+QL_SYSFS_RO(discard_misaligned, unsigned char);
+QL_SYSFS_RO(cluster, unsigned char);
+QL_SYSFS_RO(discard_zeroes_data, unsigned char);
+QL_SYSFS_RW(raid_partial_stripes_expensive, unsigned char);
+
+
+/* sysfs attribute list for queue_limits */
+static struct attribute *queue_limits_attrs[] = {
+ &QL_SYSFS_STRUCT_NAME(bounce_pfn).attr,
+ &QL_SYSFS_STRUCT_NAME(seg_boundary_mask).attr,
+
+ &QL_SYSFS_STRUCT_NAME(max_hw_sectors).attr,
+ &QL_SYSFS_STRUCT_NAME(chunk_sectors).attr,
+ &QL_SYSFS_STRUCT_NAME(max_sectors).attr,
+ &QL_SYSFS_STRUCT_NAME(max_segment_size).attr,
+ &QL_SYSFS_STRUCT_NAME(physical_block_size).attr,
+ &QL_SYSFS_STRUCT_NAME(alignment_offset).attr,
+ &QL_SYSFS_STRUCT_NAME(io_min).attr,
+ &QL_SYSFS_STRUCT_NAME(io_opt).attr,
+ &QL_SYSFS_STRUCT_NAME(max_discard_sectors).attr,
+ &QL_SYSFS_STRUCT_NAME(max_write_same_sectors).attr,
+ &QL_SYSFS_STRUCT_NAME(discard_granularity).attr,
+ &QL_SYSFS_STRUCT_NAME(discard_alignment).attr,
+
+ &QL_SYSFS_STRUCT_NAME(logical_block_size).attr,
+ &QL_SYSFS_STRUCT_NAME(max_segments).attr,
+ &QL_SYSFS_STRUCT_NAME(max_integrity_segments).attr,
+
+ &QL_SYSFS_STRUCT_NAME(misaligned).attr,
+ &QL_SYSFS_STRUCT_NAME(discard_misaligned).attr,
+ &QL_SYSFS_STRUCT_NAME(cluster).attr,
+ &QL_SYSFS_STRUCT_NAME(discard_zeroes_data).attr,
+ &QL_SYSFS_STRUCT_NAME(raid_partial_stripes_expensive).attr,
+
+ NULL
+};
+/* cleanup */
+#undef QL_SYSFS_RW
+#undef QL_SYSFS_RO
+#undef QL_SYSFS_STRUCT
+#undef QL_SYSFS_STRUCT_NAME
+#undef QL_SHOW_FUNC
+#undef QL_SHOW_FUNC_NAME
+#undef QL_STORE_FUNC
+#undef QL_STORE_FUNC_NAME
+
#define to_queue(atr) container_of((atr), struct queue_sysfs_entry, attr)

static ssize_t
@@ -474,6 +587,48 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
return res;
}

+static ssize_t
+queue_limits_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
+{
+ struct queue_sysfs_entry *entry = to_queue(attr);
+ struct request_queue *q =
+ container_of(kobj, struct request_queue, ql_kobj);
+ ssize_t res;
+
+ if (!entry->show)
+ return -EIO;
+ mutex_lock(&q->sysfs_lock);
+ if (blk_queue_dying(q)) {
+ mutex_unlock(&q->sysfs_lock);
+ return -ENOENT;
+ }
+ res = entry->show(q, page);
+ mutex_unlock(&q->sysfs_lock);
+ return res;
+}
+
+static ssize_t
+queue_limits_attr_store(struct kobject *kobj, struct attribute *attr,
+ const char *page, size_t length)
+{
+ struct queue_sysfs_entry *entry = to_queue(attr);
+ struct request_queue *q;
+ ssize_t res;
+
+ if (!entry->store)
+ return -EIO;
+
+ q = container_of(kobj, struct request_queue, ql_kobj);
+ mutex_lock(&q->sysfs_lock);
+ if (blk_queue_dying(q)) {
+ mutex_unlock(&q->sysfs_lock);
+ return -ENOENT;
+ }
+ res = entry->store(q, page, length);
+ mutex_unlock(&q->sysfs_lock);
+ return res;
+}
+
static void blk_free_queue_rcu(struct rcu_head *rcu_head)
{
struct request_queue *q = container_of(rcu_head, struct request_queue,
@@ -531,6 +686,17 @@ static const struct sysfs_ops queue_sysfs_ops = {
.store = queue_attr_store,
};

+static const struct sysfs_ops queue_limits_sysfs_ops = {
+ .show = queue_limits_attr_show,
+ .store = queue_limits_attr_store,
+};
+
+struct kobj_type blk_queue_limits_ktype = {
+ .sysfs_ops = &queue_limits_sysfs_ops,
+ .default_attrs = queue_limits_attrs,
+ .release = blk_release_queue,
+};
+
struct kobj_type blk_queue_ktype = {
.sysfs_ops = &queue_sysfs_ops,
.default_attrs = default_attrs,
@@ -573,6 +739,12 @@ int blk_register_queue(struct gendisk *disk)
}

kobject_uevent(&q->kobj, KOBJ_ADD);
+
+ ret = kobject_add(&q->ql_kobj, kobject_get(&q->kobj), "%s", "limits");
+ if (ret < 0)
+ return ret;
+
+ kobject_uevent(&q->ql_kobj, KOBJ_ADD);

if (q->mq_ops)
blk_mq_register_disk(disk);
@@ -605,6 +777,10 @@ void blk_unregister_queue(struct gendisk *disk)
if (q->request_fn)
elv_unregister_queue(q);

+ kobject_uevent(&q->ql_kobj, KOBJ_REMOVE);
+ kobject_del(&q->ql_kobj);
+ kobject_put(&q->kobj);
+
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
diff --git a/block/blk.h b/block/blk.h
index 43b0361..1342e9a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -28,6 +28,7 @@ struct blk_flush_queue {
extern struct kmem_cache *blk_requestq_cachep;
extern struct kmem_cache *request_cachep;
extern struct kobj_type blk_queue_ktype;
+extern struct kobj_type blk_queue_limits_ktype;
extern struct ida blk_queue_ida;

static inline struct blk_flush_queue *blk_get_flush_queue(
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5d93a66..b58fbf3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -401,6 +401,12 @@ struct request_queue {
* mq queue kobject
*/
struct kobject mq_kobj;
+
+ /*
+ * queue_limits kobject
+ */
+ struct kobject ql_kobj;
+

#ifdef CONFIG_PM
struct device *dev;