[PATCH v2 5/6] blk-mq: fix freeze queue race

From: Akinobu Mita
Date: Thu Jul 02 2015 - 10:31:23 EST


There are several race conditions while freezing the queue.

When unfreezing the queue, there is a small window between decrementing
q->mq_freeze_depth to zero and calling percpu_ref_reinit() on
q->mq_usage_counter. If another context calls
blk_mq_freeze_queue_start() in that window, q->mq_freeze_depth is
increased from zero to one and percpu_ref_kill() is called on
q->mq_usage_counter, which is already killed. The percpu refcount must
be re-initialized before it is killed again.
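
The window described above can be replayed deterministically in
userspace. This is only a toy model, not kernel code: the race_*
names are hypothetical, and the struct tracks just the freeze depth
and a "killed" flag standing in for q->mq_freeze_depth and the state
of q->mq_usage_counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the parts of struct request_queue involved here. */
struct race_queue {
	int freeze_depth;	/* models q->mq_freeze_depth */
	bool ref_dead;		/* models "mq_usage_counter is killed" */
	int double_kills;	/* kills applied to an already-killed ref */
};

static void race_kill(struct race_queue *q)
{
	if (q->ref_dead)
		q->double_kills++;	/* the situation the patch wants to avoid */
	q->ref_dead = true;
}

static void race_freeze_start(struct race_queue *q)
{
	if (++q->freeze_depth == 1)
		race_kill(q);
}

/* First half of unfreeze: returns true if the caller still owes a reinit. */
static bool race_unfreeze_dec(struct race_queue *q)
{
	return --q->freeze_depth == 0;
}

/* Second half of unfreeze: the reinit that should have followed at once. */
static void race_unfreeze_reinit(struct race_queue *q)
{
	q->ref_dead = false;
}

/* Replays the unlucky interleaving; returns the number of double kills. */
static int race_replay(void)
{
	struct race_queue q = { .freeze_depth = 0, .ref_dead = false,
				.double_kills = 0 };
	bool owes_reinit;

	race_freeze_start(&q);			/* frozen: depth 1, ref killed */
	owes_reinit = race_unfreeze_dec(&q);	/* A: depth 1 -> 0 */
	race_freeze_start(&q);			/* B: depth 0 -> 1, kills again */
	if (owes_reinit)
		race_unfreeze_reinit(&q);	/* A: reinit arrives too late */

	return q.double_kills;
}
```

In this model race_replay() reports one kill of an already-killed
ref, and the queue ends with a non-zero freeze depth but a live ref.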

Also, there is a race condition while switching to percpu mode.
percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
executed at the same time as the following scenario is possible:

1. q->mq_usage_counter is initialized in atomic mode.
(atomic counter: 1)

2. After the disk registration, a process like systemd-udev starts
accessing the disk, and successfully increases the refcount
by percpu_ref_tryget_live() in blk_mq_queue_enter().
(atomic counter: 2)

3. In the final stage of initialization, q->mq_usage_counter is being
switched to percpu mode by percpu_ref_switch_to_percpu() in
blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
the process is rescheduled in the middle of switching when calling
wait_event() in __percpu_ref_switch_to_percpu().
(atomic counter: 2)

4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
the request queue. q->mq_usage_counter is decreased and marked as
DEAD. The hotplug handler then waits until all requests have finished.
(atomic counter: 1)

5. The process rescheduled in step 3 is resumed and finishes
all remaining work in __percpu_ref_switch_to_percpu().
A bias value is added to the atomic counter of q->mq_usage_counter.
(atomic counter: PERCPU_COUNT_BIAS + 1)

6. A request issued in step 2 is finished and q->mq_usage_counter
is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
so the atomic counter is decreased and no release handler is called.
(atomic counter: PERCPU_COUNT_BIAS)

7. CPU hotplug handling in step 4 will wait forever as
q->mq_usage_counter will never be zero.
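
The counter values in steps 1 to 7 can be checked with a small
single-threaded replay. This is a sketch under assumptions, not the
percpu_ref implementation: struct toy_ref and the toy_* helpers are
hypothetical, and TOY_BIAS merely stands in for PERCPU_COUNT_BIAS
(assumed here to be the top bit of an unsigned long).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumed stand-in for PERCPU_COUNT_BIAS: the top bit of unsigned long. */
#define TOY_BIAS (1UL << (sizeof(unsigned long) * CHAR_BIT - 1))

struct toy_ref {
	unsigned long count;	/* models the atomic counter */
	bool dead;		/* models the DEAD mark */
	bool released;		/* set when the count reaches zero */
};

/* Drop one reference; the release handler only runs at zero. */
static void toy_put(struct toy_ref *r)
{
	if (--r->count == 0)
		r->released = true;
}

/*
 * Replays steps 1-6 in the order given in the scenario and returns
 * the final counter value. *released tells whether the release
 * handler ever ran, which is what step 7 waits for.
 */
static unsigned long toy_replay_steps(bool *released)
{
	/* step 1: initialized in atomic mode, count 1 */
	struct toy_ref r = { .count = 1, .dead = false, .released = false };

	r.count++;		/* step 2: tryget_live succeeds, count 2 */

	/* step 3: the switch to percpu mode sleeps before adding the bias */

	r.dead = true;		/* step 4: kill marks the ref DEAD ... */
	toy_put(&r);		/* ... and drops the initial ref, count 1 */

	r.count += TOY_BIAS;	/* step 5: switch resumes, count BIAS + 1 */

	toy_put(&r);		/* step 6: request finishes, count BIAS */

	*released = r.released;	/* step 7: still false, so the wait never ends */
	return r.count;
}
```

The replay ends with the counter at TOY_BIAS and the release flag
unset, which matches the "wait forever" outcome of step 7.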

Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
at the same time, because both functions could call
__percpu_ref_switch_to_percpu(), which adds the bias value and
initializes the percpu counter.

Fix those races by serializing with a per-queue mutex.
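
As a rough userspace illustration of the locking pattern, the sketch
below holds one mutex around "change the depth, then kill or reinit".
It uses POSIX threads rather than kernel primitives, and the locked_*
names are hypothetical. It does not model waiting for in-flight
requests to drain.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct locked_queue {
	pthread_mutex_t freeze_lock;	/* plays the role of q->mq_freeze_lock */
	int freeze_depth;
	bool ref_dead;
	int violations;	/* kill on a dead ref, or reinit on a live one */
};

static void locked_freeze_start(struct locked_queue *q)
{
	pthread_mutex_lock(&q->freeze_lock);
	if (++q->freeze_depth == 1) {
		if (q->ref_dead)
			q->violations++;
		q->ref_dead = true;	/* models percpu_ref_kill() */
	}
	pthread_mutex_unlock(&q->freeze_lock);
}

static void locked_unfreeze(struct locked_queue *q)
{
	pthread_mutex_lock(&q->freeze_lock);
	if (--q->freeze_depth == 0) {
		if (!q->ref_dead)
			q->violations++;
		q->ref_dead = false;	/* models percpu_ref_reinit() */
	}
	pthread_mutex_unlock(&q->freeze_lock);
}

static void *locked_worker(void *arg)
{
	struct locked_queue *q = arg;

	for (int i = 0; i < 100000; i++) {
		locked_freeze_start(q);
		locked_unfreeze(q);
	}
	return NULL;
}

/* Runs two freezers concurrently; returns the number of violations. */
static int locked_run(void)
{
	struct locked_queue q = { .freeze_depth = 0, .ref_dead = false,
				  .violations = 0 };
	pthread_t t[2];

	pthread_mutex_init(&q.freeze_lock, NULL);
	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, locked_worker, &q);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	pthread_mutex_destroy(&q.freeze_lock);

	return q.violations;
}
```

Because the depth update and the matching kill or reinit happen under
the same lock, the flag and the depth always agree whenever the lock
is released, so locked_run() returns zero violations.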

Signed-off-by: Akinobu Mita <akinobu.mita@xxxxxxxxx>
Cc: Jens Axboe <axboe@xxxxxxxxx>
Cc: Ming Lei <tom.leiming@xxxxxxxxx>
---
block/blk-core.c | 1 +
block/blk-mq-sysfs.c | 2 ++
block/blk-mq.c | 8 ++++++++
include/linux/blkdev.h | 6 ++++++
4 files changed, 17 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index bbf67cd..f3c5ae2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -687,6 +687,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
+	mutex_init(&q->mq_freeze_lock);
 	mutex_init(&q->mq_sysfs_lock);
 
 	if (blkcg_init_queue(q))
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 79a3e8d..8448513 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -413,7 +413,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 /* see blk_register_queue() */
 void blk_mq_finish_init(struct request_queue *q)
 {
+	mutex_lock(&q->mq_freeze_lock);
 	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+	mutex_unlock(&q->mq_freeze_lock);
 }
 
 int blk_mq_register_disk(struct gendisk *disk)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ad07373..f31de35 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -115,11 +115,15 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
+	mutex_lock(&q->mq_freeze_lock);
+
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->mq_usage_counter);
 		blk_mq_run_hw_queues(q, false);
 	}
+
+	mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
@@ -143,12 +147,16 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 {
 	int freeze_depth;
 
+	mutex_lock(&q->mq_freeze_lock);
+
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->mq_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
+
+	mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c56f5a6..0bf8bea 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -457,6 +457,12 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
+	/*
+	 * Protect concurrent access to mq_usage_counter by
+	 * percpu_ref_switch_to_percpu(), percpu_ref_kill(), and
+	 * percpu_ref_reinit().
+	 */
+	struct mutex		mq_freeze_lock;
 	struct percpu_ref	mq_usage_counter;
 	struct list_head	all_q_node;
 
--
1.9.1
