On Wed, Jun 14, 2023 at 9:48 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
Hi,
On 2023/06/13 22:50, Xiao Ni wrote:
On 2023/5/29 9:20 PM, Yu Kuai wrote:
From: Yu Kuai <yukuai3@xxxxxxxxxx>
Our test found the following deadlock in raid10:
1) Issue a normal write, and such write failed:
raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry
// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for the superblock update
Hi Kuai
It looks like the situation above is more complex than it needs to be.
A normal write alone should be enough: md_write_start needs to wait
until the metadata is written to the member disks, right? If so, there
is no need to introduce a raid10 write failure here. I guess that is
what your test case does. It would be nice if you could put your test
steps in the patch, but for the analysis of the deadlock here, it's
better to keep it simple.
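To make that concrete, the wait I have in mind in md_write_start() is
roughly this (simplified sketch, quoting from memory):

    /*
     * md_write_start(), simplified: a normal write can't proceed
     * while a superblock update is pending.
     */
    wait_event(mddev->sb_wait,
               !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));

So any path that sets MD_SB_CHANGE_PENDING and then never gets the
superblock written out is enough to block normal io.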
Test script can be found here, it's pretty easy to trigger:
https://patchwork.kernel.org/project/linux-raid/patch/20230529132826.2125392-4-yukuai1@xxxxxxxxxxxxxxx/
Thanks for this.
While reviewing the related code, I found that io can only be added to
the bio_end_io_list from handle_write_completed() if such io failed, so
I think an io failure is needed to trigger the deadlock from the daemon
thread.
I think the key point is how MD_SB_CHANGE_PENDING is set:
1) raid10_error() and rdev_set_badblocks(), triggered by io failure;
2) raid10_write_request(), related to reshape;
3) md_write_start() and md_allow_write(), when mddev->in_sync is set;
however, I don't think this is a common case.
1) is used here because it's quite easy to trigger and it is what we
hit in real tests. 3) is possible, but I would say let's keep 1); I
don't think it's necessary to reproduce this deadlock through another
path again.
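For 1), the flag ends up set roughly like this (simplified sketch of
raid10_error()/rdev_set_badblocks(), from memory):

    /*
     * Simplified: mark that the superblock must be rewritten, and
     * kick the md thread to write it out.
     */
    set_mask_bits(&mddev->sb_flags, 0,
                  BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
    md_wakeup_thread(mddev->thread);

Until md_update_sb() clears MD_SB_CHANGE_PENDING, raid10d() won't
retire anything from bio_end_io_list.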
It makes sense. Let's go back to the first path mentioned in the patch.
1) Issue a normal write, and such write failed:
raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry
This is good.
// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
I have a question here. handle_write_completed() should run
narrow_write_error(). In the test case, will narrow_write_error()
succeed? Or does it fail and call rdev_set_badblocks() and md_error(),
so that MD_SB_CHANGE_PENDING gets set?
-> It's here (see the code quoted below): if the flag is set, the bio won't be handled.
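About narrow_write_error(): in this test it fails, and
handle_write_completed() then does roughly the following (simplified
sketch of raid10.c, from memory):

    /*
     * Simplified: whether narrow_write_error() succeeds (it records
     * bad blocks via rdev_set_badblocks()) or fails (md_error() is
     * called), MD_SB_CHANGE_PENDING ends up set either way.
     */
    fail = true;
    if (!narrow_write_error(r10_bio, m)) {
            md_error(conf->mddev, rdev);
            set_bit(R10BIO_Degraded, &r10_bio->state);
    }
    /* ... and later, since fail is set: */
    list_add(&r10_bio->retry_list, &conf->bio_end_io_list);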
// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for the superblock update
It's a little hard to understand, because it doesn't show how normal
io waits for the superblock update. Based on your last email, I guess
you mean that rdev_set_badblocks() sets MD_SB_CHANGE_PENDING, but the
flag can't be cleared, so the bios on bio_end_io_list can't be handled
and the io requests can't be finished.
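If so, the only place that clears the flag again is md_update_sb(),
roughly (simplified, from memory):

    /*
     * md_update_sb(), simplified: only after the superblock has
     * actually been written to the member disks is the flag cleared
     * and waiting io allowed to continue.
     */
    bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
                     BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_CLEAN));
    wake_up(&mddev->sb_wait);

And that is exactly the update that can't happen in this scenario.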
Regards
Xiao
Thanks,
Kuai
2) Trigger a recovery:
raid10_sync_request
raise_barrier
Dependency chain 2: sync thread is waiting for normal io
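raise_barrier() essentially does the following (heavily simplified
sketch of raid10.c):

    /*
     * Simplified: block any new normal io from starting, then wait
     * for all pending normal io to drain.
     */
    WRITE_ONCE(conf->barrier, conf->barrier + 1);
    wait_event_barrier(conf, !atomic_read(&conf->nr_pending) &&
                             conf->barrier < RESYNC_DEPTH);

so the sync thread can't make progress until the normal io stuck in
chain 1 finishes.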
3) echo idle/frozen to sync_action:
action_store
mddev_lock
md_unregister_thread
kthread_stop
Dependency chain 3: dropping 'reconfig_mutex' waits for the sync thread
4) md thread can't update superblock:
raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb
Dependency chain 4: updating the superblock is waiting for 'reconfig_mutex'
Hence a cyclic dependency exists; in order to fix the problem, we must
break one of the dependencies. Dependencies 1 and 2 can't be broken
because they are fundamental to the design. Breaking dependency 4 may
be possible if it can be guaranteed that no io is inflight, however,
this requires a new mechanism and seems complex. Dependency 3 is a good
choice, because idle/frozen only requires the sync thread to finish,
which can be done asynchronously (the mechanism is already
implemented), and then 'reconfig_mutex' is not needed anymore.
This patch switches 'idle' and 'frozen' to waiting for the sync thread
to be done asynchronously, and also adds a sequence counter to record
how many times the sync thread has finished, so that 'idle' won't keep
waiting on a newly started sync thread.
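Here 'asynchronously' means the reap is left to the md thread;
idle/frozen only wait for it, while md_check_recovery() does roughly
(simplified):

    /* md thread, md_check_recovery(), simplified: */
    if (mddev->sync_thread)
            /* bumps mddev->sync_seq and wakes resync_wait */
            md_reap_sync_thread(mddev);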
In the patch, sync_seq is incremented in md_reap_sync_thread(). In
idle_sync_thread(), if the sampled sync_seq isn't equal to
mddev->sync_seq, it should mean someone has already stopped the sync
thread, right? Why do you say 'newly started sync thread' here?
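To make the question concrete, the interleaving I can imagine is
(hypothetical timeline, written as a comment):

    /*
     * T0: idle_sync_thread() samples sync_seq == 5
     * T1: the running sync thread finishes; md_reap_sync_thread()
     *     bumps sync_seq to 6 and wakes resync_wait
     * T2: a new sync thread is started, MD_RECOVERY_RUNNING is set
     * T3: the wait_event() in idle_sync_thread() still returns,
     *     because 5 != 6, even though a sync thread is running again
     */

Is T3 the 'newly started sync thread' case that the description refers
to?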
Regards
Xiao
Note that raid456 has a similar deadlock ([1]), and it's verified [2]
that it can be fixed by this patch as well.
[1]
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@xxxxxxxxxxxxx/T/#t
[2]
https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@xxxxxxxxxxxxxxx/
Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
---
drivers/md/md.c | 23 +++++++++++++++++++----
drivers/md/md.h | 2 ++
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 63a993b52cd7..7912de0e4d12 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
 	timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
 	atomic_set(&mddev->active, 1);
 	atomic_set(&mddev->openers, 0);
+	atomic_set(&mddev->sync_seq, 0);
 	spin_lock_init(&mddev->lock);
 	atomic_set(&mddev->flush_pending, 0);
 	init_waitqueue_head(&mddev->sb_wait);
@@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
 	if (work_pending(&mddev->del_work))
 		flush_workqueue(md_misc_wq);
 
-	if (mddev->sync_thread) {
-		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-		md_reap_sync_thread(mddev);
-	}
+	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+	/*
+	 * Thread might be blocked waiting for metadata update which will now
+	 * never happen
+	 */
+	md_wakeup_thread_directly(mddev->sync_thread);
 
 	mddev_unlock(mddev);
 }
 
 static void idle_sync_thread(struct mddev *mddev)
 {
+	int sync_seq = atomic_read(&mddev->sync_seq);
+
 	mutex_lock(&mddev->sync_mutex);
 	clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 	stop_sync_thread(mddev);
+
+	wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
+			!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
 	mutex_unlock(&mddev->sync_mutex);
 }
@@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev *mddev)
 	mutex_lock(&mddev->sync_mutex);
 	set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 	stop_sync_thread(mddev);
+
+	wait_event(resync_wait, mddev->sync_thread == NULL &&
+			!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
 	mutex_unlock(&mddev->sync_mutex);
 }
@@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
 	/* resync has finished, collect result */
 	md_unregister_thread(&mddev->sync_thread);
+	atomic_inc(&mddev->sync_seq);
+
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2fa903de5bd0..7cab9c7c45b8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -539,6 +539,8 @@ struct mddev {
 	/* Used to synchronize idle and frozen for action_store() */
 	struct mutex			sync_mutex;
+	/* The sequence number for sync thread */
+	atomic_t			sync_seq;
 
 	bool	has_superblocks:1;
 	bool	fail_last_dev:1;