Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

From: Yu Kuai
Date: Wed Jun 14 2023 - 02:05:16 EST

Next message: Daejun Park: "RE: [f2fs-dev] [PATCH] f2fs: do not issue small discard commands during checkpoint"
Previous message: Eric W. Biederman: "Can vhost translate to io_uring?"
In reply to: Xiao Ni: "Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock"
Next in thread: Xiao Ni: "Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

在 2023/06/14 11:47, Xiao Ni 写道:

On Wed, Jun 14, 2023 at 9:48 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

Hi,

在 2023/06/13 22:50, Xiao Ni 写道:

在 2023/5/29 下午9:20, Yu Kuai 写道:

From: Yu Kuai <yukuai3@xxxxxxxxxx>

Our test found a following deadlock in raid10:

1) Issue a normal write, and such write failed:

raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry

// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for updating superblock

Hi Kuai

It looks like the above situation is more complex. It only needs a
normal write and md_write_start needs to

wait until the metadata is written to member disks, right? If so, it
doesn't need to introduce raid10 write failure

here. I guess, it should be your test case. It's nice, if you can put
your test steps in the patch. But for the analysis

of the deadlock here, it's better to be simple.

Test script can be found here, it's pretty easy to trigger:

https://patchwork.kernel.org/project/linux-raid/patch/20230529132826.2125392-4-yukuai1@xxxxxxxxxxxxxxx/

Thanks for this.

While reviewing the related code, I found that io can only be added to
list bio_end_io_list from handle_write_completed() if such io failed, so
I think io failure is needed to trigger deadlock from daemon thread.

I think the key point is how MD_SB_CHANGE_PENDING is set:

1) raid10_error() and rdev_set_badblocks(), trigger by io failure;
2) raid10_write_request() related to reshape;
3) md_write_start() and md_allow_write(), and mddev->in_sync is set,
however, I was thinking this is not a common case;

1) is used here because it's quite easy to trigger and this is what
we meet in real test. 3) is possible but I will say let's keep 1), I
don't think it's necessary to reporduce this deadlock through another
path again.

It makes sense. Let's go back to the first path mentioned in the patch.

1) Issue a normal write, and such write failed:

raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry

This is good.

// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

I have a question here. It should run narrow_write_error in
handle_write_completed. In the test case, will narrow_write_error run
successfully? Or it fails and will call rdev_set_badblocks and
md_error. So MD_RECOVERY_PENDING will be set?

r10_bio will always be added to bio_end_io_list, no matter
narrow_write_error() succeed or not. The dependecy chain 1 here is just
indicate handle this r10_bio will wait for updating super block, it's
not where MD_RECOVERY_PENDING is set...

And MD_RECOVERY_PENDING can be set from narrow_write_error() and other
places where rdev_set_badblocks() is called.

// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))

-> It's here, if the flag is set, bio won't be handled.

list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for updating superblock

It's a little hard to understand. Because it doesn't show how normal
io waits for a superblock update. And based on your last email, I
guess you want to say rdev_set_badblock sets MD_RECOVERY_PENDING, but
the flag can't be cleared, so the bios can't be added to
bio_end_io_list, so the io rquests can't be finished.

It's not that bio can't be added to bio_end_io_list, it's that bio in
this list can't be handled if sb_flags is set.

Thanks,
Kuai

Regards
Xiao

Thanks,
Kuai

2) Trigger a recovery:

raid10_sync_request
raise_barrier

Dependency chain 2: sync thread is waiting for normal io

3) echo idle/frozen to sync_action:

action_store
mddev_lock
md_unregister_thread
kthread_stop

Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread

4) md thread can't update superblock:

raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb

Dependency chain 4: update superblock is waiting for 'reconfig_mutex'

Hence cyclic dependency exist, in order to fix the problem, we must
break one of them. Dependency 1 and 2 can't be broken because they are
foundation design. Dependency 4 may be possible if it can be guaranteed
that no io can be inflight, however, this requires a new mechanism which
seems complex. Dependency 3 is a good choice, because idle/frozen only
requires sync thread to finish, which can be done asynchronously that is
already implemented, and 'reconfig_mutex' is not needed anymore.

This patch switch 'idle' and 'frozen' to wait sync thread to be done
asynchronously, and this patch also add a sequence counter to record how
many times sync thread is done, so that 'idle' won't keep waiting on new
started sync thread.

In the patch, sync_seq is added in md_reap_sync_thread. In
idle_sync_thread, if sync_seq isn't equal

mddev->sync_seq, it should mean there is someone that stops the sync
thread already, right? Why do

you say 'new started sync thread' here?

Regards

Xiao

Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
deadlock can be fixed by this patch as well.

[1]
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@xxxxxxxxxxxxx/T/#t

[2]
https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@xxxxxxxxxxxxxxx/

Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
---
drivers/md/md.c | 23 +++++++++++++++++++----
drivers/md/md.h | 2 ++
2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 63a993b52cd7..7912de0e4d12 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
atomic_set(&mddev->active, 1);
atomic_set(&mddev->openers, 0);
+ atomic_set(&mddev->sync_seq, 0);
spin_lock_init(&mddev->lock);
atomic_set(&mddev->flush_pending, 0);
init_waitqueue_head(&mddev->sb_wait);
@@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
if (work_pending(&mddev->del_work))
flush_workqueue(md_misc_wq);
- if (mddev->sync_thread) {
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev);
- }
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ /*
+ * Thread might be blocked waiting for metadata update which will
now
+ * never happen
+ */
+ md_wakeup_thread_directly(mddev->sync_thread);
mddev_unlock(mddev);
}
static void idle_sync_thread(struct mddev *mddev)
{
+ int sync_seq = atomic_read(&mddev->sync_seq);
+
mutex_lock(&mddev->sync_mutex);
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+
+ wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
+ !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
mutex_unlock(&mddev->sync_mutex);
}
@@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev
*mddev)
mutex_init(&mddev->delete_mutex);
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+
+ wait_event(resync_wait, mddev->sync_thread == NULL &&
+ !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
mutex_unlock(&mddev->sync_mutex);
}
@@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
/* resync has finished, collect result */
md_unregister_thread(&mddev->sync_thread);
+ atomic_inc(&mddev->sync_seq);
+
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2fa903de5bd0..7cab9c7c45b8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -539,6 +539,8 @@ struct mddev {
/* Used to synchronize idle and frozen for action_store() */
struct mutex sync_mutex;
+ /* The sequence number for sync thread */
+ atomic_t sync_seq;
bool has_superblocks:1;
bool fail_last_dev:1;

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel

.

Next message: Daejun Park: "RE: [f2fs-dev] [PATCH] f2fs: do not issue small discard commands during checkpoint"
Previous message: Eric W. Biederman: "Can vhost translate to io_uring?"
In reply to: Xiao Ni: "Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock"
Next in thread: Xiao Ni: "Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]