Re: [PATCH v3] md: fix duplicate filename for rdev

From: Yu Kuai
Date: Tue May 23 2023 - 21:33:19 EST


Hi,

On 2023/05/24 2:05, Song Liu wrote:
On Mon, May 22, 2023 at 6:30 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

From: Yu Kuai <yukuai3@xxxxxxxxxx>

Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev; however, this
introduces a window in which the rdev can be added again while the deletion
is not done yet, and sysfs will complain about a duplicate filename.

Follow-up patches tried to fix this problem by flushing the workqueue,
but flush_rdev_wq() is just dead code. The sequence in
md_kick_rdev_from_array() is:

1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);

So in flush_rdev_wq(), if the rdev is found in the list, work_pending()
can never be true; conversely, once the work is queued, the rdev can no
longer be found in the list.

flush_rdev_wq() could be replaced by flush_workqueue() directly; however,
this approach is not good:
- the workqueue is global, so synchronizing across all raid disks like
  this is not necessary.
- flush_workqueue() can't be called under 'reconfig_mutex', and there is
  still a small window between flush_workqueue() and mddev_lock() in which
  other contexts can queue new work, hence the problem is not solved
  completely (see the sketch below).
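To illustrate that remaining window (a rough sketch, not actual md code):

t1:                                     t2:
flush_workqueue(md_rdev_misc_wq)
                                        md_kick_rdev_from_array
                                         queue_work(md_rdev_misc_wq,
                                                    &rdev->del_work)
mddev_lock()
-> the del_work queued by t2 is still
   pending, so the flush did not help.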

sysfs already has APIs that let an entry delete itself from one of its
own attribute writers, and these APIs, specifically
sysfs_break/unbreak_active_protection(), are used to support deleting the
rdev synchronously. Therefore, the above commit can be reverted, and the
sysfs duplicate filename can be avoided.
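For readers unfamiliar with those APIs, the usual pattern looks roughly
like the following minimal sketch ('demo_delete_store' is a made-up
attribute writer, not the actual md code):

static ssize_t demo_delete_store(struct kobject *kobj,
				 struct kobj_attribute *attr,
				 const char *buf, size_t len)
{
	struct kernfs_node *kn;

	/* Drop the active reference that kernfs holds across this
	 * ->store() call, so that deleting our own entry below does
	 * not deadlock waiting for this writer to return.
	 */
	kn = sysfs_break_active_protection(kobj, &attr->attr);

	kobject_del(kobj);	/* remove the sysfs entries synchronously */

	if (kn)
		sysfs_unbreak_active_protection(kn);
	return len;
}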

A new mdadm regression test is proposed as well ([1]).

[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@xxxxxxxxxxxxxxx/
Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>

Thanks for the fix! I made the following changes and applied it
to md-next:

1. remove md_rdev->del_work, which is not used any more;
2. change list_empty_careful to list_empty protected by the mutex, as
list_empty_careful doesn't seem safe here.
Yes, this makes sense; we must make sure the caller won't see a stale
rdev through sysfs:

t1: remove rdev                         t2:
mutex_lock(reconfig_mutex)
list_add
mutex_unlock(reconfig_mutex)
                                        mutex_lock(reconfig_mutex)
                                        mutex_unlock(reconfig_mutex)
mutex_lock(delete_mutex)
list_del_init
                                        list_empty_careful
                                        -> list is empty now, return; the
                                        caller will think the rdev is
                                        removed, however, since
                                        export_rdev() is not called yet,
                                        adding this rdev again will fail.
kobject_del

Holding the mutex is safe, and I think performance should be OK because
removing an rdev is not a hot path. If we don't want to hold a new mutex
in the hot path, perhaps list_empty_careful() can work with the following
change, which removes the rdev from the list only after its sysfs entries
are removed:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 296885798a2b..84dce5822f91 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -758,8 +758,8 @@ static void md_free_rdev(struct mddev *mddev)

 	mutex_lock(&mddev->delete_mutex);
 	list_for_each_entry_safe(rdev, tmp, &mddev->deleting, same_set) {
-		list_del_init(&rdev->same_set);
 		kobject_del(&rdev->kobj);
+		list_del_init(&rdev->same_set);
 		export_rdev(rdev);
 	}
 	mutex_unlock(&mddev->delete_mutex);


Please let me know if either change doesn't make sense.

Thanks,
Song