Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

From: Guoqing Jiang
Date: Wed Mar 15 2023 - 05:30:43 EST




On 3/15/23 11:02, Yu Kuai wrote:


在 2023/03/14 21:55, Guoqing Jiang 写道:


On 3/14/23 21:25, Marc Smith wrote:
On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
<guoqing.jiang@xxxxxxxxxxxxxxx> wrote:
Hi Donald,

On 2/8/21 19:41, Donald Buczek wrote:
Dear Guoqing,

On 08.02.21 15:53, Guoqing Jiang wrote:

On 2/8/21 12:38, Donald Buczek wrote:
5. maybe don't hold reconfig_mutex when try to unregister
sync_thread, like this.

          /* resync has finished, collect result */
          mddev_unlock(mddev);
md_unregister_thread(&mddev->sync_thread);
          mddev_lock(mddev);
As above: While we wait for the sync thread to terminate, wouldn't it
be a problem, if another user space operation takes the mutex?
I don't think other places can be blocked while hold mutex, otherwise
these places can cause potential deadlock. Please try above two lines
change. And perhaps others have better idea.
Yes, this works. No deadlock after >11000 seconds,

(Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
Great. I will send a formal patch with your reported-by and tested-by.

Thanks,
Guoqing
I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
of the patches that supposedly resolve this were applied to the stable
kernels, however, one was omitted due to a regression:
md: don't unregister sync_thread with reconfig_mutex held (upstream
commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
Hi, Guoqing,

Just borrow this thread to discuss, I think this commit might have
problem in some corner cases:

t1:                t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
                md_check_recovery
                 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                 queue_work(md_misc_wq, &mddev->del_work)
   mddev_lock_nointr
   md_reap_sync_thread
   // clear running
 mddev_lock

t3:
md_start_sync
// running is not set

What does 'running' mean? MD_RECOVERY_RUNNING?

Our test report a problem that can be cause by this in theory, by we
can't be sure for now...

I guess you tried to describe racy between

action_store -> md_register_thread

and

md_start_sync -> md_register_thread

Didn't you already fix them in the series?

[PATCH -next 0/5] md: fix uaf for sync_thread

Sorry, I didn't follow the problem and also your series, I might try your
test with latest mainline kernel if the test is available somewhere.

We thought about how to fix this, instead of calling
md_register_thread() here to wait for sync_thread to be done
synchronisely,

IMO, md_register_thread just create and wake a thread, not sure why it
waits for sync_thread.

we do this asynchronously like what md_set_readonly() and  do_md_stop() does.

Still, I don't have clear picture about the problem, so I can't judge it.

Thanks,
Guoqing