Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism

From: Ming Lei
Date: Mon Aug 29 2022 - 04:12:00 EST


On Mon, Aug 29, 2022 at 02:13:12PM +0800, Ziyang Zhang wrote:
> On 2022/8/29 13:40, Ming Lei wrote:
> > On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote:
> >> If one rq is handled by io_uring_cmd_complete_in_task(), after a crash
> >> this rq is actually handled by an io_uring fallback wq. We have to
> >> end(abort) this rq since this fallback wq is a task other than the
> >> crashed task. However, current code does not call io_uring_cmd_done()
> >> at the same time but do it in ublk_cancel_queue(). With current design,
> >> this does work because ublk_cancel_queue() is called AFTER del_gendisk(),
> >> which waits for the rq ended(aborted) in fallback wq. This implies that
> >> fallback wq on this rq is scheduled BEFORE calling io_uring_cmd_done()
> >> on the corresponding ioucmd in ublk_cancel_queue().
> >
> > Right.
> >
> >>
> >> However, while considering recovery feature, we cannot rely on
> >> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
> >> rqs because we may not want any aborted rq. Besides, io_uring does not
> >> provide "flush fallback" machenism so we cannot trace this ioucmd.
> >
> > Why not?
> >
> > If user recovery is enabled, del_gendisk() can be replaced with
> > blk_mq_quiesce_queue(), then let abort work function do:
> >
> > - cancel all in-flight requests by holding them into requeue list
> > instead of finishing them as before, and this way is safe because
> > abort worker does know the ubq daemon is dying
> > - cancel pending commands as before, because the situation is same
> > with disk deleted or queue frozen
>
> The problem is: we cannot control when fallback wq is scheduled.
> So we are unsafe to call io_uring_cmd_done() in another process.

What is the other process?

It can't be fallback wq since any ublk request is aborted at the beginning
of __ublk_rq_task_work().

It shouldn't be the process calling ublk_cancel_dev(), since it is
safe to call io_uring_cmd_done() if ubq->nr_io_ready > 0.

Or others?

> Otherwise, there is a UAF, just as
> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> re-issued request aborted previously to ioucmd's task_work).

As I mentioned, del_gendisk() can be replaced with
blk_mq_quiesce_queue() in case of user recovery, then no any new
request can be queued after blk_mq_quiesce_queue() returns.

>
> Yeah I know the answer is very simple: flush the fallback wq.
> But here are two more questions:

I don't see why we need to flush fallback wq, care to provide some
details?

>
> (1) Should ublk_drv rely on the fallback wq machenism?
> IMO, ublk_drv should not know detail of io_uring_cmd_complete_in_task()
> because its implementation may change in the future.
> BTW, I think current ublk_rq_task_work_cb() is not correct because
> it does not always call io_uring_cmd_done() before returning.
> nvme_uring_cmd_end_io() always calls io_uring_cmd_done() for each ioucmd
> no matter the rq succeeds or fails.
>
> (2) Suppose io_uring does export the symbol 'flush_fallback_work', should we call
> it before starting a new process(recovery)?
> What if fallback wq is not scheduled immediately if there are many processes
> running and the system overhead is heavy. In this case the recovery process
> may wait for too long. Really we should not depend on fallback wq and please
> let the fallback wq complete the ioucmd itself.
>
> >
> > With this way, the current abort logic won't be changed much.
> >
> > And user recovery should only be started _after_ ublk device is found
> > as aborted.
>
> START_RECOVERY will check if all ubq_daemons(the process) are PF_EXITING.

That is different. If START_RECOVERY is only run on aborted device, the
recovery handler could be simplified.

>
> >
> >>
> >> The recovery machenism needs to complete all ioucmds of a dying ubq
> >> to avoid leaking io_uring ctx. But as talked above, we are unsafe
> >> to call io_uring_cmd_done() in the recovery task if fallback wq happens
> >> to run simultaneously. This is a UAF case because io_uring ctx may be
> >> freed. Actually a similar case happens in
> >> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> >> re-issued request aborted previously to ioucmd's task_work).
> >
> > If you take the above approach, I guess there isn't such problem because
> > abort can handle the case well as before.
>
> Ming, we did think this approach(quiesce, requeue rq/complete ioucmd)
> at the very beginning. But we decided to drop it because we don not want
> rely on 'flush fallback wq' machenism, which
> makes ublk_drv rely on io_uring's internal implementation.

Then the focus is 'flush fallback wq', please see my above question.


Thanks,
Ming