Re: Reply: [PATCH] io-wq: set task TASK_INTERRUPTIBLE state before schedule_timeout

From: Jens Axboe
Date: Wed Oct 28 2020 - 17:37:50 EST


On 10/27/20 8:47 PM, Zhang, Qiang wrote:
>
>
> ________________________________________
> From: Jens Axboe <axboe@xxxxxxxxx>
> Sent: October 27, 2020 21:35
> To: Zhang, Qiang
> Cc: io-uring@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] io-wq: set task TASK_INTERRUPTIBLE state before schedule_timeout
>
> On 10/26/20 9:09 PM, qiang.zhang@xxxxxxxxxxxxx wrote:
>> From: Zqiang <qiang.zhang@xxxxxxxxxxxxx>
>>
>> In the 'io_wqe_worker' thread, once the work in 'wqe->work_list' has
>> finished, 'wqe->work_list' is empty. After that, when
>> '__io_worker_idle' returns false, the task state is TASK_RUNNING, and it
>> needs to be set to TASK_INTERRUPTIBLE before calling schedule_timeout().
>>
>> I don't think that's safe - what if someone added work right before you
>> call schedule_timeout_interruptible? Something ala:
>>
>>
>> io_wq_enqueue()
>> set_current_state(TASK_INTERRUPTIBLE);
>> schedule_timeout(WORKER_IDLE_TIMEOUT);
>>
>> then we'll have work added and the task state set to running, but the
>> worker itself just sets us to non-running and will hence wait
>> WORKER_IDLE_TIMEOUT before the work is processed.
>>
>> The current situation will do one extra loop for this case, as the
>> schedule_timeout() just ends up being a nop and we go around again.
>
> Although the worker's task state is TASK_RUNNING, the call to
> schedule_timeout() means the current worker can still be switched
> out. If the current worker is set to non-running and is switched out,
> the scheduler will call io_wq_worker_sleeping() to wake up a free
> worker, provided wqe->free_list is not empty.

It'll only be swapped out for TASK_RUNNING if we should be running other
work, which would happen on next need-resched event anyway. And the miss
you're describing is an expensive one, as it entails creating a new
thread and switching to that. That's not a great way to handle a race.

So I'm a bit puzzled here - yes we'll do an extra loop and check for the
dropping of mm, but that's really minor. The solution is a _lot_ more
expensive for hitting the race of needing a new worker, but missing it
because you unconditionally set the task to non-running. On top of that,
it's also not the idiomatic way to wait for events, which is typically:

is event true, break if so
set_current_state(TASK_INTERRUPTIBLE);
event comes in, task set runnable
check again, schedule
doesn't schedule, since we were set runnable

or variants thereof, using waitqueues.
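
To make the ordering concrete, here is a minimal userspace sketch of the two
sequences discussed above. It is not io-wq code and runs single-threaded:
every sim_* name is hypothetical, the enum stands in for TASK_RUNNING and
TASK_INTERRUPTIBLE, and sim_schedule() only models whether schedule_timeout()
would actually sleep. The racing enqueue is placed by hand at the worst
possible point in each sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for TASK_RUNNING / TASK_INTERRUPTIBLE. */
enum sim_state { SIM_RUNNING, SIM_INTERRUPTIBLE };

struct sim_task {
	enum sim_state state;
	bool work_pending;	/* the "event" the worker waits for */
};

/* Models a waker: publish the work, then mark the task runnable. */
static void sim_enqueue(struct sim_task *t)
{
	assert(t != NULL);
	t->work_pending = true;
	t->state = SIM_RUNNING;
}

/*
 * Models schedule_timeout(): returns true if the task would really sleep,
 * false if the call is effectively a nop because the task is runnable.
 */
static bool sim_schedule(struct sim_task *t)
{
	assert(t != NULL);
	bool slept = (t->state != SIM_RUNNING);
	t->state = SIM_RUNNING;	/* the task is running again on return */
	return slept;
}

/* Ordering from the proposed patch: state is set unconditionally. */
bool patched_order_sleeps(void)
{
	struct sim_task t = { SIM_RUNNING, false };

	sim_enqueue(&t);		/* work added right before going idle */
	t.state = SIM_INTERRUPTIBLE;	/* overwrites the waker's RUNNING */
	return sim_schedule(&t);	/* sleeps even though work is pending */
}

/* Idiomatic ordering: set state first, re-check the event, then schedule. */
bool idiomatic_order_sleeps(void)
{
	struct sim_task t = { SIM_RUNNING, false };

	if (t.work_pending)		/* is event true, break if so */
		return false;
	t.state = SIM_INTERRUPTIBLE;
	if (t.work_pending) {		/* check again */
		t.state = SIM_RUNNING;
		return false;
	}
	sim_enqueue(&t);		/* event comes in, task set runnable */
	return sim_schedule(&t);	/* doesn't sleep: we were set runnable */
}
```

In this model patched_order_sleeps() returns true (the worker goes to sleep
with work queued), while idiomatic_order_sleeps() returns false even though
the enqueue lands after the last check.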

So while I'm of course not opposed to fixing the io-wq loop so that we
don't do that last loop when going idle, a) it basically doesn't matter,
and b) the proposed solution is much worse. If there was a more elegant
solution without worse side effects, then we can discuss that.

--
Jens Axboe