Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

From: Eric W. Biederman
Date: Mon Apr 30 2018 - 12:26:01 EST

Next message: Ram Pai: "Re: [PATCH 0/9] [v3] x86, pkeys: two protection keys bug fixes"
Previous message: Pierre-Louis Bossart: "Re: [alsa-devel] [PATCH v2] ASoC: da7219: read fmw property to get mclk for non-dts systems"
In reply to: Oleg Nesterov: "Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process."
Next in thread: Andrey Grodzovsky: "Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Christian KÃnig <ckoenig.leichtzumerken@xxxxxxxxx> writes:

> Hi Eric,
>
> sorry for the late response, was on vacation last week.
>
> Am 26.04.2018 um 02:01 schrieb Eric W. Biederman:
>> Andrey Grodzovsky <Andrey.Grodzovsky@xxxxxxx> writes:
>>
>>> On 04/25/2018 01:17 PM, Oleg Nesterov wrote:
>>>> On 04/25, Andrey Grodzovsky wrote:
>>>>> here (drm_sched_entity_fini) is also a bad idea, but we still want to be
>>>>> able to exit immediately
>>>>> and not wait for GPU jobs completion when the reason for reaching this code
>>>>> is because of KILL
>>>>> signal to the user process who opened the device file.
>>>> Can you hook f_op->flush method?
>
> THANKS! That sounds like a really good idea to me and we haven't investigated
> into that direction yet.

For the backwards compatibility concerns you cite below the flush method
seems a much better place to introduce the wait. You at least really
will be in a process context for that. Still might be in exit but at
least you will be legitimately be in a process.

>>> But this one is called for each task releasing a reference to the the file, so
>>> not sure I see how this solves the problem.
>> The big question is why do you need to wait during the final closing a
>> file?
>
> As always it's because of historical reasons. Initially user space pushed
> commands directly to a hardware queue and when a processes finished we didn't
> need to wait for anything.
>
> Then the GPU scheduler was introduced which delayed pushing the jobs to the
> hardware queue to a later point in time.
>
> This wait was then added to maintain backward compability and not break
> userspace (but see below).

That make sense.

>> The wait can be terminated so the wait does not appear to be simply a
>> matter of correctness.
>
> Well when the process is killed we don't care about correctness any more, we
> just want to get rid of it as quickly as possible (OOM situation etc...).
>
> But it is perfectly possible that a process submits some render commands and
> then calls exit() or terminates because of a SIGTERM, SIGINT etc.. In this case
> we need to wait here to make sure that all rendering is pushed to the hardware
> because the scheduler might need resources/settings from the file
> descriptor.
>
> For example if you just remove that wait you could close firefox and get garbage
> on the screen for a millisecond because the remaining rendering commands where
> not executed.
>
> So what we essentially need is to distinct between a SIGKILL (which means stop
> processing as soon as possible) and any other reason because then we don't want
> to annoy the user with garbage on the screen (even if it's just for a few
> milliseconds).

I see a couple of issues.

- Running the code in release rather than in flush.

Using flush will catch every close so it should be more backwards
compatible. f_op->flush always runs in process context so looking at
current makes sense.

- Distinguishing between death by SIGKILL and other process exit deaths.

In f_op->flush the code can test "((tsk->flags & PF_EXITING) &&
(tsk->code == SIGKILL))" to see if it was SIGKILL that terminated
the process.

- Dealing with stuck queues (where this patchset came in).

For stuck queues you are going to need a timeout instead of the current
indefinite wait after PF_EXITING is set. From what you have described a
few milliseconds should be enough. If PF_EXITING is not set you can
still just make the wait killable and skip the timeout if that will give
a better backwards compatible user experience.

What can't be done is try and catch SIGKILL after a process has called
do_exit. A dead process is a dead process.

Eric

Next message: Ram Pai: "Re: [PATCH 0/9] [v3] x86, pkeys: two protection keys bug fixes"
Previous message: Pierre-Louis Bossart: "Re: [alsa-devel] [PATCH v2] ASoC: da7219: read fmw property to get mclk for non-dts systems"
In reply to: Oleg Nesterov: "Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process."
Next in thread: Andrey Grodzovsky: "Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]