On 04/30/2018 02:29 PM, Christian KÃnig wrote:
Am 30.04.2018 um 18:10 schrieb Andrey Grodzovsky:
On 04/30/2018 12:00 PM, Oleg Nesterov wrote:
On 04/30, Andrey Grodzovsky wrote:
What about changing PF_SIGNALED to PF_EXITING inlet me repeat, please don't use task->exit_code. And in fact this check is racy
drm_sched_entity_do_release
-ÂÂÂÂÂÂ if ((current->flags & PF_SIGNALED) && current->exit_code == SIGKILL)
+ÂÂÂÂÂ if ((current->flags & PF_EXITING) && current->exit_code == SIGKILL)
But this doesn't matter. Say, we can trivially add SIGNAL_GROUP_KILLED_BY_SIGKILL,
or do something else,
Can you explain where is the race and what is a possible alternative then ?
The race is that the release doesn't necessarily comes from the process/context which used the fd.
E.g. it is just called when the last reference count goes away, but that can be anywhere not related to the original process using it, e.g. in a kernel thread or a debugger etc...
I still don't see how it is a problem, if release comes from another task, then our process (let's say Firefox who received SIGKILL) won't even get here since fput will not call .release so it will die instantly,
the last process who holds the reference (let's say the debugger) when finish will just go to wait_event_timeout and wait for SW queue to be empty from jobs (if any). So all the jobs will have their chance to get to HW anyway.
The approach with the flush is indeed a really nice idea and I bite myself to not had that previously as well.
Regarding your request from another email to investigate more on .flush
Looked at the code and did some reading -
From LDD3
"The flush operation is invoked when a process closes its copy of a file descriptor for a device; it should execute (and wait for) any outstanding operations on the device"
From printing back trace from dummy .flush hook in our driver -
Normal exit (process terminates on it's own)
[Â 295.586130 <ÂÂÂ 0.000006>]Â dump_stack+0x5c/0x78
[Â 295.586273 <ÂÂÂ 0.000143>]Â my_flush+0xa/0x10 [amdgpu]
[Â 295.586283 <ÂÂÂ 0.000010>]Â filp_close+0x4a/0x90
[Â 295.586288 <ÂÂÂ 0.000005>]Â SyS_close+0x2d/0x60
[Â 295.586295 <ÂÂÂ 0.000003>]Â do_syscall_64+0xee/0x270
Exit triggered by fatal signal (not handled signal, including SIGKILL)
[Â 356.551456 <ÂÂÂ 0.000008>]Â dump_stack+0x5c/0x78
[Â 356.551592 <ÂÂÂ 0.000136>]Â my_flush+0xa/0x10 [amdgpu]
[Â 356.551597 <ÂÂÂ 0.000005>]Â filp_close+0x4a/0x90
[Â 356.551605 <ÂÂÂ 0.000008>]Â put_files_struct+0xaf/0x120
[Â 356.551615 <ÂÂÂ 0.000010>]Â do_exit+0x468/0x1280
[Â 356.551669 <ÂÂÂ 0.000009>]Â do_group_exit+0x89/0x140
[Â 356.551679 <ÂÂÂ 0.000010>]Â get_signal+0x375/0x8f0
[Â 356.551696 <ÂÂÂ 0.000017>]Â do_signal+0x79/0xaa0
[Â 356.551756 <ÂÂÂ 0.000014>] exit_to_usermode_loop+0x83/0xd0
[Â 356.551764 <ÂÂÂ 0.000008>]Â do_syscall_64+0x244/0x270
So as it was said here before, it will be called for every process closing his FD to the file.
But again, I don't quire see yet what we earn by using .flush, is it that you force every process holding reference to DRM file not
die until all jobs are submitted to HW (as long as the process not being killed by a signal) ?
Andrey