On 29/09/2022 15:57, Christian König wrote:
> Am 29.09.22 um 16:53 schrieb Steven Price:
>> On 14/09/2022 17:43, Arvind Yadav wrote:
>>> Using the parent fence instead of the finished fence
>>> to get the job status. This change is to avoid GPU
>>> scheduler timeout error which can cause GPU reset.
>>
>> I'm able to reproduce crashes on Panfrost and I believe this commit is
>> the cause. Specifically it's possible for job->s_fence->parent to be
>> NULL.
>>
>> The underlying issue seems to involve drm_sched_resubmit_jobs_ext() - if
>> the run_job() callback returns an error it will set s_fence->parent to
>> NULL after signalling s_fence->finished:
>>
>>     fence = sched->ops->run_job(s_job);
>>     i++;
>>
>>     if (IS_ERR_OR_NULL(fence)) {
>>         if (IS_ERR(fence))
>>             dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>>
>>         s_job->s_fence->parent = NULL;
>>
>> I don't understand the reasoning behind this change, but it doesn't seem
>> right to be using the parent fence when we have code which can be
>> setting that pointer to NULL.
>>
>> Since I don't understand the reasoning my only suggestion is to revert
>> this patch (and potentially the dependent patch "dma-buf: Check status
>> of enable-signaling bit on debug"?).
>>
>> Can anyone suggest a better fix?
>>
>> Thanks,
>> Steve
>
> Well, first of all please absolutely don't use
> drm_sched_resubmit_jobs_ext()!

Panfrost isn't using drm_sched_resubmit_jobs_ext() directly but via
drm_sched_resubmit_jobs().

> It was an extremely bad idea in amdgpu to approach GPU reset by
> re-submitting jobs and it was an even worse idea to push this into the
> scheduler.
>
> The design of dma_fence is that you submit once and *only* once and
> then get a result for that submission. If re-submission is desirable it
> should be done in userspace or at least at higher levels.

Panfrost has an interesting feature where it's possible to rescue a job
during a GPU reset. Because jobs are queued on the GPU, if a job hasn't
actually started executing then it's quite possible to safely resubmit
it from the kernel driver, and user space doesn't need to be involved.
The benefit of this is that if another process has hung the GPU, that
process's jobs can be killed off without affecting any other innocent
processes.

One option would be to hide all this from the scheduler, but I can't see
how to do that without also hiding the actual reset from the scheduler.

Admittedly at the moment Panfrost is far too aggressive at resetting and
will perform a GPU reset in conditions where it's completely
unnecessary. There's work to do there but I haven't had the time to look
at it yet.

> Apart from that, yes, a NULL check is missing here but that should be
> trivial to fix.

What I'm struggling to get my head round is whether it's correct to
always treat the job as signalled just because s_fence->parent is NULL?

Thanks,
Steve

> Thanks,
> Christian.
The patch in question, for reference:

Signed-off-by: Arvind Yadav <Arvind.Yadav@xxxxxxx>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
Changes in v1, v2:
  - Enable signaling for finished fence in sche_main() is removed
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e0ab14e0fb6b..2ac28ad11432 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);

-	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
+	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
@@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	if (next) {
 		next->s_fence->scheduled.timestamp =
-			job->s_fence->finished.timestamp;
+			job->s_fence->parent->timestamp;
 		/* start TO timer for next job */
 		drm_sched_start_timeout(sched);
 	}
_______________________________________________
Linaro-mm-sig mailing list -- linaro-mm-sig@xxxxxxxxxxxxxxxx
To unsubscribe send an email to linaro-mm-sig-leave@xxxxxxxxxxxxxxxx