[...]
Could the OOPS happen at resume time? Might it be that the kernelThe OOPS happens because the rq member of entity is NULL inWell that is clearly not the correct approach to fixing this. So clearly
drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
In drm_sched_entity_select_rq(), the code considers that
drb_sched_pick_best() might return a NULL value. When NULL, it assigns
NULL to entity->rq even if it had a non-NULL value before.
drm_sched_job_arm() does not deal with entities having a rq of NULL.
Fix this by leaving the entity on the engine it was instead of
assigning a NULL to its run queue member.
a NAK from my side.
The real question is why is amdgpu_cs_ioctl() called when all of
userspace should be frozen?
Regards,
Christian.
activates user-space
before all the components of the GPU finished their wakeup?
Maybe drm_sched_pick_best() returns NULL since no scheduler is ready yet?
Apart from whether amdgpu_cs_ioctl() should run at this point, I still think the
suggested change improves the code. drm_sched_pick_best() can return NULL.
drm_sched_entity_select_rq() can handle the NULL (partially).
drm_sched_job_arm() crashes on an entity that has rq set to NULL.
The handling of NULL values is half-baked.
In my opinion, you should define if drm_sched_pick_best() may put a NULL into
rq. If your answer is yes, it might put a NULL there; then, there should be a
BUG_ON(!entity->rq) after the invocation of drm_sched_entity_select_rq().
If your answer is no, the BUG_ON() should be in drm_sched_pick_best().
That helps guys with zero domain knowledge, like me, to figure out how
this is all
supposed to work.
best regards,
Philipp