Re: [PATCH v2 1/2] drm/sched: memset() 'job' in drm_sched_job_init()

From: Tvrtko Ursulin
Date: Fri Sep 13 2024 - 09:31:24 EST



On 13/09/2024 13:30, Philipp Stanner wrote:
On Fri, 2024-09-13 at 12:56 +0100, Tvrtko Ursulin wrote:

Hi,

On 28/08/2024 10:41, Philipp Stanner wrote:
drm_sched_job_init() has no control over how users allocate struct
drm_sched_job. Unfortunately, the function can also not set some
struct
members such as job->sched.

job->sched usage from within looks like a bug. But not related to the
memset you add.

For this one something like this looks easiest for a start:

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index ab53ab486fe6..877113b01af2 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -788,7 +788,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
                  * or worse--a blank screen--leave a trail in the
                  * logs, so this can be debugged easier.
                  */
-               drm_err(job->sched, "%s: entity has no rq!\n",
__func__);
+               pr_err("%s: entity has no rq!\n", __func__);
                 return -ENOENT;
         }

Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to
variable
number of run-queues")
Cc: <stable@xxxxxxxxxxxxxxx> # v6.7+

Danilo and I already solved that:

https://lore.kernel.org/all/20240827074521.12828-2-pstanner@xxxxxxxxxx/

Ah.. I saw the link to this in your maintainership thread and superficially assumed it is among the pending stuff. All good.


This could theoretically lead to UB by users dereferencing the
struct's
pointer members too early.

Hmm if drm_sched_job_init returned an error callers should not
dereference anything. What was actually the issue you were debugging?

I was learning about the scheduler, wrote a dummy driver and had
awkward behavior. Turned out it was this pointer not being initialized.
I would have seen it immediately if it were NULL.

The actual issue was and is IMO that a function called
drm_sched_job_init() initializes the job. But it doesn't, it only
partially initializes it. Only after drm_sched_job_arm() ran you're
actually ready to go.

In my experience one good approach when developing stuff is to have the various kernel debugging aids enabled. Lockdep, SLAB debugging, memory poisoning, kfence.. Then if you were allocating your job without GFP_ZERO, _and_ dereferencing something too early out of misunderstanding of the API, you would get something obvious in the oops and not a random pointer.

Which also applies to various CI systems, such as the Intel's one which already runs a debug kernel and a lot of these mistakes are caught instantly.

Adding a memset is I think not the best solution since it is very
likely
redundant to someone doing a kzalloc in the first place.

It is redundant in most cases, but it is effectively for free. I
measured the runtime with 1e6 jobs with and without memset and there
was no difference.

I guess if kzalloc and drm_sched_job_init() are close enough in time so that cachelines stays put, and depending how you measure, it may be hard to see but cost if still there.

For instance https://lore.kernel.org/amd-gfx/20240813140310.82706-1-tursulin@xxxxxxxxxx/ I can see with perf that both memsets are hotspots even when testing with glxgears and vsync off.

But I don't feel too strongly about this and there definitely is sense in initializing everything. Perhaps even instead of a memset we should use correct methods per field? Since in there we have spcs_node, atomic_t, ktime_t, dma_fence_cb (even in an annoying union), drm_sched_priority.. In an ideal world all those would have their initializers. But some don't so meh.

Regards,

Tvrtko

It is easier to debug such issues if these pointers are initialized
to
NULL, so dereferencing them causes a NULL pointer exception.
Accordingly, drm_sched_entity_init() does precisely that and
initializes
its struct with memset().

Initialize parameter "job" to 0 in drm_sched_job_init().

Signed-off-by: Philipp Stanner <pstanner@xxxxxxxxxx>
---
No changes in v2.
---
  drivers/gpu/drm/scheduler/sched_main.c | 8 ++++++++
  1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index 356c30fa24a8..b0c8ad10b419 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -806,6 +806,14 @@ int drm_sched_job_init(struct drm_sched_job
*job,
   return -EINVAL;
   }
+ /*
+ * We don't know for sure how the user has allocated.
Thus, zero the
+ * struct so that unallowed (i.e., too early) usage of
pointers that
+ * this function does not set is guaranteed to lead to a
NULL pointer
+ * exception instead of UB.
+ */
+ memset(job, 0, sizeof(*job));
+
   job->entity = entity;
   job->credits = credits;
   job->s_fence = drm_sched_fence_alloc(entity, owner);