spdxcheck: python git module considered harmful (was RE: [PATCH] scripts/spdxcheck: Limit the scope of git.Repo)

From: Bird, Tim
Date: Tue Apr 08 2025 - 14:14:21 EST


> -----Original Message-----
> From: Gon Solo <gonsolo@xxxxxxxxx>
> It's a known problem:
> https://github.com/gitpython-developers/GitPython/issues/2003
> https://github.com/python/cpython/issues/118761#issuecomment-2661504264
>

For what it's worth, I've always been a bit skeptical of the use of the python git module
in spdxcheck.py. Its use makes it impossible to run spdxcheck on a kernel source tree
from a tarball (i.e., on source that is not inside a git repo). Also, from what I can see in
spdxcheck.py, the module is only used to get the top directory for either the LICENSES dir,
the top dir of the kernel source tree, or the directory to scan passed on the
spdxcheck.py command line, and then to call the repo.traverse() function on that directory.

This ends up silently skipping any files in the source directory tree that are not yet
checked into git (which I've run into before when using the tool).

I think the code could be refactored fairly easily to eliminate the use of the git
module and overcome these issues. I'm not sure whether removing the module would
eliminate the yield operation (used inside repo.traverse()), which seems to be causing the
problem found here. In my experience with python, it helps to use as few non-core
modules as possible, because they tend to break like this occasionally.
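
To give a rough idea of the direction, here is a minimal sketch using only the
standard library. It is not taken from spdxcheck.py: the helper names
find_kernel_top() and list_files() are hypothetical, and using the presence of a
LICENSES directory to identify the top of the tree is an assumption on my part.
It collects results in a plain list; I have not checked whether that avoids the
problem reported in this thread.

```python
import os

def find_kernel_top(start):
    """Walk upward from 'start' to the first directory containing LICENSES/.

    Hypothetical helper: treating LICENSES/ as the marker for the top of the
    kernel tree is an assumption, not what spdxcheck.py does today.
    Returns None if no such directory is found.
    """
    path = os.path.abspath(start)
    while True:
        if os.path.isdir(os.path.join(path, "LICENSES")):
            return path
        parent = os.path.dirname(path)
        if parent == path:      # reached the filesystem root
            return None
        path = parent

def list_files(top, skip_dirs=(".git",)):
    """Return all regular files under 'top', whether tracked by git or not."""
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            # Ignore symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                found.append(full)
    return found
```

Something like list_files(find_kernel_top(".")) would then work in a tree unpacked
from a tarball, and would also pick up files that have not been added to git yet.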

Let me know if anyone objects to me working up a refactoring of spdxcheck.py
eliminating the use of the python 'git' module, and submitting it for review.

Thanks,
-- Tim