Re: spdxcheck: python git module considered harmful (was RE: [PATCH] scripts/spdxcheck: Limit the scope of git.Repo)

From: Thomas Gleixner
Date: Tue Apr 08 2025 - 19:41:20 EST


On Tue, Apr 08 2025 at 17:34, Tim Bird wrote:
>> -----Original Message-----
> For what it's worth, I've always been a bit skeptical of the use of the python git module
> in spdxcheck.py. Its use makes it impossible to use spdxcheck on a kernel source tree
> from a tarball (ie, on source not inside a git repo). Also, from what I can see in spdxcheck.py,
> the way it's used is just to get the top directories for either the LICENSES dir,
> the top dir of the kernel source tree, or the directory to scan passed on the
> spdxcheck.py command line, and then to use the repo.traverse() function on said directory.
>
> This ends up excluding any files in the source directory tree that are not checked
> into git yet, silently skipping them (which I've run into before when
> using the tool).

The exactly same problem exists the other way round. Run an
unconstrained version of spdxcheck on a dirty source tree with lots of
leftovers, then it scans nonsense all the way instead of skipping some
not yet git tracked files.

The easiest way for me to achieve that was using git to exclude all of
the irrelevant noise, which I still consider to be a reasonable design
decision.

And yes, it ignores not yet tracked files, but if you want to check
them, then it's easy enough to commit them temporarily or provide a
dedicated file target to the tools, which ignores git.

> I think the code could be relatively easily refactored to eliminate the use of the git
> module, to overcome these issues. I'm not sure if removing the module would
> eliminate the yield operation (used inside repo.traverse()), which seems to be causing the
> problem found here. IMHO, in my experience when using python it is helpful
> to use as few non-core modules as possible, because they tend to break like this
> occasionally.
>
> Let me know if anyone objects to me working up a refactoring of spdxcheck.py
> eliminating the use of the python 'git' module, and submitting it for review.

I have no objections at all as long as it gives the same result of not
trying to scan random artifacts which might sit around in a source tree.

But not for the price that I have to create a tarball or a pristine
checked out tree first to run it. That'd be a usability regression to
begin with.

Good luck for coming up with a clever and clean solution for that!

Just for the record: I rather wish that people would contribute to
eliminate the remaining 17% (15397 files) which do not have SPDX
identifiers than complaining about the trivial to solve short-comings of
the tool, which was written to help this effort and to make sure that it
does not degrade.

Thanks,

tglx