Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

From: Masahiro Yamada
Date: Wed Sep 16 2020 - 21:04:19 EST


On Thu, Sep 17, 2020 at 4:31 AM Rasmus Villemoes
<linux@xxxxxxxxxxxxxxxxxx> wrote:
>
> On 16/09/2020 20.01, Masahiro Yamada wrote:
> > On Thu, Sep 17, 2020 at 12:23 AM Rasmus Villemoes
> > <linux@xxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> On 16/09/2020 16.28, Masahiro Yamada wrote:
> >>> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
> >>> <linux@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>
>
> >>> Why do we care about the uniqueness of the abbreviated
> >>> hash in the whole git history?
> >>
> >> Because when I ask a customer "what kernel are you running", and they
> >> tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
> >> identifies the right commit, instead of me having to dig around each of
> >> the commits starting with that prefix and see which one of them matches
> >> "is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
> >> enough for that today, which is why Linus himself did that "auto-tune
> >> abbrev based on repo size" patch for git.git.
> >
> > I like:
> >
> > git rev-parse --verify HEAD 2>/dev/null | cut -c1-15
> >
> > better than:
> >
> > git rev-parse --verify --short=15 HEAD 2>/dev/null
> >
> > The former produces the deterministic kernelrelease string.
> >
> >
> > But, let's reconsider if we need as long as 15-digits.
> >
> > There are a couple of kinds of objects in git: commit, tree, blob.
> >
> > I think you came up with 15-digits to ensure the uniqueness
> > in _all_ kinds of objects.
> >
> > But, when your customer says "4.19.45-rt67-00567-123abc8",
> > 123abc8 is apparently a commit instead of a tree or a blob.
> >
> >
> >
> > In the context of the kernel version, we can exclude
> > tree and blob objects.
> >
> > In other words, I think "grows ~60000 objects per release"
> > is somewhat over-estimating.
> >
> > We have ~15000 commits per release. So, the difference
> > is just 4x factor, though...
> >
> >
> >
> > BTW, I did some experiments, and I noticed
> > "git log" accepts a shorter hash
> > than "git show" does presumably because
> > "git log" only takes a commit.
> >
> >
> >
> > For example, "git show 06a0d" fails, but
> > "git log 06a0d" works.
> >
> >
> > masahiro@oscar:~/ref/linux$ git show 06a0d
> > error: short SHA1 06a0d is ambiguous
> > hint: The candidates are:
> > hint: 06a0df4d1b8b commit 2020-09-08 - fbcon: remove now unusued
> > 'softback_lines' cursor() argument
> > hint: 06a0d81911b3 tree
> > hint: 06a0dc5a84d2 tree
> > hint: 06a0d1947c77 blob
> > hint: 06a0df434249 blob
> > fatal: ambiguous argument '06a0d': unknown revision or path not in the
> > working tree.
> > Use '--' to separate paths from revisions, like this:
> > 'git <command> [<revision>...] -- [<file>...]'
> > masahiro@oscar:~/ref/linux$ git log --oneline -1 06a0d
> > 06a0df4d1b8b fbcon: remove now unusued 'softback_lines' cursor() argument
> >
> >
> >
> >
> > What is interesting is, if you prepend <tag>-<number-of-commits>-
> > to the abbreviated hash, "git show" accepts as short as a commit
> > "git log" does.
> >
> > masahiro@oscar:~/ref/linux$ git show v5.9-rc5-00002-g06a0d | head -1
> > commit 06a0df4d1b8b13b551668e47b11fd7629033b7df
> >
> >
> > I guess git cleverly understands it refers to a commit object
> > since "v5.9-rc5-00002-" prefix only makes sense
> > in the commit context.
> >
>
> Well, yes, but unfortunately only so far as applying the same "the user
> means a commit object" logic; git doesn't do anything to actually use
> the prefix to disambiguate. In fact, looking at a semi-random commit in
> 4.19-stable v4.19.114-7-g236c445eb352:
>
> $ git show 236c44
> error: short SHA1 236c44 is ambiguous
> hint: The candidates are:
> hint: 236c445eb352 commit 2020-03-13 - drm/bochs: downgrade
> pci_request_region failure from error to warning
> hint: 236c448cb6e7 commit 2011-09-08 - usbmon vs. tcpdump: fix dropped
> packet count
> hint: 236c445b1930 blob
>
> Now you're right that we get
>
> $ git show v4.19.114-7-g236c445 | head -n1
> commit 236c445eb3529aa7c976f8812513c3cb26d77e27
>
> so it knows we're not looking at that blob. But "git show
> v4.19.114-7-g236c44" still fails because there are two commits. Adding
> to the fun is that "git show v4.19.114-7-g236c448" actually shows the
> usbmon commit, which is old v3.2 stuff and certainly not a descendant of
> v4.19.114.


Oh, I did not notice that.
Maybe git can be improved to check the validity of
the 'git describe' form, but that is another story.



> I didn't actually know that git would even accept those prefixed strings
> as possible refspecs, and for a moment you had me hoping that git would
> actually do the disambiguation using that prefix, which would certainly
> make 7 hex chars enough; unfortunately that's not the case.
>
> > Keeping those above in mind, I believe 15-digits is too long.
>
> Well, as you indicated, commits are probably ~1/4 of all objects, i.e.
> half a hexchar worth of bits. So the commit/object distinction shouldn't
> matter very much for the choice of suitable fixed length.
>
> > So, I propose two options.
> >
> >
> > [1] 7 digits
> >
> > The abbreviated hash part may not uniquely identify
> > the commit. In that case, you need some extra git
> > operations to find out which one is it.
> >
> > As for the kernel build,
> > <kernel-version>-<number-of-commits>-<7-digits> is enough
> > to provide the unique kernelrelease string.
> >
> >
> >
> > [2] 12 digits
> >
> > This matches to the Fixes: tag policy specified in
> > Documentation/process/submitting-patches.rst
> >
> > The abbreviated hash part is very likely unique
> > in the commit object space.
> > (Of course, it is impossible to guarantee the uniqueness)
>
> I can live with 12. It would be extremely rare that I would have to do
> some extra git yoga to figure out which commit they actually mean. I
> think we can still use git describe to do the bulk of the work (the 'git
> rev-parse | cut ... is easy, but it's somewhat tedious to find the
> closest tag, then compute the #commits-on-top part), then just have the
> awk script used for pretty-printing (the %05d of the #commits-on-top)
> can probably also be used to chop the abbreviated sha1 at 12 chars,
> should git have needed to make it longer. Please see below for an
> updated patch+commit log.


LGTM.

Could you send it to ML as a patch?



> >> Alternatively, would you consider a Kconfig knob, LOCALVERSION_ABBREV?
> >
> >
> > No, I do not think LOCALVERSION_ABBREV is a good idea.
>
> Neither do I; I considered it before sending the patch but decided that yes:
>
> > We should eliminate this problem
> > irrespective of the kernel configuration.
>
> Rasmus
>
> ====
>
> something like this, then?
>
>
> scripts/setlocalversion: make git describe output more reliable
>
> When building for an embedded target using Yocto, we're sometimes
> observing that the version string that gets built into vmlinux (and
> thus what uname -a reports) differs from the path under /lib/modules/
> where modules get installed in the rootfs, but only in the length of
> the -gabc123def suffix. Hence modprobe always fails.
>
> The problem is that Yocto has the concept of "sstate" (shared state),
> which allows different developers/buildbots/etc. to share build
> artifacts, based on a hash of all the metadata that went into building
> that artifact - and that metadata includes all dependencies (e.g. the
> compiler used etc.). That normally works quite well; usually a clean
> build (without using any sstate cache) done by one developer ends up
> being binary identical to a build done on another host. However, one
> thing that can cause two developers to end up with different builds
> [and thus make one's vmlinux package incompatible with the other's
> kernel-dev package], which is not captured by the metadata hashing, is
> this `git describe`: The output of that can be affected by
>
> (1) git version: before 2.11 git defaulted to a minimum of 7, since
> 2.11 (git.git commit e6c587) the default is dynamic based on the
> number of objects in the repo
> (2) hence even if both run the same git version, the output can differ
> based on how many remotes are being tracked (or just lots of local
> development branches or plain old garbage)
> (3) and of course somebody could have a core.abbrev config setting in
> ~/.gitconfig
>
> So in order to avoid `uname -a` output relying on such random details
> of the build environment which are rather hard to ensure are
> consistent between developers and buildbots, make sure the abbreviated
> sha1 always consists of exactly 12 hex characters. That is consistent
> with the current rule for -stable patches, and is almost always enough
> to identify the head commit unambigously - in the few cases where it
> does not, the v5.4.3-00021- prefix would certainly nail it down.
>
> diff --git a/scripts/setlocalversion b/scripts/setlocalversion
> index 20f2efd57b11..b51072167df1 100755
> --- a/scripts/setlocalversion
> +++ b/scripts/setlocalversion
> @@ -45,7 +45,7 @@ scm_version()
>
> # Check for git and a git repo.
> if test -z "$(git rev-parse --show-cdup 2>/dev/null)" &&
> - head=$(git rev-parse --verify --short HEAD 2>/dev/null); then
> + head=$(git rev-parse --verify HEAD 2>/dev/null); then
>
> # If we are at a tagged commit (like "v2.6.30-rc6"), we ignore
> # it, because this version is defined in the top level Makefile.
> @@ -59,11 +59,22 @@ scm_version()
> fi
> # If we are past a tagged commit (like
> # "v2.6.30-rc5-302-g72357d5"), we pretty print it.
> - if atag="$(git describe 2>/dev/null)"; then
> - echo "$atag" | awk -F- '{printf("-%05d-%s", $(NF-1),$(NF))}'
> -
> - # If we don't have a tag at all we print -g{commitish}.
> + #
> + # Ensure the abbreviated sha1 has exactly 12
> + # hex characters, to make the output
> + # independent of git version, local
> + # core.abbrev settings and/or total number of
> + # objects in the current repository - passing
> + # --abbrev=12 ensures a minimum of 12, and the
> + # awk substr() then picks the 'g' and first 12
> + # hex chars.
> + if atag="$(git describe --abbrev=12 2>/dev/null)"; then
> + echo "$atag" | awk -F- '{printf("-%05d-%s",
> $(NF-1),substr($(NF),0,13))}'
> +
> + # If we don't have a tag at all we print -g{commitish},
> + # again using exactly 12 hex chars.
> else
> + head="$(echo $head | cut -c1-12)"
> printf '%s%s' -g $head
> fi
> fi



--
Best Regards
Masahiro Yamada