Re: [PATCH] scripts/setlocalversion: make git describe output more reliable

From: Rasmus Villemoes
Date: Wed Sep 16 2020 - 16:49:36 EST


On 16/09/2020 16.28, Masahiro Yamada wrote:
> On Fri, Sep 11, 2020 at 5:28 PM Rasmus Villemoes
> <linux@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> On 10/09/2020 21.05, Brian Norris wrote:
>>> On Thu, Sep 10, 2020 at 7:35 AM Masahiro Yamada <masahiroy@xxxxxxxxxx> wrote:
>>>> On Thu, Sep 10, 2020 at 8:57 PM Rasmus Villemoes
>>>> <linux@xxxxxxxxxxxxxxxxxx> wrote:
>>>>> So in order to avoid `uname -a` output relying on such random details
>>>>> of the build environment which are rather hard to ensure are
>>>>> consistent between developers and buildbots, use an explicit
>>>>> --abbrev=15 option (and for consistency, also use rev-parse --short=15
>>>>> for the unlikely case of no signed tags being usable).
>>>
>>> For the patch:
>>>
>>> Reviewed-by: Brian Norris <briannorris@xxxxxxxxxxxx>
>>>
>>>> I agree that any randomness should be avoided.
>>>>
>>>> My question is, do we need 15-digits?
>>> ...
>>>> So, I think the conflict happens
>>>> only when we have two commits that start with the same 7-digits
>>>> in the _same_ release. Is this correct?
>>
>> No.
>>
>>> For git-describe (the case where we have a tag to base off):
>>> "use <n> digits, or as many digits as needed to form a unique object name"
>>
>> Yes, the abbreviated hash that `git describe` produces is unique among
>> all objects (and objects are more than just commits) in the current
>> repo, so what matters for probability-of-collision is the total number
>> of objects - linus.git itself probably grows ~60000 objects per release.
>>
>> As for "do we need 15 digits", well, in theory the next time I merge the
>> next rt-stable tag into our kernel I could end up with a commit that
>> matches some existing object in the first 33 hex chars at which point I
>> should have chosen 34 - but of course that's so unlikely that it's
>> irrelevant.
>>
>> But the upshot of that is that there really is no objective answer to
>> "how many digits do we need", so it becomes a tradeoff between "low
>> enough probability that anyone anywhere in the next few years would have
>> needed more than X when building their own kernel" and readability of
>> `uname -r` etc. So I decided somewhat arbitrarily that each time one
>> rolls a new release, there should be less than 1e-9 probability that
>> HEAD collides with some other object when abbreviated to X hexchars.
>> X=12 doesn't pass that criteria even when the repo has only 10M objects
>> (and, current linus.git already has objects that need 12 to be unique,
>> so such collisions do happen in practice, though of course very rarely).
>> 13 and 14 are just weird numbers, so I ended with 15, which also allows
>> many many more objects in the repo before the probability crosses that
>> 1e-9 threshold.
>>
>
>
> This is because you use the output from git as-is.
>
> Why are you still trying to rely on such obscure behavior of git?
>
>
> It is pretty easy to get the fixed number of digits reliably.
>
> For example,
> git rev-parse --verify HEAD 2>/dev/null | cut -c1-7
>
>
> It always returns a 7-digits hash,
> and two different people will get the same result for sure.
>
> Your solution, --short=15, means "at least 15",
> which still contains ambiguity.
>
>
>
> As I already noted, the kernelrelease string is
> constructed in this format:
>
> <kernel-version>-<number-of-commits-on-top>-<abbreviated-commit-hash>
>
>
> For example, if I enable CONFIG_LOCALVERSION_AUTO=y
> in today's Linus tree, I got this:
>
> 5.9.0-rc5-00005-gfc4f28bb3daf
>
>
> What if the number of digits were 7?
>
> 5.9.0-rc5-00005-gfc4f28b
>
> I see no problem here.

The problem is that currently, the build relies on things which cannot
easily be controlled or reproduced between different developers: Apart
from git version (which is reasonable to mandate is the same, just like
one must use same compiler, binutils etc. to get binary reproducible
output), it depends on whether ~/.gitconfig has a core.abbrev setting -
and even worse, it depends on the _total number of objects in the source
git repo_, i.e. something that has nothing to do with what is currently
in the work tree at all.

And that leads to real bugs when one builds external modules that end up
in one directory in the rootfs, while `uname -r` tells modprobe to look
in some other directory (differing in the length of the abbreviated hash).

> Even if we have another object that starts with "fc4f28b",
> the kernelrelease string is still unique thanks to the
> <kernel-version>-<number-of-commits-on-top> prefix.
>
> Why do we care about the uniqueness of the abbreviated
> hash in the whole git history?

Because when I ask a customer "what kernel are you running", and they
tell me "4.19.45-rt67-00567-123abc8", I prefer that 123abc8 uniquely
identifies the right commit, instead of me having to dig around each of
the commits starting with that prefix and see which one of them matches
"is exactly 567 commits from 4.19.45-rt67". 7 hexchars is nowhere near
enough for that today, which is why Linus himself did that "auto-tune
abbrev based on repo size" patch for git.git.

But what I mostly care about is getting a consistent result. And yes,
you're right that the porcelain command 'git describe' could end up
using something more than 15 digits (though that's extremely unlikely).
So if you prefer, I can try to rewrite the logic purely in terms of
plumbing commands. But that's a much more invasive patch, and one would
obviously lose the guarantee of the abbreviation being unique among
current git objects.

Alternatively, would you consider a Kconfig knob, LOCALVERSION_ABBREV?

Rasmus