Re: [dm-devel] LVM snapshot broke between 4.14 and 4.16

From: Linus Torvalds
Date: Fri Aug 03 2018 - 16:43:07 EST


On Fri, Aug 3, 2018 at 1:08 PM Alasdair G Kergon <agk@xxxxxxxxxx> wrote:
>
> If taking this approach, it might be better to use the current version
> i.e. where we add the kernel-side fix. IOW anything compiling against
> a uapi header taken from a kernel up to now will see the old behaviour,
> but anything newer (after we next increment the kernel dm version) will
> be required to change its userspace code if it has this problem in it.

No, please don't do things like that.

It's a nightmare. It just means that different people randomly have
different behavior depending on what development environment they had,
which is horrible for any situation where you do cross-builds, but
it's also horrible from a testing standpoint.

And it's a *huge* nightmare for stable releases and backporting fixes.
Absolutely *no* user space shjould ever care about kernel versions. If
you want to use a new system call or something, just *use* if, and
then if it fails, fall back to the old one. Never ever check "what's
the kernel version" or anything like that.

Instead, what I suggest we do is:

Case 1: if a user program notice that a new kernel breaks something, then

(a) report it to kernel people with a test-case

(b) if you notice that the *reason* a new kernel broke something is
because you had a bug and you can just fix that bug and it will always
work on all kernels, by all means just fix the bug. Obviously you
don't want to just keep buggy code around just because it was found by
a kernel change

(c) but even if (b) happens, do that report. In fact, you now have a
perfect test-case for the kernel people you report it to: you can tell
them *exactly* what changed, and what the two situations were.

(d) and if whoever you reported the kernel breakage to doesn't seem
to take it seriously, just escalate it to me.

Note that for the (d) case, there are situations where even _I_ won't
necessarily take it seriously. If you're the only user of some
program, I may just go "Oh, you already fixed your own problem, I
don't need to worry". Or if it turns out that the breakage wasn't
"user flow", but some test-suite regression, I will tend to ignore
those. Test suites are by definition for finding _behavior_, but
behavior can change - it's actual _breakage_ I worry about.

But also notice how none of that "case 1" has any versioning related
to it. Yes, you'll have old versions of the user program before the
bug was fixed, but they will *not* look at kernel versions, and the
new versions certainly shouldn't either.

Case 2: the kernel side. We get that breakage reported, but it turns
out that different versions of the workflow act differently, and maybe
some versions depend on the new behavior, and some depend on the old
behavior.

And yes, this happens.

We have been in the situation where we get a bug-report for something
we changed three years ago, and by now, all modern user space depends
on the new "fixed" behavior, but some embedded user who hadn't
upgraded a kernel in years is unhappy.

It's happened several times, and if it really is "it's been years",
even I will go "tough luck, I guess you're stuck on your old kernel".
Because we can't fix it.

But in a situation like this, where we really want to encourage new
behavior but we have somebody who reported the breakage in a timely
manner (ie it's fairly recent), _and_ it's a system tool like lvm2
that actually gives us a version number, at that point the *kernel*
might decide "this is an old binary, I will now use some versioning to
fall back on old behavior".

Because for the kernel, the compatibility really is a #1 concern, and
if we have to check version numbers or infer compatibility issues some
other way, we'll do it.

But that "different behavior for different application versions" is
something we generally avoid at all costs. We do have a couple of
cases where we deduce what people want based on behavior. But it is
very very rare, and we generally want to avoid it if there is _any_
other model for it.

So even in case 2, we do try to avoid versioning. More often we add a
new flag, and say "hey, if you want the new behavior, use the new flag
to say so". Not versioning, but explicit "I want the new behavior"

Not all system calls take flags, of course.

Linus