Re: 2.6.16-rc4: known regressions

From: Linus Torvalds
Date: Wed Feb 22 2006 - 12:10:09 EST




On Wed, 22 Feb 2006, David Zeuthen wrote:
>
> Oh, you know, I don't think that's exactly how it works; HAL is pretty
> much at the mercy of what changes goes into the kernel. And, trust me,
> the changes we need to cope with from your so-called stable API are not
> so nice.

Why do you "cope"?

Start complaining. If kernel changes screw up something, COMPLAIN. Loudly.
They shouldn't.

> It also makes me release note that newer HAL releases require newer
> kernel and udev releases and that's alright.

It's _somewhat_ ok to have a well-defined one-way dependency. It's sad,
but inevitable sometimes.

For example, the kernel does have a dependency on the compiler used to
compile it. We try to avoid it as far as possible, but we've slowly been
updating it, first from 1.40 to 2.75 to 2.9x and now to 3.1. But the
kernel obviously shouldn't have any other run-time dependencies, because
everything else is "on top of" the kernel.

What is NOT ok is to have a two-way dependency. If user-space HAL code
depends on a new kernel, that's ok, although I suspect users would hope
that it wouldn't be "kernel of the week", but more a "kernel of the last
few months" thing.

But if you have a TWO-WAY dependency, you're screwed. That means that you
have to upgrade in lock-step, and that just IS NOT ACCEPTABLE. It's
horrible for the user, but even more importantly, it's horrible for
developers, because it means that you can't say "a bug happened" and do
things like try to narrow it down with bisection or similar.

> For just one example of API breaking see
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175998

So the kernel obviously shouldn't be just randomly changing the type
numbers around.

The _real_ bug seems to be that some people think it is OK to do this kind
of user-visible changes, without even considering the downstream, or
indeed, without even telling anybody else (like Andrew or me) about it.

> This breaks stuff for end users in a stable distribution. Not good.

Indeed. Not good at all.

And yes, some of it may be just HAL being a fragile mess, and some of it
may end up being just user-level code that must be made to be more robust
("I see a new type I don't understand" "Ok, assume a lowest common
denominator, and stop whining about it").

But a lot of it is definitely some kernel people being _waayy_ too
cavalier about userspace-visible changes.

> I think maintaining a stable syscall interface makes sense. Didn't you
> once say that only the syscall interface was supposed to be stable? Or
> was that Greg KH? I can't remember...

It's _not_ just system calls. It's any user-visible stuff. That very much
includes /proc, /sys, and any "kernel pipes" aka netlink etc bytestreams.

What is not stable is the _internal_ data structures. We break external
modules, and we sometimes break even in-kernel drivers etc with abandon,
if that is what it takes to fix something or make it prettier.

So fcntl and ioctl numbers etc are _inviolate_, because they are part of
the system interface. As is /proc and /sys. We don't change them just
because it's "convenient" to change them in the kernel.

If /sys needs an extended type to describe the command set of a device, we
do NOT just change an existing attribute in /sys.

> And I also think that breaking things like sysfs can be alright as long
> as you coordinate it with major users of it, e.g. udev and HAL.

The major users are USERS. Not developers. It doesn't help to "coordinate"
things, when what gets screwed is the end-user who no longer can upgrade
his kernel without worrying that something might break.

THIS IS WHY WE MUST MAKE THE KERNEL INTERFACES STABLE!

If users cannot upgrade their kernels safely, we will have two totally
unacceptable end results:

- users won't upgrade. They don't dare to, because it's too painful, and
they don't understand HAL or hotplug, or whatever.

If a developer cannot see that this is unacceptable, then that
developer is a nincompoop and needs to be educated.

- users upgrade, and generate bug reports and waste other developers time
because those other developers didn't realize that the HAL cabal had
decided that that breakage was "ok".

Or worse, they don't generate the bug reports, and then six months from
now, when they test again, and it's still broken, they generate a
really bad one ("it doesn't work") when everybody - including the HAL
cabal - has forgotten what it was all about.

Again, if a developer cannot see that this is unacceptable, then that
developer is not playing along, and needs to have his mental compass
re-oriented.

The fact is, regressions are about 10x more costly than fixing old bugs.
They cause problems downstream that just waste everybodys time. It's a
_hell_ of a lot more efficient to spend extra time to keep old interfaces
stable than it is to cause regressions.

> One day perhaps sysfs will be "just right" and you can mark it as being
> stable. I just don't think we're there yet. And I see no reason
> whatsoever to paint things as black and white as you do.

Nothing will _ever_ be "just right", and this has been going on too long.
We had better get our act together.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/