Re: Stable/devel policy - was Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3

From: Ingo Molnar
Date: Sun Jun 11 2006 - 03:32:06 EST



* Linus Torvalds <torvalds@xxxxxxxx> wrote:

> And even more interestingly (at least to me), the question might
> become one of "how does that affect the tools and build and
> configuration infrastructure", and just the general flow of
> development.
>
> I don't think one or two filesystems (and a few drivers) splitting is
> anything new, but if this ends up becoming _more_ common, maybe that
> implies a new model entirely..

at least for core kernel stuff, it's hard to split things in any
manageable way (as you mentioned it as well) - so higher flux is
inevitable.

So what i've been focusing on more in the past year or so is to enable
the core kernel to take more development flux, via kernel features.

Instead of adding more features to the kernel, i'm quite interested in
seeing more technologies that make a higher development flux safer: to
make the kernel more debuggable, to make bugs more reportable for users,
to make the effects of bugs less harmful, and to make the kernel itself
notice more bugs by itself.

To be able to handle a higher development flux in core code, i think we
need the following policies wrt. core kernel changes:

- More code consolidation between architectures and subsystems.

Core kernel changes impact "non-mainstream" architectures the most -
while some of our best technologies stem from non-mainstream
technologies. So it's a net loss to only concentrate on the
mainstream, because developer and technology distribution does not
follow user distribution.

The generic irq subsystem, spinlock and semaphore/mutex consolidation
are all efforts in this direction. I consider the Generic Time Of Day
(GTOD) effort a similarly important item, for the same reasons. There
are other good examples too, for example klibc is a good step towards
a more consolidated boot process. The Xen subarch work triggers
consolidation too - etc. Andrew's policy of "you must not break _any_
architecture in -mm" is very important too.

And we should do consolidation even in cases where there's some
minimal runtime cost. Being able to handle higher flux is more
important than getting the last cycle out of the system. This does
not mean we should reject patches that do get those last cycles, this
only means we should not reject consolidation patches on the grounds
that they _lose_ a few cycles. I don't think this is a common problem
for consolidation projects right now - but it could happen in the
future.

- Even more cleanups.

We have always preferred cleanups, but it now becomes critical: i strongly
believe that cleanups must take precedence over feature work. [with a
few rare and temporary exceptions perhaps, like hardware-enablement
or really critical features.] It's much easier to spot bugs in clean
code, plus it's much easier for automated correctness validators to
find bugs in clean code.

(My own examples here include spinlock-init cleanups, which directly
enabled things like the lock validator. But pure code cleanups apply
too.)

- More automated correctness-checking tools and kernel features.

While the preferred mode of avoiding bugs should be a clean
design and clean code, higher flux introduces higher noise and bugs
are inevitable. So the importance of automated tools (both static and
dynamic analysis) has increased.

Sparse annotations are one good example. My own examples here are the
lock validator, the mutex debugging code, and the consolidated
spinlock debugging code. Some of these are direct feature-enablers:
for example the smp_processor_id() debugging code directly enabled a
safe and painless migration to PREEMPT_BKL. One nice feature in the
works that can find hard-to-spot bugs is kmemleak.

- Coding style police!

With higher development flux it is becoming even more important for
kernel developers to review other developers' work. But that is very
hard if the coding style varies too much. This is a fundamentally
human problem, and the only sane solution is brutal: the _strict_
Linus coding style must be used in all high-flux subsystems.

- More debuggability, reportability.

In this area we still suck quite a bit, and this affects userspace
too. Currently we have nothing equivalent to things like Dr Watson:
in Linux most of the info about the first userspace crash almost
always gets lost! (and even afterwards, once debug packages are
downloaded and the app is run in gdb, it's still too painful for the
user, so we lose lots of feedback.)

Some of the GUIs try to do something about this and automate crash
reporting, but it doesn't cover most of the app crashes and userspace
clearly needs kernel help, because ptrace is too inflexible for this
purpose. (help is on the way though, there's a next-gen ptrace
project that solves these problems very cleanly.)

There are a number of important projects going on in this area - for
example the dwarf unwinder for x86_64 to improve the quality of
kernel oopses, and kgdb (or bits of NLKD) if it gets clean enough.

my own impression is that things are going in the right direction, but
that there should be more awareness of these principles. I think if we
add a couple more key technologies then we can take the higher kernel
development flux just fine, without compromising quality. Even though
Linux has lots of developers, we should be more economical with that
development power and should waste less of that on unnecessarily complex
debugging tasks.

I do consider the forking of a subsystem the "easy way out" - the hard
and more correct approach is i think to turn every drastic rewrite into
small manageable steps. That's much easier said than done, and it's
sometimes 10 times the work but it's a lot safer - and the end result is
often wildly different (and a lot cleaner!) from what one would do via a
drastic rewrite. A dumb 'cp -a' copying of a subsystem will preserve
most of the legacies and architectural inefficiencies. Even an
intelligent drastic rewrite preserves most of the legacies - there's
just so much change users can take at once, and _eventually_ a new
subsystem has to be exposed to real users - at which point the
compatibility constraints apply again. I have yet to see a single case
of hard physical necessity to throw away an old subsystem due to
legacies. I think the prime example to follow is how Al Viro works -
he's been maintaining the VFS for many years without having to
duplicate functionality, without breaking the world, but he still
managed to turn the VFS upside down, inside out, in small, manageable
steps. It _is_ possible in almost every case, for all but the most
spaghetti pieces of code.

Ingo