Re: [RFC/PATCH] Documentation of kernel messages
From: holzheu
Date: Wed Jun 13 2007 - 14:15:03 EST
Hi Valdis,
On Wed, 2007-06-13 at 12:50 -0400, Valdis.Kletnieks@xxxxxx wrote:
> On Wed, 13 Jun 2007 17:06:57 +0200, holzheu said:
> > They are used to that, because all other operating systems on that
> > platform like z/OS, z/VM or z/VSE have message catalogs with detailed
> > descriptions about the semantics of the messages.
>
> 25 years ago, I did OS/MVT and OS/VS1 for a living, so I know *all* about
> the infamous "What does IEF507E mean again?"...
:-)
> > In general we think, that also for Linux it is a good thing to have
> > documentation for the most important kernel/driver messages. Even
> > kernel hackers not always are aware of the meaning of kernel messages
> > for components, which they don't know in detail. Most of the messages
> > are self explaining but sometimes you get something like "Clocksource
> > tsc unstable (delta = 7304132729 ns)" and you wonder if your system is
> > going to explode.
>
> This is probably best addressed by cleaning up the actual messages so they're
> a bit more informative.
Of course that would be good, too.
But I think we sometimes face a dilemma: we want to keep the printks
short, but we also want to provide as much information as possible.
If the information is too big for the printk itself, because you would
need 10 lines to explain what happened, wouldn't it be good to have a
place to put that information?
> > New macros KMSG_ERR(), KMSG_WARN(), etc. are defined, which have to be
> > used in printk. These macros have as parameter the message number and
> > are using a per c-file defined macro KMSG_COMPONENT.
>
> Gaak. *NO*.
>
> The *only* reason that the MVS and VM message catalogs worked at all is
> because each component had a message repository that went across *all* the
> source files - the instant you saw IEFnnns, you knew that IEF covered the
> job scheduler, nnn was a *unique* number, and s was a Severe/Warning/Info
> flag. IGG was always data management, and so on. This breaks horribly if
> you have 2 C files that define subtly different KMSG_COMPONENT values (or
> even worse, 2 or more duplicates).
>
> [/usr/src/linux-2.6.22-rc4-mm2] find . -name '*.c' | wc -l
> 9959
> [/usr/src/linux-2.6.22-rc4-mm2] find . -name '*.h' | wc -l
> 9933
> [/usr/src/linux-2.6.22-rc4-mm2] find . -type d | wc -l
> 1736
>
> You plan to maintain message uniqueness how?
> [/usr/src/linux-2.6.22-rc4-mm2]1 find . -name '*.c' | sed -r 's?.*/([^/]*)?\1?' | sort | uniq -c | sort -nr | head
> 105 setup.c
> 90 irq.c
> 66 time.c
> 58 init.c
> 50 inode.c
> 39 io.c
> 38 pci.c
> 37 file.c
> 32 signal.c
> 32 ptrace.c
>
> Looks like you're going to have to embed a lot of the path in that KMSG_COMPONENT
> to make it unique - and you want to keep that message under 80 or so chars total.
>
For each kernel component, like a device driver, we could have one
KMSG_COMPONENT (e.g. "acpi", "pci", etc). Within that component the
message IDs have to be unique. A tool could check whether the message
IDs are unique within the kernel sources.
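To illustrate the kind of check I mean, here is a minimal sketch. It is not
part of the patch: the struct kmsg_id type and the kmsg_find_duplicate()
function are hypothetical names, and it assumes some earlier step has
already extracted the (component, id) pairs from the sources.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one documented message: component name plus id. */
struct kmsg_id {
	const char *component;
	int id;
};

/*
 * Return the index of the first entry that repeats an earlier
 * (component, id) pair, or -1 if all pairs are unique.
 * Quadratic, which is fine for a sketch; a real tool would sort first.
 */
int kmsg_find_duplicate(const struct kmsg_id *msgs, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		for (size_t j = 0; j < i; j++) {
			if (msgs[i].id == msgs[j].id &&
			    strcmp(msgs[i].component, msgs[j].component) == 0)
				return (int)i;
		}
	}
	return -1;
}
```

The same id may appear in two different components ("pci" 1 and "acpi" 1);
only a repeated pair within one component counts as a clash.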
We could use something like a Documentation/kmsg-components file with a
list of all component names using KMSG printks.
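With one KMSG_COMPONENT per component, the macros could look roughly like
the following. This is a simplified user-space sketch, not the code from the
patch: the "component.id: " prefix layout, the "msgtest" component value and
the format_offline_msg() helper are assumptions for illustration, and
snprintf() stands in for printk().

```c
#include <stdio.h>

/* Assumed: every .c file of one component defines the same name. */
#define KMSG_COMPONENT "msgtest"

/* User-space stand-ins for the kernel log level prefixes. */
#define KERN_ERR	"<3>"
#define KERN_WARNING	"<4>"

/*
 * Build the "<level>component.id: " prefix at compile time by
 * string literal concatenation; #id turns the number into a string.
 */
#define KMSG_ERR(id)	KERN_ERR KMSG_COMPONENT "." #id ": "
#define KMSG_WARN(id)	KERN_WARNING KMSG_COMPONENT "." #id ": "

/* Hypothetical helper: format message 1 the way printk() would log it. */
int format_offline_msg(char *buf, size_t len, unsigned int devno)
{
	return snprintf(buf, len, KMSG_ERR(1) "device %x not online\n", devno);
}
```

Since the prefix is only the component name and the number, the message
text itself stays short, and the component list file is the one place that
has to guarantee that the names do not clash.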
> > /**
> > * message
> > * @0: device number of device.
> > *
> > * Description:
> > * An operation has been performed on the msgtest device, but the
> > * device has not been set online. Therefore the operation failed
>
> If you don't understand 'Device /dev/foo offline', this description
> doesn't help any. And that's true for *most* of the kernel messages
> already - if you don't understand the message already, a paragraph
> explanation isn't going to help much. Consider the average OOPS
> message, which contains stuff like 'EIP=0x..'. Telling the user that
> EIP means Execution Instruction Pointer isn't likely to help - if they
> knew what the pointer *did*, they'd probably already know EIP.
I agree with you that most kernel messages do not need further
documentation. But I am convinced that there are plenty of printks
where additional documentation would be helpful.
> > *
> > * User Response:
> > * Operator should set device online.
> > * Issue "chccwdev -e <device number>".
>
> And this is where the weakness of this scheme *really* hits. I've actually run
> into cases where an operator followed the listed "Operator Response" for a
> "device offline", and issued a 'VARY 0C0,ONLINE'. And then we got a flood of
> I/O errors because the previous shift downed the device because it was having
> issues. The response the operator *should* have done is "assign a different
> tape drive, like, oh maybe the operational ones at 0C1 through 0C4"...
I can understand your frustration here. But that's a general problem
with documentation: you can never foresee everything.
Should that mean that we shouldn't document anything at all?
Michael