Re: [PATCH v5 1/4] usb: dbc: early driver for xhci debug capability

From: Ingo Molnar
Date: Fri Jan 27 2017 - 01:51:16 EST

* Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Thu, Jan 26, 2017 at 05:01:05PM +0100, Ingo Molnar wrote:
> > > >
> > > > I.e. if there's any polling component then it would be reasonable to add an error
> > > > component: poll the status and if it goes 'disconnected' then disable early-printk
> > > > altogether in this case and trigger an emergency printk() so that there's chance
> > > > that the user notices [if the system does not misbehave otherwise].
> > >
> > > That'll be fun when printk() == early_printk() :-)
> >
> > My suggestion would be to just print into the printk buffer directly in this case,
> > without console output - the developer will notice it in 'dmesg'.
> When you map printk() onto early_printk() dmesg will be empty, there
> will be nothing there, and therefore no reason what so ever to look
> there.

Unless you want a third layer of a console driver putting the debug message into
dmesg isn't all that bad of a solution.

Let's admit it: something like USB that involves external pieces of hardware
_does_ have failure modes, and troubleshooting messages instead of indefinite
hangs are obviously more robust.

> I certainly don't ever look there.

You'll have to teach yourself that if the box boots up fine but there are no
messages whatsoever from the early-printk console that you'll need to look at
dmesg output or the syslog for more clues.

This should not be a common occurrance in any case - but when it happens it's very
useful to have diagnostic messages. I don't think this is a controversial point in
any fashion.

> Note that the printk buffer itself is a major part of why printk sucks donkey
> balls. Not to mention that you really cannot have an early_printk()
> implementation that depends on printk().

There are several easy solutions to do that, my favorite would be to put it into
the printk buffer totally unlocked. When your early-printk is active it's unused
and in the end it's a known data structure after all:

* Just zap whatever's in the printk buffer and put your emergency message into
* it, prominently. No locking, no worries - don't generate emergency messages
* while printk is active and syslogd is running - this facility is a poor man's
* fallback printk() when early-printk has taken over all kernel logging:
void printk_emergency_puts(const char *str)
struct printk_log *msg, *msg_end;

msg = log_buf;

memset(msg, 0, sizeof(*msg));
msg.text_len = strlen(str);

msg_end = (void *)msg + sizeof(*msg) + msg->text_len;

/* Zero ->len denotes end of log buffer: */
memset(msg_end, 0, sizeof(*msg_end));

snprintf(ptr, str);

printk_emergency_puts"earlyprintk emergency: Hardware timed out, shutting down. Fix your debug cable?\n");

(Or so - totally untested, some details might be wrong.)

But yes, I agree with your wider point, I just looked at kernel/printk/printk.c
and puked. Why did we merge that crappy piece of binary logging code, when we
already have two other binary logging facilities in the kernel already, both of
them better and cleaner than this?? Why did we mess up our nicely readable,
simple, reliable ASCII log buffer printk code? :-(

> > > I myself wouldn't mind the system getting stuck until the link is
> > > re-established. My own damn fault for taking that cable out etc.
> >
> > That's fine too, although beyond the obvious "yanked the cable without
> > realizing it" case there are corner cases where usability is increased
> > massively if the kernel is more proactive about error conditions: for example
> > there are sub-standard USB cables and there are too long USB pathways from
> > overloaded USB hubs which can result in intermittent behavior, etc.
> >
> > A clear diagnostic message in 'dmesg' that the USB host controller is unhappy
> > about the USB-debug dongle device is a _lot_ more useful when troubleshooting
> > such problems than the occasional weird, non-deterministic hang...
> Sure, I'm just not sure what or where makes sense.
> If your serial cable is bad you notice because you don't receive the right
> amount of characters and or stuff gets mangled. You chuck the cable and get a
> new one.
> I think the most important part is re-establishing the link when the cable gets
> re-inserted. Maybe we should just drop all characters written when there's no
> link and leave it at that, same as serial.

That would be fine with me too - but even in this case there should be a stat
counter somewhere (in /proc or /debug) that counts the number of characters
dropped. Maybe that file could also display an emergency string - avoiding the
interaction with the printk buffer.

We can do better than passive-aggressive logging behavior...