Re: [RFC PATCH v4 9/9] printk: use a new ringbuffer implementation
From: Linus Torvalds
Date: Thu Aug 08 2019 - 20:21:32 EST
On Thu, Aug 8, 2019 at 4:45 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> Could we possibly have a magic value in some location that if it is
> set, we know right away that the buffer here has data from the last
> reboot, and we read it out into a safe location before we start using
> it again?

Right now I don't know how reliable RAM ends up being.
But with a small enough buffer I guess we could just do it
unconditionally and then let some debug tool in user space try to make
sense of it later.
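
To make the two options concrete, here is a rough userspace sketch of the "magic value" check next to the "just copy it unconditionally" variant. It is not from the patch series: the struct layout, the magic constant, the buffer size and the function name are all made up for illustration, and a real version would have to deal with where the buffer lives and how early in boot it gets read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout and values; none of these names come from the patch. */
#define PREV_BOOT_MAGIC     0x6c6f6762u  /* arbitrary illustrative constant */
#define PREV_BOOT_DATA_SIZE 4096u

struct prev_boot_buf {
	uint32_t magic;                  /* set once the buffer is in use */
	uint32_t len;                    /* bytes of data written so far */
	char data[PREV_BOOT_DATA_SIZE];
};

/*
 * Copy whatever the previous boot left behind into 'save', then re-arm
 * the buffer for the current boot. Returns the number of bytes saved.
 *
 * require_magic != 0: trust the header, save only if the magic matches.
 * require_magic == 0: the header may have decayed along with everything
 *                     else, so ignore it and save the whole data area.
 */
size_t prev_boot_salvage(struct prev_boot_buf *buf, char *save,
			 size_t save_size, int require_magic)
{
	size_t n = 0;

	assert(buf != NULL && save != NULL);

	if (!require_magic) {
		n = PREV_BOOT_DATA_SIZE;
	} else if (buf->magic == PREV_BOOT_MAGIC) {
		n = buf->len;
		if (n > PREV_BOOT_DATA_SIZE)     /* len itself may be garbage */
			n = PREV_BOOT_DATA_SIZE;
	}
	if (n > save_size)
		n = save_size;
	memcpy(save, buf->data, n);

	/* Start fresh for this boot. */
	buf->magic = PREV_BOOT_MAGIC;
	buf->len = 0;
	return n;
}
```

The unconditional mode is the one that survives a damaged header: a single flipped bit in the magic makes the checked mode save nothing at all, which is exactly the case where the data would have been most interesting.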
More background for what I'm looking for: my hope for this is that we
can finally get the case of "undebuggable laptop hangs" logged with
something like this.
But laptops don't have reset buttons. They have "press the power
button for ten seconds, power turns off. Press it again, and power
comes on" reset sequences.
So DRAM power off for maybe 5 seconds? I've tried to find papers on
how well DRAM retention works (not very common: usually they happen
because you have some security researcher that wants to move a DIMM
and read it on another system, and some of them talk about using
freezing techniques to increase retention), and from what I've seen,
retention *should* be possible even for that kind of timeframe,
despite the usual "DRAM wants 64ms refresh". As in "maybe 90% of bits
might still be legible". And newer DRAM with smaller capacitors
apparently isn't a downside, because they have much less leakage too.
But some of those papers were for old DRAM. Maybe somebody knows
better. I don't have any real data myself, because my cold-boot tests
all seemed to show the BIOS reinitializing it to garbage. For all I
know, the DRAM training will guarantee garbage and it's all a pipe
dream.
Anyway, some wild handwaving of "maybe we can get 90% bit
retention" means that a human can read garbled data and guess
(particularly if you can see "ok, it's an oops, I know what the
overall pattern is, I can ignore a lot of bits that don't matter").
But I wouldn't want to necessarily automate it all that much.
But the retention pattern might be very random too, and honestly, I'm
mostly guessing to begin with (if that wasn't clear already ;).
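
For a feel of what "90% bit retention" would look like, here is a toy simulation. Everything in it is an assumption: it flips each bit independently with a fixed probability, which is almost certainly not how real DRAM decays (cells tend to fall toward a ground state, and neighbouring cells probably fail together), and the function names and the little LCG are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tiny LCG so runs are reproducible; this is not a model of real DRAM. */
static uint32_t lcg_next(uint32_t *state)
{
	*state = *state * 1664525u + 1013904223u;
	return *state >> 16;
}

/*
 * Flip each bit of buf independently with probability percent_lost/100.
 * Returns the number of bits that were flipped.
 */
size_t decay_bits(unsigned char *buf, size_t len, unsigned percent_lost,
		  uint32_t *seed)
{
	size_t flipped = 0;

	assert(percent_lost <= 100);

	for (size_t i = 0; i < len; i++) {
		for (int bit = 0; bit < 8; bit++) {
			if (lcg_next(seed) % 100 < percent_lost) {
				buf[i] ^= (unsigned char)(1u << bit);
				flipped++;
			}
		}
	}
	return flipped;
}

/* Count the bits that differ between two buffers of the same length. */
size_t bit_diff(const unsigned char *a, const unsigned char *b, size_t len)
{
	size_t diff = 0;

	for (size_t i = 0; i < len; i++) {
		unsigned char x = a[i] ^ b[i];

		while (x) {
			diff += x & 1u;
			x >>= 1;
		}
	}
	return diff;
}
```

Under this independent-flip assumption, a byte survives intact only if all eight of its bits do, so at 90% bit retention that is 0.9^8, or roughly 43% of characters. More than half the text would be damaged, which is readable for a human who knows what an oops looks like, and a poor fit for automatic parsing.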
But the "random user didn't have any other choice but to just
powercycle the machine" is one of the nastiest debug problems we have
right now, and if we were to get "next boot automatically sends a
report to the distro or whatever after a non-clean shutdown" that
might be *very* useful.
Or it might not be. Right now we simply don't have that kind of data
at all. Sure, we have a ton of virtual machines and servers that have
"reliable IO" (either thanks to the VM or thanks to serial lines etc),
but it's literally the "normal random consumer who runs
Fedora/Ubuntu/SUSE workstation" that currently basically has no data
at all if it's the kind of crash that doesn't get you a saved log.
And the people running VMs and servers with serial lines are simply
not doing the same things as real people on real hardware are, so I
don't think it's an argument that "hey, we get reports from those nice
datacenter guys".
We likely don't even have any idea of how common it is, because while
I know "hangs on resume with no logs" used to be a fairly common
problem case, it by definition never gets _logged_. Maybe people
complain, but more likely they just curse and reboot.
And no, I don't think this is actually common at all. But the problem
with those unloggable problems is that _if_ they happen - even if it's
very very rare indeed - they are really nasty to debug.
They are nasty to debug when they happen on a developer machine (I
should know, I've definitely had them), but when they happen in the
wild they are basically "user just rebooted the machine". End of
story, and no stats or anything like that.
Linus