Re: [PATCH 2/2] printk: always report lost messages on serial console

From: Sergey Senozhatsky
Date: Tue Jan 10 2017 - 03:49:48 EST


On (01/09/17 17:56), Petr Mladek wrote:
> I have looked at this once again. Please, do you have a real-life
> example where the "printk messages dropped" warning was not printed?

.....


> In other words, the search for a visible message is finished when
> we find one or when the entire log buffer has been checked.
>
> As a consequence, the warning about "printk messages dropped" is
> lost _only_ when there was no visible message in the entire buffer.
>
> Therefore, the warning might get lost only when the log buffer
> was very small or when the console_level filter was too restrictive.

with a slow serial console, even a relatively large logbuf can
easily turn out to be "not quite large enough", especially when
the underlying uart_port ->lock is also contended by concurrent
uart circ_buf flushing.
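
to make the mechanics concrete, here is a toy userspace model of the
console_unlock() flush loop. this is a simplified sketch, not the real
printk.c code; the seq and loglevel names only mimic the real ones. it
shows why the "** N printk messages dropped **" notice goes out only
if a visible record is still left in the buffer:

#include <stdio.h>

#define LOGBUF_RECS     8

struct rec {
        int             level;
        const char      *text;
};

static struct rec logbuf[LOGBUF_RECS];
static unsigned long first_seq, next_seq;  /* valid records: [first_seq, next_seq) */
static unsigned long console_seq;          /* next record the console will print */
static int console_loglevel = 4;           /* only records with level < 4 are "visible" */

static void flush_to_console(void)
{
        unsigned long dropped = 0;

        while (console_seq != next_seq) {
                struct rec *r;

                if (console_seq < first_seq) {
                        /* writers overwrote records we never printed */
                        dropped += first_seq - console_seq;
                        console_seq = first_seq;
                        continue;
                }

                r = &logbuf[console_seq++ % LOGBUF_RECS];
                if (r->level >= console_loglevel)
                        continue;       /* suppressed: the pending notice stays pending */

                if (dropped) {
                        printf("** %lu printk messages dropped ** ", dropped);
                        dropped = 0;
                }
                printf("%s\n", r->text);
        }
        /* if dropped != 0 here, the notice was never printed at all */
}

int main(void)
{
        /* three records were overwritten before we got here, and everything
         * still in the buffer is above the console loglevel */
        first_seq = 3;
        next_seq = 5;
        logbuf[3 % LOGBUF_RECS] = (struct rec){ .level = 6, .text = "suppressed #1" };
        logbuf[4 % LOGBUF_RECS] = (struct rec){ .level = 7, .text = "suppressed #2" };

        flush_to_console();     /* prints nothing: the dropped notice is lost */
        return 0;
}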


> I wonder if this is a case from a real user or from real life.
> In any case, such a scenario most likely comes from a very
> curious setup.
>
> It is possible that your fix is fine. If we lose messages,
> we are screwed anyway. But I still have trouble accepting
> that we would start printing less important messages (that would
> normally be ignored) in a situation where we have trouble
> printing the more important ones. This logic rings warning
> bells in my head, and this is why I suggest a more conservative
> solution and ask so many questions.

once the system is in an "oh, let me drop some of the messages for you"
mood, loglevel filtering is unreliable and in some cases unneeded.
it's so unreliable that I'm even considering disabling it in *in-house*
builds whenever console_unlock() detects that there was no room for all
'yet to be seen' messages.
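
expressed against the toy model above, the idea is roughly the
following. it is just a sketch of the approach, not the actual in-house
change (which would be against console_unlock() itself): once drops
were detected, stop honoring the console loglevel for the rest of the
flush.

 static void flush_to_console(void)
 {
         unsigned long dropped = 0;
+        int had_drops = 0;

         while (console_seq != next_seq) {
                 struct rec *r;

                 if (console_seq < first_seq) {
                         /* writers overwrote records we never printed */
                         dropped += first_seq - console_seq;
+                        had_drops = 1;
                         console_seq = first_seq;
                         continue;
                 }

                 r = &logbuf[console_seq++ % LOGBUF_RECS];
-                if (r->level >= console_loglevel)
+                if (!had_drops && r->level >= console_loglevel)
                         continue;       /* suppressed: the pending notice stays pending */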


those are other messages, with 'visible' loglevel, with 'suppressed'
loglevel, or with both, that caused the logbuf overflow.

now, if the loss of messages was caused by:

a) a flood of suppressed loglevel messages
then printing at least some of those messages makes *a lot* of sense.
it really does. those messages will point at the root cause of the
problem, and there is no other way of finding that out when all you
have is a serial log of visible messages. that 'suppressed messages'
part is the key: whatever the loglevel was, the fact that we overflowed
the logbuf with those messages is already bad.



b) a flood of visible loglevel messages
then maybe those messages are not so important; there's a whole logbuf
of them. in my experience, it is quite hard to overflow the logbuf with
really important, unique, sensible messages of 'visible' loglevel while
loglevel filtering is active. in most of the cases I've seen, those
"important" messages are the same repeating messages. OOM is the most
notable example: it can flood the logbuf with pretty much the same
backtraces and Mem-Info printouts. for instance, a CPU stuck in a
GFP_REPEAT allocation under OOM, or memory allocation errors in the
swap-out path under OOM:

...
[3282.5292] Node 0, zone XXXX, type Isolate 0 0 0 0 0 0 0 0 0 0 0
[3282.5292] Node 0, zone XXXX, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
** 1389 printk messages dropped ** [3282.5430] Write-error on swap-device (254:0:1471944)
** 354 printk messages dropped ** [3282.5526] Write-error on swap-device (254:0:1473240)
** 254 printk messages dropped ** [3282.5598] Write-error on swap-device (254:0:1474472)
[3282.5598] SLUB: Unable to allocate memory on xxxxxxx
blah-blah-blah
...
[3282.5606] Write-error on swap-device (254:0:1474480)
[3282.5607] Write-error on swap-device (254:0:1474496)
[3282.5607] Write-error on swap-device (254:0:1474504)
[3282.5608] Write-error on swap-device (254:0:1474504)
[3282.5610] Write-error on swap-device (254:0:1474528)
[3282.5611] Write-error on swap-device (254:0:1474544)
** 519 printk messages dropped ** [3282.5611] Write-error on swap-device (254:0:1477152)
** 15 printk messages dropped ** [3282.5616] Write-error on swap-device (254:0:1477208)
...
[3283.0532] zram: Error allocating memory for compressed page: 200385, size=2824
...

those messages preempt each other and preempt other, *maybe actually
important*, 'visible' loglevel messages. but there is no way to tell
whether the message vprintk_emit() is about to overwrite is actually
important. the outcome is that the repeating 'important' messages are
_mostly_ as valuable as the suppressed ones. IOW, not every important
message is really important under an 'out of logbuf' condition. and
vice versa.
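
to complete the toy model from above: the writer side frees space
blindly. again, this is a sketch, not the real
log_store()/log_make_free_space() code:

/* when a new record does not fit, the oldest records are evicted no
 * matter what their loglevel was; the writer has no idea whether the
 * record it overwrites is "important" or not */
static void store_record(int level, const char *text)
{
        while (next_seq - first_seq == LOGBUF_RECS)     /* buffer full */
                first_seq++;                            /* evict the oldest record */

        logbuf[next_seq % LOGBUF_RECS] = (struct rec){ .level = level, .text = text };
        next_seq++;
}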



c) both a) and b)
then printing messages of both 'suppressed' and 'visible' loglevels
makes sense here.


considering a) and b):

once the system is out of logbuf space, it is impossible to clearly
distinguish between 'important' and 'not so important' messages. all
we know in console_unlock(), when we pick up the next_idx message, is
that there is an abnormal/unusual/weird/unexpected/sick/whatever amount
of messages - 'suppressed' or 'visible' or both. and that's the problem.

"print as much of those messages as possible, don't hide the problem" - that's
why I'm thinking of no loglevel filtering in 'ouf of logbuf space' case.

-ss