Re: [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports

From: Alexander Potapenko
Date: Mon Jan 18 2021 - 08:09:28 EST

On Mon, Jan 18, 2021 at 12:38 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
Thanks for your input! Some responses below.

> On Fri 2021-01-15 14:03:33, Alexander Potapenko wrote:
> > With the introduction of various production error-detection tools, such as
> > MTE-based KASAN and KFENCE, the need arises to efficiently notify the
> > userspace OS components about kernel errors. Currently, no facility exists
> > to notify userspace about a kernel error from such bug-detection tools.
> > The problem is obviously not restricted to the above bug detection tools,
> > and applies to any error reporting mechanism that does not panic the
> > kernel; this series, however, will only add support for KASAN and KFENCE
> > reporting.
> >
> > +++ b/lib/error_report_notify.c
> > +/*
> > + * Trace hook for the error_report_start event. In an unlikely case of another
> > + * task already printing a report bail out, otherwise save the current pid
> > + * together with in_task() return value.
> This is not reliable. Some events might get lost.

> > +
> > + /* Pairs with acquire in last_report_show(). */
> > + atomic_inc_return_release(&num_reports);
> > + schedule_delayed_work(&reporting_done, 0);
> Why delayed work when it gets queued immediately?

Because error reports may be sent from a place where waiting is
forbidden (in the case of KFENCE that is the page fault handler).
A delayed work ensures the notification occurs in a wait-safe context.

> > +static void skip_extra_info(const char **buf, size_t *len)
> > +{
> > + int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) +
> The timestamp can be disabled also at runtime by
> /sys/module/printk/parameters/time

Guess that would result in disabling it for the whole kernel, which is
too aggressive.

> IMHO, removing the timestamp is a bad idea. It will complicate
> matching the message with other events.

Given that we are moving away from sysfs, the report size limit can be
increased, therefore we don't have to remove the timestamp to save
space anymore.

> I am afraid that some reports would get shrunken anyway.
Do you have any particular subsystem in mind? We could probably
increase the report buffer to, say, 16K, to avoid this.

> I suggest to use some lockless ringbuffer to store these
> messages. Either ftrace or printk one.
> But the biggest problem would be that console might be called
> asynchronously from another process. The messages might get
> lost in the meantime.

Do you mean that one process may call printk(), but the console probe
for that message will be executed in the context of another process?
If so, that would be a problem. Could it be possible to extend the
console trace event to pass an extra caller id?

> There are many other more reliable solutions:
> Either add hook into vprintk_store() and make another copy of the
> message into your buffer. But there will still be the problem
> that you could not store more reports in parallel. So, it won't
> be reliable anyway.

At least for now, we do not expect many reports to come in parallel:
this feature is meant for collecting the long tail of memory
corruptions from production devices.

> Or add a hook into /dev/kmsg interface. It allows to read any
> stored message immediately and quickly. Well, you would need
> to have your own reader of this interface.

This probably could be done, but isn't the console tracepoint designed
exactly for this?

> Or add your own reader of the main printk log.
> Or give up on having your own buffers. Instead just inform userspace
> about a new report via the pooled interface. And get the messages
> from the main log, e.g. via /dev/kmsg.

Yes, that could be an option, although the need to search for
tool-specific strings in kmsg and figure out where the report starts
and ends still persists.
At the end of the day, this sounds like the best approach for now,
that we could later extend to store the logs.

> I agree with Andrew that this might be _very_ dangerous feature.
> It is another way how to get messages from kernel. It looks like
> a hack how to get certain messages from the main kernel log
> into a separate log. This might end up with implementing
> yet another printk() or trace_printk().

Initially our intent was to refactor KFENCE error reporting so that
the whole report is stored in a buffer and can be exposed to userspace
via a VFS entry.
Then it turned out we'd need something similar for KASAN, which
would've also resulted in a major rewrite.
So we thought a much more generic solution would be to let the tools
"tee" their output to a specific file on demand.
Yes, sounds like duplicating the kernel log :)
On the other hand, we never really planned to extend printk or
duplicate parts of it, so unless trace_console() is a good fit we'd
probably refrain from storing the logs for now.

> I suggest to integrate this better with printk():
> + We already have printk_context() that allows to handle
> messages a special way.
> + Just store extra flag into the main log buffer for these
> messages.
> + Allow to read the messages via some interface, either
> /dev/kmsg or a new one. We already store PID and context
> info into the main log buffer. It will allow to put
> relevant lines together.
> + Anyway, I strongly suggest to avoid your own buffer. It looks
> like an overhead. It it would get more complicated when
> there are more users, ...

Having the special flag for the error reports is handy, but given the
existence of CONFIG_PRINTK_CALLER may not be worth the effort, at
least for now.

> I still have to think about it. This feature heavily depends on
> printk(). It makes sense to integrate it there. But we also
> need to keep it simple because printk() is already pretty
> complicated.
> Best Regards,
> Petr

Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg