Re: [RFC][PATCH] printk: Fixup the nmi printk mess

From: Petr Mladek
Date: Wed Jun 10 2015 - 10:57:51 EST


On Wed 2015-06-10 16:31:55, Petr Mladek wrote:
> On Wed 2015-06-10 14:55:09, Peter Zijlstra wrote:
> > Hi,
> >
> > I just stumbled upon:
> >
> > a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all CPUs")
> >
> > Which is not dissimilar to what I've proposed in the past. Except it's
> > squirreled away in some far and dark corner of one arch.
> >
> > Lets fix that.
>
> It is a very interesting approach. But I see two possible races, see below.
>
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index dfd59d6bc6f0..a477e0766d2e 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -67,10 +67,12 @@ extern void irq_exit(void);
> >  		preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
> >  		rcu_nmi_enter();				\
> >  		trace_hardirq_enter();				\
> > +		printk_nmi_enter();				\
> >  	} while (0)
> >
> > #define nmi_exit() \
> >  	do {							\
> > +		printk_nmi_exit();				\
> >  		trace_hardirq_exit();				\
> >  		rcu_nmi_exit();					\
> >  		BUG_ON(!in_nmi());				\
> > diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> > index caebf2a758dc..04c68b9f56f8 100644
> > --- a/include/linux/percpu.h
> > +++ b/include/linux/percpu.h
> > @@ -135,7 +135,4 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> >  	(typeof(type) __percpu *)__alloc_percpu(sizeof(type),	\
> >  						__alignof__(type))
> >
> > -/* To avoid include hell, as printk can not declare this, we declare it here */
> > -DECLARE_PER_CPU(printk_func_t, printk_func);
> > -
> > #endif /* __LINUX_PERCPU_H */
> > diff --git a/init/main.c b/init/main.c
> > index 2115055faeac..4a28accaaa98 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -586,6 +586,7 @@ asmlinkage __visible void __init start_kernel(void)
> >  	timekeeping_init();
> >  	time_init();
> >  	sched_clock_postinit();
> > +	printk_init();
> >  	perf_event_init();
> >  	profile_init();
> >  	call_function_init();
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index c099b082cd02..0149c41be95b 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -1821,13 +1821,125 @@ int vprintk_default(const char *fmt, va_list args)
> > }
> > EXPORT_SYMBOL_GPL(vprintk_default);
> >
> > +#ifdef CONFIG_PRINTK_NMI
> > +
> > +typedef int(*printk_func_t)(const char *fmt, va_list args);
> >  /*
> >   * This allows printk to be diverted to another function per cpu.
> >   * This is useful for calling printk functions from within NMI
> >   * without worrying about race conditions that can lock up the
> >   * box.
> >   */
> > -DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> > +static DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> > +
> > +#include <linux/seq_buf.h>
> > +
> > +struct nmi_seq_buf {
> > +	struct seq_buf		seq;
> > +	struct irq_work		work;
> > +	unsigned char		buffer[PAGE_SIZE -
> > +					sizeof(struct seq_buf) -
> > +					sizeof(struct irq_work)];
> > +};
> > +
> > +/* Safe printing in NMI context */
> > +static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq);
> > +
> > +static void print_seq_line(struct nmi_seq_buf *s, int start, int end)
> > +{
> > +	const char *buf = s->buffer + start;
> > +
> > +	printk("%.*s", (end - start) + 1, buf);
> > +}
> > +
> > +static void __printk_nmi_flush(struct irq_work *work)
> > +{
> > +	struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);
> > +	int len, last_i = 0, i = 0;
> > +
> > +again:
> > +	len = seq_buf_used(&s->seq);
> > +	if (!len)
> > +		return;
> > +
> > +	/* Print line by line. */
> > +	for (; i < len; i++) {
> > +		if (s->buffer[i] == '\n') {
> > +			print_seq_line(s, last_i, i);
> > +			last_i = i + 1;
> > +		}
> > +	}
> > +	/* Check if there was a partial line. */
> > +	if (last_i < len) {
> > +		print_seq_line(s, last_i, len - 1);
> > +		pr_cont("\n");
> > +	}
> > +
> > +	/*
> > +	 * Another NMI could have come in while we were printing
> > +	 * check if nothing has been added to the buffer.
> > +	 */
> > +	if (cmpxchg_local(&s->seq.len, len, 0) != len)
> > +		goto again;
>
> If another NMI comes at this point, it will start filling the buffer
> from the beginning. If it is fast enough, it might overwrite the text
> that we are printing above.
>
> Note that it can easily be faster because the normal printk() is more
> complex and might wait for the console.

Also note that show_regs() calls many separate printk()s and the
irq_work is scheduled by the first one => it is quite likely that some
backtraces will get messed up.


Another problem is that __printk_nmi_flush() is per-CPU => multiple
instances might run in parallel. I have not tested this, but I guess
that it will mix backtraces from different CPUs in the main ring
buffer.

Note that the backtraces used to be serialized via

	static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;

see commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on
all CPUs").
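
For reference, the old serialization in
arch_trigger_all_cpu_backtrace_handler() looked roughly like this
(a sketch from memory, not the exact code):

	static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;

	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
		/* only one CPU dumps its registers at a time */
		arch_spin_lock(&lock);
		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
		show_regs(regs);
		arch_spin_unlock(&lock);
		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
		return NMI_HANDLED;
	}

	return NMI_DONE;

i.e. the lock kept each backtrace together in the output; nothing
provides that guarantee once every CPU flushes its own seq_buf from an
independent irq_work.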


A solution would be to block __printk_nmi_flush() until
printk_nmi_exit() is called, using a flag bit and cmpxchg.

Also, we could add one more bit so that vprintk_nmi() drops messages
until the buffer has been printed, to avoid any mess.
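
Roughly something like this (a very rough sketch; the flag names and
their placement are made up, and the bit handling would still need
cmpxchg() to really close all the windows):

	/* bits in a new per-CPU word next to the seq_buf */
	#define NMI_PRINTK_BUSY		0	/* between printk_nmi_enter()/exit() */
	#define NMI_PRINTK_DRAINING	1	/* __printk_nmi_flush() is printing */

	static DEFINE_PER_CPU(unsigned long, nmi_printk_state);

	static void __printk_nmi_flush(struct irq_work *work)
	{
		unsigned long *state = this_cpu_ptr(&nmi_printk_state);

		/* do not touch the buffer while an NMI might be writing it */
		if (test_bit(NMI_PRINTK_BUSY, state))
			return;

		set_bit(NMI_PRINTK_DRAINING, state);
		/* ... print the seq_buf line by line as before ... */
		clear_bit(NMI_PRINTK_DRAINING, state);
	}

	static int vprintk_nmi(const char *fmt, va_list args)
	{
		/* drop messages until the buffer has been printed */
		if (test_bit(NMI_PRINTK_DRAINING, this_cpu_ptr(&nmi_printk_state)))
			return 0;
		/* ... append to the seq_buf as before ... */
	}

printk_nmi_exit() would also have to re-queue the irq_work whenever
the flush was skipped because of the busy bit.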

Huh, it is getting more and more complicated.


Best Regards,
Petr

> > +}
> > +
> > +void printk_init(void)
> > +{
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		struct nmi_seq_buf *s = &per_cpu(nmi_print_seq, cpu);
> > +
> > +		init_irq_work(&s->work, __printk_nmi_flush);
> > +		seq_buf_init(&s->seq, s->buffer, sizeof(s->buffer));
> > +	}
> > +}
> > +
> > +/*
> > + * It is not safe to call printk() directly from NMI handlers.
> > + * It may be fine if the NMI detected a lock up and we have no choice
> > + * but to do so, but doing a NMI on all other CPUs to get a back trace
> > + * can be done with a sysrq-l. We don't want that to lock up, which
> > + * can happen if the NMI interrupts a printk in progress.
> > + *
> > + * Instead, we redirect the vprintk() to this nmi_vprintk() that writes
> > + * the content into a per cpu seq_buf buffer. Then when the NMIs are
> > + * all done, we can safely dump the contents of the seq_buf to a printk()
> > + * from a non NMI context.
> > + */
> > +static int vprintk_nmi(const char *fmt, va_list args)
> > +{
> > +	struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
> > +	unsigned int len = seq_buf_used(&s->seq);
> > +
> > +	irq_work_queue(&s->work);
> > +	seq_buf_vprintf(&s->seq, fmt, args);
>
> This is more critical. seq_buf_vprintf() has the following code:
>
> 	len = vsnprintf(s->buffer + s->len, s->size - s->len, fmt, args);
>
> The race might look like:
>
> CPU0                              CPU1
>
>                                   __printk_nmi_flush()
>                                     ...
>
> seq_buf_vprintf()
>
>   s->buffer + s->len
>
>                                     cmpxchg_local(&s->seq.len,
>                                                   len, 0)
>
>   s->size - s->len
>
> => that seq_buf_vprintf() will start writing at the old end of the
> seq_buf but think that the entire buffer is available, which might
> cause a buffer overflow.
>
>
> Well, it is rather theoretical. s->len will most likely be read only
> once (a compiler optimization). Anyway, we must make sure that all
> the seq_buf operations used here are safe against such races.
> It is not obvious and might be prone to regressions.
>
> To be honest, I am not familiar with cmpxchg_local(). But I think
> that it cannot make much difference here. The value has to be synced
> to the other CPUs at some point anyway.
>
>
> Note that Steven's solution is safe because the prints are serialized via
> @backtrace_flag in arch_trigger_all_cpu_backtrace().
>
>
> Best Regards,
> Petr