Re: [BUG] possible deadlock in __schedule (with reproducer available)
From: Akinobu Mita
Date: Sun Dec 01 2024 - 07:53:56 EST
2024年11月29日(金) 21:09 Peter Zijlstra <peterz@xxxxxxxxxxxxx>:
>
> On Fri, Nov 29, 2024 at 05:35:54PM +0900, Masami Hiramatsu wrote:
> > On Sat, 23 Nov 2024 03:39:45 +0000
> > Ruan Bonan <bonan.ruan@xxxxxxxxx> wrote:
> >
> > >
> > > vprintk_emit+0x414/0xb90 kernel/printk/printk.c:2406
> > > _printk+0x7a/0xa0 kernel/printk/printk.c:2432
> > > fail_dump lib/fault-inject.c:46 [inline]
> > > should_fail_ex+0x3be/0x570 lib/fault-inject.c:154
> > > strncpy_from_user+0x36/0x230 lib/strncpy_from_user.c:118
> > > strncpy_from_user_nofault+0x71/0x140 mm/maccess.c:186
> > > bpf_probe_read_user_str_common kernel/trace/bpf_trace.c:215 [inline]
> > > ____bpf_probe_read_user_str kernel/trace/bpf_trace.c:224 [inline]
> >
> > Hmm, this is a combination issue of BPF and fault injection.
> >
> > static void fail_dump(struct fault_attr *attr)
> > {
> > if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) {
> > printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n"
> > "name %pd, interval %lu, probability %lu, "
> > "space %d, times %d\n", attr->dname,
> > attr->interval, attr->probability,
> > atomic_read(&attr->space),
> > atomic_read(&attr->times));
> >
> > This printk() acquires console lock under rq->lock has been acquired.
> >
> > This can happen if we use fault injection and trace event too because
> > the fault injection caused printk warning.
>
> Ah indeed. Same difference though, if you don't know the context, most
> things are unsafe to do.
>
> > I think this should be a bug of the fault injection, not tracing/BPF.
> > And to solve this issue, we may be able to check the context and if
> > it is tracing/NMI etc, fault injection should NOT make it failure.
>
> Well, it should be okay to cause the failure, but it must be very
> careful how it goes about doing that. Tripping printk() definitely is
> out.
>
> But there's a much bigger problem there, get_random*() is not wait-free,
> in fact it takes a spinlock_t which makes that it is unusable from most
> context, and it's definitely out for tracing.
>
> Notably, this spinlock_t makes that it is unsafe to use from anything
> that holds a raw_spinlock_t or is from hardirq context, or has
> preempt_disable() -- which is a TON of code.
>
> On this alone I would currently label the whole of fault-injection
> broken. The should_fail() call itself is unsafe where many of its
> callsites are otherwise perfectly fine -- eg. usercopy per the above.
>
> Perhaps it should use a simple PRNG, a simple LFSR should be plenty good
> enough to provide failure conditions.
Sounds good.
> And yeah, I would just completely rip out the printk. Trying to figure
> out where and when it's safe to call printk() is non-trivial and just
> not worth the effort imo.
Instead of removing the printk completely, How about setting the default value
of the verbose option to zero so it doesn't call printk and gives a loud
warning when changing the verbose option?