Re: [syzbot] BUG: sleeping function called from invalid context in __might_resched

From: Fabio M. De Francesco
Date: Tue Nov 16 2021 - 03:54:07 EST


On Tuesday, November 16, 2021 9:09:11 AM CET syzbot wrote:
> Hello,
>
> syzbot has tested the proposed patch but the reproducer is still triggering an issue:
> BUG: sleeping function called from invalid context in __might_resched
>
> BUG: sleeping function called from invalid context at kernel/printk/printk.c:2522
> in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 8755, name: syz-executor.2
> preempt_count: 1, expected: 0
> RCU nest depth: 0, expected: 0
> 3 locks held by syz-executor.2/8755:
> #0: ffff888070c9a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x22/0x80 drivers/tty/tty_ldisc.c:252
> #1: ffff888070c9a468 (&tty->flow.lock){....}-{2:2}, at: spin_lock_irq include/linux/spinlock.h:374 [inline]
> (&tty->flow.lock){....}-{2:2}, at: n_tty_ioctl_helper+0xb6/0x2d0 drivers/tty/tty_ioctl.c:877
> #2: ffff888070c9a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref+0x1d/0x80 drivers/tty/tty_ldisc.c:273
> irq event stamp: 916
> hardirqs last enabled at (915): [<ffffffff81beabd5>] kasan_quarantine_put+0xf5/0x210 mm/kasan/quarantine.c:220
> hardirqs last disabled at (916): [<ffffffff8950a731>] __raw_spin_lock_irq include/linux/spinlock_api_smp.h:117 [inline]
> hardirqs last disabled at (916): [<ffffffff8950a731>] _raw_spin_lock_irq+0x41/0x50 kernel/locking/spinlock.c:170
> softirqs last enabled at (0): [<ffffffff8144cf0c>] copy_process+0x1e8c/0x75a0 kernel/fork.c:2136
> softirqs last disabled at (0): [<0000000000000000>] 0x0
> Preemption disabled at:
> [<0000000000000000>] 0x0
> CPU: 1 PID: 8755 Comm: syz-executor.2 Not tainted 5.16.0-rc1-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> __might_resched.cold+0x222/0x26b kernel/sched/core.c:9542
> console_lock+0x17/0x80 kernel/printk/printk.c:2522
> con_flush_chars drivers/tty/vt/vt.c:3365 [inline]
> con_flush_chars+0x35/0x90 drivers/tty/vt/vt.c:3357
> con_write+0x2c/0x40 drivers/tty/vt/vt.c:3296

The reproducer is still triggering an issue, but this time it looks like it
is triggered by a different path of execution.

The same invalid "in_interrupt()" test is also in con_flush_chars().

Let's try to remove it too...
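
To illustrate why in_interrupt() does not catch this case, here is a small
userspace model (not kernel code). The bit layout is my simplified reading of
include/linux/preempt.h and should be treated as an assumption; all the
model_* names are hypothetical. The state it sets up is the one from the
report above: preempt_count 1 and interrupts disabled, with tty->flow.lock
taken via spin_lock_irq().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the preempt_count bit fields. The widths are an
 * assumption made for illustration and may differ between kernel versions.
 */
#define MODEL_PREEMPT_MASK 0x000000ffu
#define MODEL_SOFTIRQ_MASK 0x0000ff00u
#define MODEL_HARDIRQ_MASK 0x000f0000u
#define MODEL_NMI_MASK     0x00f00000u

struct model_ctx {
	unsigned int preempt_count;
	bool irqs_disabled;
};

/* Models in_interrupt(): only the softirq, hardirq and NMI fields count. */
static bool model_in_interrupt(const struct model_ctx *c)
{
	return (c->preempt_count & (MODEL_SOFTIRQ_MASK |
				    MODEL_HARDIRQ_MASK |
				    MODEL_NMI_MASK)) != 0;
}

/* Models the replacement test: preempt_count() || irqs_disabled(). */
static bool model_atomic(const struct model_ctx *c)
{
	return c->preempt_count != 0 || c->irqs_disabled;
}

/* State seen in the report: preempt_count: 1, irqs_disabled(): 1. */
static struct model_ctx model_spin_lock_irq_held(void)
{
	struct model_ctx c = { .preempt_count = 1, .irqs_disabled = true };

	return c;
}

/* Self-check: the old test misses the spinlock case, the new one does not. */
static void model_demo(void)
{
	struct model_ctx c = model_spin_lock_irq_held();

	assert(!model_in_interrupt(&c));
	assert(model_atomic(&c));
}
```

In this model the old test lets the call through to console_lock() while the
new one bails out early, which matches what the trace shows.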

My first idea was to replace "if (in_interrupt())" with the same
"preempt_count() || irqs_disabled()" test I used in do_con_write(). However,
I noticed that both do_con_write() and con_flush_chars() are only called from
inside con_write() (which, aside from calling those two functions, does
nothing else).

So why not remove the "if (in_interrupt())" from both of them and use
"if (preempt_count() || irqs_disabled())" just once in con_write()?

I think this should be the right solution, but I prefer to go one step at a
time.
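
For reference, the single-check variant could look roughly like the userspace
sketch below. This is not the patch I'm sending and not kernel code: the
*_stub and *_model names and the fake_* globals are hypothetical stand-ins,
and the early "return count" simply mirrors what do_con_write() already does
when it bails out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state, for illustration only. */
static unsigned int fake_preempt_count;
static bool fake_irqs_disabled;
static int console_lock_calls;

static unsigned int preempt_count_stub(void) { return fake_preempt_count; }
static bool irqs_disabled_stub(void) { return fake_irqs_disabled; }

/* console_lock() may sleep, so it must never run in atomic context. */
static void console_lock_stub(void)
{
	assert(fake_preempt_count == 0 && !fake_irqs_disabled);
	console_lock_calls++;
}

/* Both callees take the console lock and no longer test the context. */
static int do_con_write_model(int count)
{
	console_lock_stub();
	return count;
}

static void con_flush_chars_model(void)
{
	console_lock_stub();
}

/* The only caller performs the test once, before either callee runs. */
static int con_write_model(int count)
{
	int retval;

	if (preempt_count_stub() || irqs_disabled_stub())
		return count;

	retval = do_con_write_model(count);
	con_flush_chars_model();

	return retval;
}
```

With fake_preempt_count set to 1 and fake_irqs_disabled set to true,
con_write_model() returns without ever reaching console_lock_stub(). With
both cleared, the lock is taken twice, once per callee.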

Therefore, I'll (1) use the same test in con_flush_chars() as well (it would
be redundant if the test were done in con_write()), (2) wait for syzbot to
confirm that it fixes the bug, and (3) wait for the maintainers' review and
suggestions on whether to move those tests one level up.

#syz test:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

---
Fabio M. De Francesco
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 7359c3e80d63..46511d1ac6ee 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2902,7 +2902,7 @@ static int do_con_write(struct tty_struct *tty, const unsigned char *buf, int co
 	struct vt_notifier_param param;
 	bool rescan;

-	if (in_interrupt())
+	if (preempt_count() || irqs_disabled())
 		return count;

 	console_lock();
@@ -3358,7 +3358,7 @@ static void con_flush_chars(struct tty_struct *tty)
 {
 	struct vc_data *vc;

-	if (in_interrupt()) /* from flush_to_ldisc */
+	if (preempt_count() || irqs_disabled()) /* from flush_to_ldisc */
 		return;

 	/* if we race with con_close(), vt may be null */