Re: [syzbot] BUG: sleeping function called from invalid context in __fdget_pos

From: Ard Biesheuvel
Date: Wed Jun 30 2021 - 03:42:28 EST


On Tue, 29 Jun 2021 at 16:46, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> ... adding Ard who was recently modifying some of the
> kernel_fpu_begin/end() sites in the AESNI crypto code.
>
> On 6/28/21 12:22 PM, syzbot wrote:
> > console output: https://syzkaller.appspot.com/x/log.txt?x=170e6c94300000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=42ecca11b759d96c
> > dashboard link: https://syzkaller.appspot.com/bug?extid=5d1bad8042a8f0e8117a
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> ...
> > BUG: sleeping function called from invalid context at kernel/locking/mutex.c:938
> > in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 29652, name: syz-executor.0
> > no locks held by syz-executor.0/29652.
> > Preemption disabled at:
> > [<ffffffff812aa454>] kernel_fpu_begin_mask+0x64/0x260 arch/x86/kernel/fpu/core.c:126
> > CPU: 0 PID: 29652 Comm: syz-executor.0 Not tainted 5.13.0-rc7-syzkaller #0
>
> There's a better backtrace in the log before the rather useless
> backtrace from lockdep:
>
> > [ 1341.360547][T29635] FAULT_INJECTION: forcing a failure.
> > [ 1341.360547][T29635] name failslab, interval 1, probability 0, space 0, times 0
> > [ 1341.374439][T29635] CPU: 1 PID: 29635 Comm: syz-executor.0 Not tainted 5.13.0-rc7-syzkaller #0
> > [ 1341.374712][T29630] FAT-fs (loop2): bogus number of reserved sectors
> > [ 1341.383571][T29635] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > [ 1341.383591][T29635] Call Trace:
> > [ 1341.383603][T29635] dump_stack+0x141/0x1d7
> > [ 1341.383630][T29635] should_fail.cold+0x5/0xa
> > [ 1341.383651][T29635] ? skcipher_walk_next+0x6e2/0x1680
> > [ 1341.383673][T29635] should_failslab+0x5/0x10
> > [ 1341.383691][T29635] __kmalloc+0x72/0x330
> > [ 1341.383720][T29635] skcipher_walk_next+0x6e2/0x1680
> > [ 1341.383744][T29635] ? kfree+0xe5/0x7f0
> > [ 1341.383776][T29635] skcipher_walk_first+0xf8/0x3c0
> > [ 1341.383805][T29635] skcipher_walk_virt+0x523/0x760
> > [ 1341.445438][T29635] xts_crypt+0x137/0x7f0
> > [ 1341.449689][T29635] ? aesni_encrypt+0x80/0x80
>
> There's one suspect-looking site in xts_crypt():
>
> >	kernel_fpu_begin();
> >
> >	/* calculate first value of T */
> >	aesni_enc(aes_ctx(ctx->raw_tweak_ctx), walk.iv, walk.iv);
> >
> >	while (walk.nbytes > 0) {
> >		int nbytes = walk.nbytes;
> >
> >		...
> >
> >		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
> >
> >		kernel_fpu_end();
> >
> >		if (walk.nbytes > 0)
> >			kernel_fpu_begin();
> >	}
>
> I wonder if a slab allocation failure could leave us with walk.nbytes==0.

The code is actually the other way around: kernel_fpu_end() comes
before the call to skcipher_walk_done().

So IIUC, the fault injection here forces a slab allocation failure, to
check whether the code deals with it gracefully, right?

The skcipher walk API guarantees that walk.nbytes == 0 if an error is
returned, so the pairing of FPU begin/end looks correct to me. And
skcipher_walk_next() should not invoke anything that might sleep from
this particular context.
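
To make the pairing argument concrete, here is a minimal user-space
sketch of the loop shape described above. Everything in it is a
hypothetical stand-in (the model_* names, MODEL_CHUNK, MODEL_ENOMEM),
not the kernel API. It assumes that the first walk step has already
succeeded with a non-zero length, that the "done" step is the only
place an allocation can fail, and that nbytes is zero whenever an
error is returned.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration only; not the kernel API. */
#define MODEL_CHUNK  64u
#define MODEL_ENOMEM 12

static int fpu_depth;	/* > 0 models "inside a kernel_fpu_begin() section" */

static void model_fpu_begin(void)
{
	fpu_depth++;
}

static void model_fpu_end(void)
{
	assert(fpu_depth > 0);	/* an end without a matching begin */
	fpu_depth--;
}

struct model_walk {
	unsigned int nbytes;	/* bytes available in the current step */
	unsigned int total;	/* bytes left in the whole request */
	int step;		/* number of "done" steps taken so far */
	int fail_at_step;	/* step at which to inject a failure, -1 for none */
};

/* Models the part of the walk that may allocate, and hence may sleep. */
static int model_walk_next(struct model_walk *w)
{
	assert(fpu_depth == 0);	/* must not run inside the FPU section */

	if (w->step++ == w->fail_at_step) {
		w->nbytes = 0;	/* assumed guarantee: nbytes == 0 on error */
		return -MODEL_ENOMEM;
	}
	w->nbytes = w->total < MODEL_CHUNK ? w->total : MODEL_CHUNK;
	return 0;
}

static int model_walk_done(struct model_walk *w, unsigned int leftover)
{
	w->total -= w->nbytes - leftover;
	if (w->total == 0) {
		w->nbytes = 0;
		return 0;
	}
	return model_walk_next(w);
}

/* Loop shape as described in this mail: end the FPU section, then walk. */
static int model_xts_crypt(unsigned int total, int fail_at_step)
{
	struct model_walk w = { 0 };
	int err = 0;

	assert(total > 0);	/* the entry path is outside this model */
	w.total = total;
	w.fail_at_step = fail_at_step;
	w.nbytes = total < MODEL_CHUNK ? total : MODEL_CHUNK;

	model_fpu_begin();
	while (w.nbytes > 0) {
		/* ... process w.nbytes bytes with the FPU ... */
		model_fpu_end();

		err = model_walk_done(&w, 0);

		if (w.nbytes > 0)
			model_fpu_begin();
	}
	return err;
}
```

Under those assumptions, fpu_depth is back at zero on return whether
or not a failure is injected, and the step that may allocate never
runs with fpu_depth raised. The sketch says nothing about paths
outside the loop.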

Herbert, any ideas?