Re: [WARNING: A/V UNSCANNABLE][Merge tag 'media/v4.11-1' of git] ff58d005cd: BUG: unable to handle kernel NULL pointer dereference at 0000039c

From: Sean Young
Date: Sat Feb 25 2017 - 06:17:16 EST


On Fri, Feb 24, 2017 at 11:15:51AM -0800, Linus Torvalds wrote:
> Added more relevant people. I've debugged the immediate problem below,
> but I think there's another problem that actually triggered this.
>
> On Fri, Feb 24, 2017 at 10:28 AM, kernel test robot
> <fengguang.wu@xxxxxxxxx> wrote:
> >
> > 0day kernel testing robot got the below dmesg and the first bad commit is
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >
> > commit ff58d005cd10fcd372787cceac547e11cf706ff6
> > Merge: 5ab3566 9eeb0ed
> >
> > Merge tag 'media/v4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
> [...]
> > [ 4.664940] rc rc0: lirc_dev: driver ir-lirc-codec (rc-loopback) registered at minor = 0
> > [ 4.666322] BUG: unable to handle kernel NULL pointer dereference at 0000039c
> > [ 4.666675] IP: serial_ir_irq_handler+0x189/0x410
>
> This merge being fingered ends up being a subtle interaction with other changes.
>
> Those "other changes" are (again) the interrupt retrigger code that
> was reverted for 4.10, and then we tried to merge them again this
> merge window.
>
> Because the immediate cause is:
>
> > [ 4.666675] EIP: serial_ir_irq_handler+0x189/0x410
> > [ 4.666675] Call Trace:
> > [ 4.666675] <IRQ>
> > [ 4.666675] __handle_irq_event_percpu+0x57/0x100
> > [ 4.666675] handle_irq_event_percpu+0x1d/0x50
> > [ 4.666675] handle_irq_event+0x32/0x60
> > [ 4.666675] handle_edge_irq+0xa5/0x120
> > [ 4.666675] handle_irq+0x9d/0xd0
> > [ 4.666675] </IRQ>
> > [ 4.666675] do_IRQ+0x5f/0x130
> > [ 4.666675] common_interrupt+0x33/0x38
> > [ 4.666675] EIP: hardware_init_port+0x3f/0x190
> > [ 4.666675] EFLAGS: 00200246 CPU: 0
> > [ 4.666675] EAX: c718990f EBX: 00000000 ECX: 00000000 EDX: 000003f9
> > [ 4.666675] ESI: 000003f9 EDI: 000003f8 EBP: c0065d98 ESP: c0065d84
> > [ 4.666675] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> > [ 4.666675] serial_ir_probe+0xbb/0x300
> > [ 4.666675] platform_drv_probe+0x48/0xb0
> ...
>
> ie an interrupt came in immediately after the request_irq(), before
> all the data was properly set up, which then causes the interrupt
> handler to take a fault because it tries to access some field that
> hasn't even been set up yet.

Oh dear. I've pointed out others making the same mistake when doing code
reviews, clearly I need review my own code better.

>
> The code line is helpful, the faulting instruction is
>
> mov 0x39c(%rax),%eax <--- fault
> call ..
> mov someglobalvar,%edx
>
> which together with the supplied config file makes me able to match it
> up with the assembly generation around it:
>
> inb %dx, %al # tmp254, value
> andb $1, %al #, tmp255
> testb %al, %al # tmp255
> je .L233 #,
> .L215:
> movl serial_ir+8, %eax # serial_ir.rcdev, serial_ir.rcdev
> xorl %edx, %edx # _66->timeout
> movl 924(%eax), %eax # _66->timeout, _66->timeout
> call nsecs_to_jiffies #
> movl jiffies, %edx # jiffies, jiffies.33_70
> addl %eax, %edx # _69, tmp259
> movl $serial_ir+16, %eax #,
> call mod_timer #
> movl serial_ir+8, %eax # serial_ir.rcdev,
> call ir_raw_event_handle #
> movl $1, %eax #, <retval>
>
> so it's that "serial_ir.rcdev->timeout" access that faults. So this is
> the faulting source code:
>
> drivers/media/rc/serial_ir.c: 402
>
> mod_timer(&serial_ir.timeout_timer,
> jiffies + nsecs_to_jiffies(serial_ir.rcdev->timeout));
>
> ir_raw_event_handle(serial_ir.rcdev);
>
> return IRQ_HANDLED;
>
> and serial_ir.rcdev is NULL when ti tries to look up the timeout.

ir_raw_event_handle() call will also go bang if passed a null pointer, so
this problem existed before (since v4.10).

Thanks for debugging this, I'll send a patch as a reply to this email.


Sean