Re: general protection fault in wb_workfn (2)

From: Dmitry Vyukov
Date: Fri Jun 08 2018 - 13:15:07 EST


On Fri, Jun 8, 2018 at 6:53 PM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> On Fri, Jun 8, 2018 at 5:16 PM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>>> On Fri, Jun 8, 2018 at 4:31 AM, Tetsuo Handa
>>> <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
>>>> Dmitry Vyukov wrote:
>>>>> On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
>>>>> <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>> > Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
>>>>> > against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...
>>>>>
>>>>> Hi Tetsuo,
>>>>>
>>>>> Most of the reasons for not doing it still stand. A syzkaller instance
>>>>> will produce not just this bug, it will produce hundreds of different
>>>>> bugs. Then the question is: what to do with these bugs? Report all to
>>>>> mailing lists?
>>>>
>>>> Is it possible to add linux-next.git tree as a target for fuzzing? If yes,
>>>> we can try debug patches easily, in addition to find bugs earlier than now.
>>>
>>> syzbot tested linux-next and mmotm initially, but they were removed at
>>> the request of kernel developers. See:
>>> https://groups.google.com/d/msg/syzkaller/0H0LHW_ayR8/dsK5qGB_AQAJ
>>> and:
>>> https://groups.google.com/d/msg/syzkaller-bugs/FeAgni6Atlk/U0JGoR0AAwAJ
>>> Indeed, linux-next produces around 50 assorted one-off unexplainable
>>> bug reports.
>>>
>>>
>>>>> I think the solution here is just to run syzkaller instance locally.
>>>>> It's just a program anybody can run it on any kernel with any custom
>>>>> patches. Moreover for local instance it's also possible to limit set
>>>>> of tested syscalls to increase probability of hitting this bug and at
>>>>> the same time filter out most of other bugs.
>>>>
>>>> If this bug is reproducible with VM resources individual developer can afford...
>>>>
>>>> Since my Linux development environment is VMware guests on a Windows PC, I can't
>>>> run VM instance which needs KVM acceleration. Also, due to security policy, I can't
>>>> utilize external VM resources available on the Internet, as well as I can't use ssh
>>>> and git protocols. Speak of this bug, even with a lot of VM instances, syzbot can
>>>> reproduce this bug only once or twice per a day. Thus, the question for me boils
>>>> down to, whether I can reproduce this bug using one VMware guest instance with 4GB
>>>> of memory. Effectively, I don't have access to environments for running syzkaller
>>>> instance...
>>>
>>> Well, I don't know what to say, it does require some resources.
>>>
>>>>> Do we have any idea about the guilty subsystem? You mentioned
>>>>> bdi_unregister, why? What would be the set of syscalls to concentrate
>>>>> on?
>>>>> I will do a custom run when I get around to it, if nobody else beats me to it.
>>>>
>>>> Because bdi_unregister() does "bdi->dev = NULL;" which wb_workfn() is hitting
>>>> NULL pointer dereference.
>>>
>>> Right, wb_workfn is not a generic function, it's fs-specific function.
>>>
>>> Trying to reproduce this locally now.
>>
>>
>> No luck so far.
>>
>> Trying to look from a different angle: is it possible that bdi->dev is
>> not set yet, rather then already reset?
>
>
> I was able to reproduce this once locally running syz-crush utility
> replaying one of the crash logs. Now running with Tetsuo's patch.
>
> I can say we hunting a very subtle race condition with short
> inconsistency window, perhaps few instructions.


Here we go:

[ 2853.033175] WARNING: wb_workfn: device is NULL
[ 2853.034709] wb->state=2
[ 2853.035486] (wb == &wb->bdi->wb)=0
[ 2853.036489] list_empty(&wb->work_list)=1
[ 2853.037603] list_empty(&wb->bdi->bdi_list)=0
[ 2853.038843] wb->bdi->wb.state=0
[ 2853.039819] (wb->congested->__bdi == wb->bdi)=1
[ 2853.041062] list_empty(&wb->congested->__bdi->bdi_list)=0
[ 2853.042609] wb->congested->__bdi->wb.state=0
[ 2853.043793] kasan: CONFIG_KASAN_INLINE enabled
[ 2853.045315] kasan: GPF could be caused by NULL-ptr deref or user
memory access
[ 2853.047376] general protection fault: 0000 [#1] SMP KASAN
[ 2853.048980] CPU: 1 PID: 13971 Comm: kworker/u12:8 Not tainted 4.17.0+ #21
[ 2853.050762] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[ 2853.053034] Workqueue: writeback wb_workfn
[ 2853.054193] RIP: 0010:wb_workfn+0x187/0xab0
[ 2853.055360] Code: 85 70 fd ff ff 48 83 e8 10 48 89 85 60 fd ff ff
e8 5e 38 ab ff 48 8d 7b 50 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
c1 ea 03 <80> 3c 02 00 0f 85 05 08 00 00 4c 8b 63 50 4d 85 e4 0f 84 b5
05 00
[ 2853.060692] RSP: 0018:ffff8800600ef480 EFLAGS: 00010206
[ 2853.062210] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2853.064215] RDX: 000000000000000a RSI: ffffffff81cf0312 RDI: 0000000000000050
[ 2853.066198] RBP: ffff8800600ef750 R08: ffff880061e30400 R09: ffffed000d8ccfc0
[ 2853.068037] R10: ffffed000d8ccfc0 R11: ffff88006c667e07 R12: 1ffff1000c01dead
[ 2853.069970] R13: ffff8800600ef728 R14: ffff8800676bd180 R15: dffffc0000000000
[ 2853.071932] FS: 0000000000000000(0000) GS:ffff88006c640000(0000)
knlGS:0000000000000000
[ 2853.074080] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2853.075699] CR2: 00007ffc8478c000 CR3: 0000000008c6a006 CR4: 00000000001626e0
[ 2853.077633] Call Trace:
[ 2853.078341] ? inode_wait_for_writeback+0x40/0x40
[ 2853.079642] ? graph_lock+0x170/0x170
[ 2853.080710] ? lock_downgrade+0x8e0/0x8e0
[ 2853.081889] ? find_held_lock+0x36/0x1c0
[ 2853.083047] ? graph_lock+0x170/0x170
[ 2853.084105] ? lock_acquire+0x1dc/0x520
[ 2853.085216] ? process_one_work+0xb8c/0x1b70
[ 2853.086425] ? kasan_check_read+0x11/0x20
[ 2853.087608] ? __lock_is_held+0xb5/0x140
[ 2853.088720] process_one_work+0xc64/0x1b70
[ 2853.089875] ? finish_task_switch+0x182/0x840
[ 2853.091085] ? pwq_dec_nr_in_flight+0x490/0x490
[ 2853.092358] ? __schedule+0x809/0x1e30
[ 2853.093475] ? retint_kernel+0x10/0x10
[ 2853.094561] ? retint_kernel+0x10/0x10
[ 2853.095610] ? graph_lock+0x170/0x170
[ 2853.096623] ? trace_hardirqs_on_caller+0x421/0x5c0
[ 2853.097981] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 2853.099298] ? find_held_lock+0x36/0x1c0
[ 2853.100394] ? lock_acquire+0x1dc/0x520
[ 2853.101472] ? worker_thread+0x3d4/0x13a0
[ 2853.102558] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 2853.104060] ? move_linked_works+0x2f6/0x470
[ 2853.105204] ? trace_event_raw_event_workqueue_execute_start+0x290/0x290
[ 2853.107013] ? do_raw_spin_trylock+0x1b0/0x1b0
[ 2853.108243] worker_thread+0x9e5/0x13a0
[ 2853.109304] ? process_one_work+0x1b70/0x1b70
[ 2853.110488] ? graph_lock+0x170/0x170
[ 2853.111514] ? find_held_lock+0x36/0x1c0
[ 2853.112610] ? find_held_lock+0x36/0x1c0
[ 2853.113674] ? __schedule+0x1e30/0x1e30
[ 2853.114714] ? do_raw_spin_unlock+0x1f9/0x2e0
[ 2853.115885] ? do_raw_spin_trylock+0x1b0/0x1b0
[ 2853.117060] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 2853.118492] ? __kthread_parkme+0x111/0x1d0
[ 2853.119616] ? parse_args.cold.15+0x1b3/0x1b3
[ 2853.120804] ? trace_hardirqs_on_caller+0x421/0x5c0
[ 2853.122071] ? trace_hardirqs_on+0xd/0x10
[ 2853.123117] kthread+0x345/0x410
[ 2853.123999] ? process_one_work+0x1b70/0x1b70
[ 2853.125174] ? kthread_bind+0x40/0x40
[ 2853.126201] ret_from_fork+0x3a/0x50
[ 2853.127199] Modules linked in:
[ 2853.128022] Dumping ftrace buffer:
[ 2853.128901] (ftrace buffer empty)
[ 2853.129986] ---[ end trace 3ba28e076cb32fda ]---
[ 2853.131269] RIP: 0010:wb_workfn+0x187/0xab0
[ 2853.132441] Code: 85 70 fd ff ff 48 83 e8 10 48 89 85 60 fd ff ff
e8 5e 38 ab ff 48 8d 7b 50 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
c1 ea 03 <80> 3c 02 00 0f 85 05 08 00 00 4c 8b 63 50 4d 85 e4 0f 84 b5
05 00
[ 2853.137449] RSP: 0018:ffff8800600ef480 EFLAGS: 00010206
[ 2853.138618] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2853.140176] RDX: 000000000000000a RSI: ffffffff81cf0312 RDI: 0000000000000050
[ 2853.141722] RBP: ffff8800600ef750 R08: ffff880061e30400 R09: ffffed000d8ccfc0
[ 2853.143300] R10: ffffed000d8ccfc0 R11: ffff88006c667e07 R12: 1ffff1000c01dead
[ 2853.144841] R13: ffff8800600ef728 R14: ffff8800676bd180 R15: dffffc0000000000
[ 2853.146391] FS: 0000000000000000(0000) GS:ffff88006c640000(0000)
knlGS:0000000000000000
[ 2853.148141] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2853.149406] CR2: 00007ffc8478c000 CR3: 0000000008c6a006 CR4: 00000000001626e0
[ 2853.150968] Kernel panic - not syncing: Fatal exception
[ 2853.152419] Dumping ftrace buffer:
[ 2853.153121] (ftrace buffer empty)
[ 2853.153786] Kernel Offset: disabled
[ 2853.154442] Rebooting in 86400 seconds..