Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues

From: Mohammed Naser
Date: Tue May 26 2015 - 14:05:19 EST


Also, regarding replication.. I'll try to make a setup that "mimics"
the crash but unfortunately I don't have a full reproducible case (but
when cgroup bandwidth limits are in place, server cannot last 24 hours
up)

Thanks!
Mohammed

On Tue, May 26, 2015 at 1:56 PM, Mohammed Naser <mnaser@xxxxxxxxxxxx> wrote:
> Hi Benjamin,
>
> Thank you for getting in touch. With as much as I can do (and I'm
> documenting this just to be sure I didn't take the wrong path), here
> is what I can provide:
>
> The crash occurred at:
>
> [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
> set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> The value of RDI will be memory address of the first argument, RSI
> will be the memory address of the second argument, taking them out
> from the BUG message:
>
> RDI: ffff883ffedf3140 (cfs_rq)
> RSI: 0000000000000000 (se, NULL in the case)
>
> Therefore, `cfs_rq` is at `ffff883ffedf3140`.. looking it up...
>
> ==========================================================
> crash> cfs_rq ffff883ffedf3140
> struct cfs_rq {
> load = {
> weight = 0,
> inv_weight = 0
> },
> nr_running = 0,
> h_nr_running = 0,
> exec_clock = 50727270151190,
> min_vruntime = 1126033353437518,
> tasks_timeline = {
> rb_node = 0x0
> },
> rb_leftmost = 0x0,
> curr = 0x0,
> next = 0x0,
> last = 0x0,
> skip = 0x0,
> nr_spread_over = 960,
> runnable_load_avg = 0,
> blocked_load_avg = 108,
> decay_counter = {
> counter = 139289244
> },
> last_decay = 139289243,
> removed_load = {
> counter = 0
> },
> tg_runnable_contrib = 329,
> tg_load_contrib = 108,
> h_load = 75,
> last_h_load_update = 4331424052,
> h_load_next = 0xffff883fcab5d600,
> rq = 0xffff883ffedf30c0,
> on_list = 1,
> leaf_cfs_rq_list = {
> next = 0xffff883eec13eec0,
> prev = 0xffff883fcab5fec0
> },
> tg = 0xffffffff81ebdd80 <root_task_group>,
> runtime_enabled = 0,
> runtime_expires = 0,
> runtime_remaining = 0,
> throttled_clock = 0,
> throttled_clock_task = 0,
> throttled_clock_task_time = 0,
> throttled = 0,
> throttle_count = 0,
> throttled_list = {
> next = 0xffff883ffedf3250,
> prev = 0xffff883ffedf3250
> }
> }
> ==========================================================
>
> Following the memory address of tg at 0xffffffff81ebdd80
>
> ==========================================================
> crash> task_group ffffffff81ebdd80
> struct task_group {
> css = {
> cgroup = 0xffff883fd0858010,
> ss = 0xffffffff81c4e380 <cpu_cgrp_subsys>,
> refcnt = {
> count = {
> counter = 0
> },
> pcpu_count = 0x0,
> release = 0x0,
> confirm_kill = 0x0,
> rcu = {
> next = 0x0,
> func = 0x0
> }
> },
> parent = 0x0,
> sibling = {
> next = 0xffffffff81ebddc8 <root_task_group+72>,
> prev = 0xffffffff81ebddc8 <root_task_group+72>
> },
> children = {
> next = 0xffff881fcfab0448,
> prev = 0xffff883fd1ad7048
> },
> id = 1,
> flags = 3,
> serial_nr = 2,
> callback_head = {
> next = 0x0,
> func = 0x0
> },
> destroy_work = {
> data = {
> counter = 0
> },
> entry = {
> next = 0x0,
> prev = 0x0
> },
> func = 0x0
> }
> },
> se = 0xffff881fff010000,
> cfs_rq = 0xffff881fff010480,
> shares = 1024,
> load_avg = {
> counter = 1234
> },
> runnable_avg = {
> counter = 20014
> },
> rcu = {
> next = 0x0,
> func = 0x0
> },
> list = {
> next = 0xffffffff81c4e830 <task_groups>,
> prev = 0xffff883fd1b710e0
> },
> parent = 0x0,
> siblings = {
> next = 0xffffffff81ebde78 <root_task_group+248>,
> prev = 0xffffffff81ebde78 <root_task_group+248>
> },
> children = {
> next = 0xffff883a29cd58f8,
> prev = 0xffff883fd1b710f8
> },
> autogroup = 0x0,
> cfs_bandwidth = {
> lock = {
> raw_lock = {
> {
> head_tail = 0,
> tickets = {
> head = 0,
> tail = 0
> }
> }
> }
> },
> period = {
> tv64 = 100000000
> },
> quota = 18446744073709551615,
> runtime = 0,
> hierarchal_quota = -1,
> runtime_expires = 0,
> idle = 0,
> timer_active = 0,
> period_timer = {
> node = {
> node = {
> __rb_parent_color = 18446744071594303192,
> rb_right = 0x0,
> rb_left = 0x0
> },
> expires = {
> tv64 = 0
> }
> },
> _softexpires = {
> tv64 = 0
> },
> function = 0xffffffff810add70 <sched_cfs_period_timer>,
> base = 0xffff881fff40d1a0,
> state = 0,
> start_pid = -1,
> start_site = 0x0,
> start_comm =
> "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
> },
> slack_timer = {
> node = {
> node = {
> __rb_parent_color = 18446744071594303288,
> rb_right = 0x0,
> rb_left = 0x0
> },
> expires = {
> tv64 = 0
> }
> },
> _softexpires = {
> tv64 = 0
> },
> function = 0xffffffff810adcb0 <sched_cfs_slack_timer>,
> base = 0xffff881fff40d1a0,
> state = 0,
> start_pid = -1,
> start_site = 0x0,
> start_comm =
> "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
> },
> throttled_cfs_rq = {
> next = 0xffffffff81ebdf98 <root_task_group+536>,
> prev = 0xffffffff81ebdf98 <root_task_group+536>
> },
> nr_periods = 0,
> nr_throttled = 0,
> throttled_time = 0
> }
> }
> ==========================================================
>
> I have the following for the `rq` if it helps (when it crashed)
>
> ==========================================================
> crash> rq.clock ffff883ffedf30c0
> clock = 146055357448781
> crash> rq.clock_task ffff883ffedf30c0
> clock_task = 14605535744878
> ==========================================================
>
> Sorry for the long email, wanted to provide as much as I can
>
> Thanks,
> Mohammed
>
> On Tue, May 26, 2015 at 1:38 PM, <bsegall@xxxxxxxxxx> wrote:
>> Kirill Tkhai <tkhai@xxxxxxxxx> writes:
>>
>>> Hi,
>>>
>>> 25.05.2015, 23:53, "Mohammed Naser" <mnaser@xxxxxxxxxxxx>:
>>>> Hi Peter,
>>>>
>>>> (resending as plain text, my bad)
>>>>
>>>> Thank you for reply.
>>>>
>>>> Would you have any ideas on why this would have occurred or other
>>>> steps to look at? It's my first time attempting to help fix a problem
>>>> like this.
>>>>
>>>> I have a crashdump of the kernel since this issue repeated itself a
>>>> few times on a loaded KVM host (it's 12GB however). I can also
>>>> provide the values of cfs_rq before the kernel crash.
>>>>
>>>> ===================================
>>>> [146055.357476] BUG: unable to handle kernel NULL pointer dereference
>>>> at 0000000000000038
>>>> [146055.359620] IP: [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
>>>> [146055.361890] PGD 0
>>>> [146055.364131] Oops: 0000 [#1] SMP
>>>> [146055.366475] Modules linked in: vhost_net vhost macvtap macvlan
>>>> act_police cls_u32 sch_ingress ipmi_si xt_multiport nf_conntrack_ipv6
>>>> nf_defrag_ipv6 xt_mac xt_physdev xt_set iptable_raw ip_set_hash_ip
>>>> ip_set nfnetlink mpt3sas mpt2sas raid_class scsi_transport_sas mptctl
>>>> mptbase veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat
>>>> nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
>>>> nf_conntrack ipt_REJECT xt_tcpudp dell_rbu bridge stp llc
>>>> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
>>>> ip_tables x_tables nbd openvswitch gre vxlan libcrc32c ib_iser rdma_cm
>>>> iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp
>>>> libiscsi scsi_transport_iscsi ipmi_devintf intel_rapl
>>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm
>>>> crct10dif_pclmul crc32_pclmul
>>>> [146055.388889] ghash_clmulni_intel aesni_intel aes_x86_64 lrw
>>>> gf128mul dm_multipath glue_helper ablk_helper scsi_dh cryptd mei_me
>>>> mei lpc_ich ipmi_msghandler shpchp wmi acpi_power_meter mac_hid lp
>>>> parport nls_iso8859_1 igb ixgbe i2c_algo_bit dca ptp ahci pps_core
>>>> megaraid_sas libahci mdio [last unloaded: ipmi_si]
>>>> [146055.404208] CPU: 31 PID: 67922 Comm: qemu-system-x86 Not tainted
>>>> 3.16.0-37-generic #51~14.04.1-Ubuntu
>>>> [146055.409906] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS
>>>> 1.0.4 08/28/2014
>>>> [146055.415754] task: ffff883fcab69e90 ti: ffff883a1c168000 task.ti:
>>>> ffff883a1c168000
>>>> [146055.421817] RIP: 0010:[<ffffffff810a7d31>] [<ffffffff810a7d31>]
>>>> set_next_entity+0x11/0xb0
>>>> [146055.428079] RSP: 0018:ffff883a1c16bce8 EFLAGS: 00010092
>>>> [146055.434377] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>>> 00000000044aa200
>>>> [146055.440913] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>>>> ffff883ffedf3140
>>>> [146055.447474] RBP: ffff883a1c16bd00 R08: 0000000000000000 R09:
>>>> 0000000000000001
>>>> [146055.454181] R10: 0000000000000004 R11: 0000000000000206 R12:
>>>> ffff883ffedf3140
>>>> [146055.460968] R13: 000000000000001f R14: 0000000000000001 R15:
>>>> ffff883ffedf30c0
>>>> [146055.467722] FS: 00007f404919d700(0000) GS:ffff883ffede0000(0000)
>>>> knlGS:ffff880002380000
>>>> [146055.474756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [146055.481830] CR2: 0000000000000038 CR3: 0000003a1c45b000 CR4:
>>>> 00000000001427e0
>>>> [146055.489134] Stack:
>>>> [146055.496412] 0000000000000000 ffff883ffedf3140 000000000000001f
>>>> ffff883a1c16bd68
>>>> [146055.504053] ffffffff810af2f8 ffff883ffedf3140 00000000000130c0
>>>> ffff883fcab69e90
>>>> [146055.511786] ffffffff8101c3b9 ffff883a1c16bd50 ffffffff810a4895
>>>> ffff883fcab6a3c8
>>>> [146055.519551] Call Trace:
>>>> [146055.527330] [<ffffffff810af2f8>] pick_next_task_fair+0x78/0x880
>>>> [146055.535292] [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
>>>> [146055.543379] [<ffffffff810a4895>] ? sched_clock_cpu+0x85/0xc0
>>>> [146055.551519] [<ffffffff81768afb>] __schedule+0x11b/0x7a0
>>>> [146055.559722] [<ffffffff81769579>] _cond_resched+0x29/0x40
>>>> [146055.568020] [<ffffffffc0338289>] kvm_arch_vcpu_ioctl_run+0x3e9/0x460 [kvm]
>>>> [146055.576509] [<ffffffffc0321ce2>] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
>>>> [146055.585045] [<ffffffff81156952>] ? perf_event_context_sched_in+0xa2/0xc0
>>>> [146055.593771] [<ffffffff811e7250>] do_vfs_ioctl+0x2e0/0x4c0
>>>> [146055.602531] [<ffffffff8109dec8>] ? finish_task_switch+0x108/0x180
>>>> [146055.611413] [<ffffffffc032bcd4>] ? kvm_on_user_return+0x74/0x80 [kvm]
>>>> [146055.620339] [<ffffffff811e74b1>] SyS_ioctl+0x81/0xa0
>>>> [146055.629396] [<ffffffff8176d20d>] system_call_fastpath+0x1a/0x1f
>>>> [146055.638500] Code: 83 c4 10 4c 89 f2 4c 89 ee ff d0 49 8b 04 24 48
>>>> 85 c0 75 e6 eb 99 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54
>>>> 49 89 fc 53 <8b> 46 38 48 89 f3 85 c0 75 5d 49 8b 84 24 b0 00 00 00 48
>>>> 8b 80
>>>> [146055.657833] RIP [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
>>>> [146055.667524] RSP <ffff883a1c16bce8>
>>>> [146055.677082] CR2: 0000000000000038
>>>> ===================================
>>>
>>> This looks like https://lkml.org/lkml/2015/4/3/231
>>
>>
>> If you're using cfs_bandwidth, it's possible (though I can't tell just
>> from that). If so and you can reproduce this, could you test
>> https://lkml.org/lkml/2015/4/6/396 (assuming it still applies at all
>> anyway)?
>>
>> If you want to provide a bunch of debugging info, rq->clock,
>> rq->clock_task, *cfs_rq and *se for all rqs and cgroups would be ideal,
>> but getting all that is kinda annoying. The most basic info would be
>> *cfs_rq of the crashing cpu and *cfs_rq->tg.
>
>
>
> --
> Mohammed Naser â vexxhost
> -----------------------------------------------------
> D. 514-316-8872
> D. 800-910-1726 ext. 200
> E. mnaser@xxxxxxxxxxxx
> W. http://vexxhost.com



--
Mohammed Naser â vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@xxxxxxxxxxxx
W. http://vexxhost.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/