Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues

From: Kirill Tkhai
Date: Mon May 25 2015 - 18:01:24 EST


Hi,

25.05.2015, 23:53, "Mohammed Naser" <mnaser@xxxxxxxxxxxx>:
> Hi Peter,
>
> (resending as plain text, my bad)
>
> Thank you for reply.
>
> Would you have any ideas on why this would have occurred or other
> steps to look at?  It's my first time attempting to help fix a problem
> like this.
>
> I have a crashdump of the kernel since this issue repeated itself a
> few times on a loaded KVM host (it's 12GB however).  I can also
> provide the values of cfs_rq before the kernel crash.
>
> ===================================
> [146055.357476] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000038
> [146055.359620] IP: [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
> [146055.361890] PGD 0
> [146055.364131] Oops: 0000 [#1] SMP
> [146055.366475] Modules linked in: vhost_net vhost macvtap macvlan
> act_police cls_u32 sch_ingress ipmi_si xt_multiport nf_conntrack_ipv6
> nf_defrag_ipv6 xt_mac xt_physdev xt_set iptable_raw ip_set_hash_ip
> ip_set nfnetlink mpt3sas mpt2sas raid_class scsi_transport_sas mptctl
> mptbase veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat
> nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
> nf_conntrack ipt_REJECT xt_tcpudp dell_rbu bridge stp llc
> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
> ip_tables x_tables nbd openvswitch gre vxlan libcrc32c ib_iser rdma_cm
> iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp
> libiscsi scsi_transport_iscsi ipmi_devintf intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm
> crct10dif_pclmul crc32_pclmul
> [146055.388889]  ghash_clmulni_intel aesni_intel aes_x86_64 lrw
> gf128mul dm_multipath glue_helper ablk_helper scsi_dh cryptd mei_me
> mei lpc_ich ipmi_msghandler shpchp wmi acpi_power_meter mac_hid lp
> parport nls_iso8859_1 igb ixgbe i2c_algo_bit dca ptp ahci pps_core
> megaraid_sas libahci mdio [last unloaded: ipmi_si]
> [146055.404208] CPU: 31 PID: 67922 Comm: qemu-system-x86 Not tainted
> 3.16.0-37-generic #51~14.04.1-Ubuntu
> [146055.409906] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS
> 1.0.4 08/28/2014
> [146055.415754] task: ffff883fcab69e90 ti: ffff883a1c168000 task.ti:
> ffff883a1c168000
> [146055.421817] RIP: 0010:[<ffffffff810a7d31>]  [<ffffffff810a7d31>]
> set_next_entity+0x11/0xb0
> [146055.428079] RSP: 0018:ffff883a1c16bce8  EFLAGS: 00010092
> [146055.434377] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> 00000000044aa200
> [146055.440913] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff883ffedf3140
> [146055.447474] RBP: ffff883a1c16bd00 R08: 0000000000000000 R09:
> 0000000000000001
> [146055.454181] R10: 0000000000000004 R11: 0000000000000206 R12:
> ffff883ffedf3140
> [146055.460968] R13: 000000000000001f R14: 0000000000000001 R15:
> ffff883ffedf30c0
> [146055.467722] FS:  00007f404919d700(0000) GS:ffff883ffede0000(0000)
> knlGS:ffff880002380000
> [146055.474756] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [146055.481830] CR2: 0000000000000038 CR3: 0000003a1c45b000 CR4:
> 00000000001427e0
> [146055.489134] Stack:
> [146055.496412]  0000000000000000 ffff883ffedf3140 000000000000001f
> ffff883a1c16bd68
> [146055.504053]  ffffffff810af2f8 ffff883ffedf3140 00000000000130c0
> ffff883fcab69e90
> [146055.511786]  ffffffff8101c3b9 ffff883a1c16bd50 ffffffff810a4895
> ffff883fcab6a3c8
> [146055.519551] Call Trace:
> [146055.527330]  [<ffffffff810af2f8>] pick_next_task_fair+0x78/0x880
> [146055.535292]  [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
> [146055.543379]  [<ffffffff810a4895>] ? sched_clock_cpu+0x85/0xc0
> [146055.551519]  [<ffffffff81768afb>] __schedule+0x11b/0x7a0
> [146055.559722]  [<ffffffff81769579>] _cond_resched+0x29/0x40
> [146055.568020]  [<ffffffffc0338289>] kvm_arch_vcpu_ioctl_run+0x3e9/0x460 [kvm]
> [146055.576509]  [<ffffffffc0321ce2>] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
> [146055.585045]  [<ffffffff81156952>] ? perf_event_context_sched_in+0xa2/0xc0
> [146055.593771]  [<ffffffff811e7250>] do_vfs_ioctl+0x2e0/0x4c0
> [146055.602531]  [<ffffffff8109dec8>] ? finish_task_switch+0x108/0x180
> [146055.611413]  [<ffffffffc032bcd4>] ? kvm_on_user_return+0x74/0x80 [kvm]
> [146055.620339]  [<ffffffff811e74b1>] SyS_ioctl+0x81/0xa0
> [146055.629396]  [<ffffffff8176d20d>] system_call_fastpath+0x1a/0x1f
> [146055.638500] Code: 83 c4 10 4c 89 f2 4c 89 ee ff d0 49 8b 04 24 48
> 85 c0 75 e6 eb 99 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54
> 49 89 fc 53 <8b> 46 38 48 89 f3 85 c0 75 5d 49 8b 84 24 b0 00 00 00 48
> 8b 80
> [146055.657833] RIP  [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
> [146055.667524]  RSP <ffff883a1c16bce8>
> [146055.677082] CR2: 0000000000000038
> ===================================

This looks like https://lkml.org/lkml/2015/4/3/231

> Any pointers are appreciated and I'll do my best to do some more
> troubleshooting, just trying to understand the codebase is a task on
> it's own
>
> Thanks Peter,
> Mohammed
>
> On Mon, May 25, 2015 at 4:49 PM, Mohammed Naser <mnaser@xxxxxxxxxxxx> wrote:
>>  Hi Peter,
>>
>>  Thank you for reply.
>>
>>  Would you have any ideas on why this would have occurred or other steps to
>>  look at?  It's my first time attempting to help fix a problem like this.
>>
>>  I have a crashdump of the kernel since this issue repeated itself a few
>>  times on a loaded KVM host (it's 12GB however).  I can also provide the
>>  values of cfs_rq before the kernel crash.
>>
>>  ===================================
>>  [146055.357476] BUG: unable to handle kernel NULL pointer dereference at
>>  0000000000000038
>>  [146055.359620] IP: [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
>>  [146055.361890] PGD 0
>>  [146055.364131] Oops: 0000 [#1] SMP
>>  [146055.366475] Modules linked in: vhost_net vhost macvtap macvlan
>>  act_police cls_u32 sch_ingress ipmi_si xt_multiport nf_conntrack_ipv6
>>  nf_defrag_ipv6 xt_mac xt_physdev xt_set iptable_raw ip_set_hash_ip ip_set
>>  nfnetlink mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase veth
>>  xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat
>>  nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
>>  xt_tcpudp dell_rbu bridge stp llc ebtable_filter ebtables ip6table_filter
>>  ip6_tables iptable_filter ip_tables x_tables nbd openvswitch gre vxlan
>>  libcrc32c ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp
>>  libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf intel_rapl
>>  x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm
>>  crct10dif_pclmul crc32_pclmul
>>  [146055.388889]  ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul
>>  dm_multipath glue_helper ablk_helper scsi_dh cryptd mei_me mei lpc_ich
>>  ipmi_msghandler shpchp wmi acpi_power_meter mac_hid lp parport nls_iso8859_1
>>  igb ixgbe i2c_algo_bit dca ptp ahci pps_core megaraid_sas libahci mdio [last
>>  unloaded: ipmi_si]
>>  [146055.404208] CPU: 31 PID: 67922 Comm: qemu-system-x86 Not tainted
>>  3.16.0-37-generic #51~14.04.1-Ubuntu
>>  [146055.409906] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.0.4
>>  08/28/2014
>>  [146055.415754] task: ffff883fcab69e90 ti: ffff883a1c168000 task.ti:
>>  ffff883a1c168000
>>  [146055.421817] RIP: 0010:[<ffffffff810a7d31>]  [<ffffffff810a7d31>]
>>  set_next_entity+0x11/0xb0
>>  [146055.428079] RSP: 0018:ffff883a1c16bce8  EFLAGS: 00010092
>>  [146055.434377] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>  00000000044aa200
>>  [146055.440913] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>>  ffff883ffedf3140
>>  [146055.447474] RBP: ffff883a1c16bd00 R08: 0000000000000000 R09:
>>  0000000000000001
>>  [146055.454181] R10: 0000000000000004 R11: 0000000000000206 R12:
>>  ffff883ffedf3140
>>  [146055.460968] R13: 000000000000001f R14: 0000000000000001 R15:
>>  ffff883ffedf30c0
>>  [146055.467722] FS:  00007f404919d700(0000) GS:ffff883ffede0000(0000)
>>  knlGS:ffff880002380000
>>  [146055.474756] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  [146055.481830] CR2: 0000000000000038 CR3: 0000003a1c45b000 CR4:
>>  00000000001427e0
>>  [146055.489134] Stack:
>>  [146055.496412]  0000000000000000 ffff883ffedf3140 000000000000001f
>>  ffff883a1c16bd68
>>  [146055.504053]  ffffffff810af2f8 ffff883ffedf3140 00000000000130c0
>>  ffff883fcab69e90
>>  [146055.511786]  ffffffff8101c3b9 ffff883a1c16bd50 ffffffff810a4895
>>  ffff883fcab6a3c8
>>  [146055.519551] Call Trace:
>>  [146055.527330]  [<ffffffff810af2f8>] pick_next_task_fair+0x78/0x880
>>  [146055.535292]  [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
>>  [146055.543379]  [<ffffffff810a4895>] ? sched_clock_cpu+0x85/0xc0
>>  [146055.551519]  [<ffffffff81768afb>] __schedule+0x11b/0x7a0
>>  [146055.559722]  [<ffffffff81769579>] _cond_resched+0x29/0x40
>>  [146055.568020]  [<ffffffffc0338289>] kvm_arch_vcpu_ioctl_run+0x3e9/0x460
>>  [kvm]
>>  [146055.576509]  [<ffffffffc0321ce2>] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
>>  [146055.585045]  [<ffffffff81156952>] ?
>>  perf_event_context_sched_in+0xa2/0xc0
>>  [146055.593771]  [<ffffffff811e7250>] do_vfs_ioctl+0x2e0/0x4c0
>>  [146055.602531]  [<ffffffff8109dec8>] ? finish_task_switch+0x108/0x180
>>  [146055.611413]  [<ffffffffc032bcd4>] ? kvm_on_user_return+0x74/0x80 [kvm]
>>  [146055.620339]  [<ffffffff811e74b1>] SyS_ioctl+0x81/0xa0
>>  [146055.629396]  [<ffffffff8176d20d>] system_call_fastpath+0x1a/0x1f
>>  [146055.638500] Code: 83 c4 10 4c 89 f2 4c 89 ee ff d0 49 8b 04 24 48 85 c0
>>  75 e6 eb 99 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53
>>  <8b> 46 38 48 89 f3 85 c0 75 5d 49 8b 84 24 b0 00 00 00 48 8b 80
>>  [146055.657833] RIP  [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
>>  [146055.667524]  RSP <ffff883a1c16bce8>
>>  [146055.677082] CR2: 0000000000000038
>>  ===================================
>>
>>  Any pointers are appreciated and I'll do my best to do some more
>>  troubleshooting, just trying to understand the codebase is a task on it's
>>  own
>>
>>  Thanks Peter,
>>  Mohammed
>>
>>  On Mon, May 25, 2015 at 4:38 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>  On Mon, 2015-05-25 at 14:00 -0400, Mohammed Naser wrote:
>>>>  Calling put_prev_task() can result in nr_running being updated
>>>>  to zero, which would then crash the system when the kernel
>>>>  attempts to pick_next_entity() with an empty queue.
>>>  Getting to pick_next_entity() with an empty queue is a bug.

Maybe, we should do global update for all classes. Something like in below patch
(Warning: it's completelly untested).

Though, one problem is update_curr() of fair class does not update the whole
hierarhy. It's for task's entity only.

kernel/sched/core.c | 1 +
kernel/sched/deadline.c | 10 ----------
kernel/sched/rt.c | 7 -------
3 files changed, 1 insertion(+), 17 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 613b61e..7717d0b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2821,6 +2821,7 @@ static void __sched __schedule(void)

if (task_on_rq_queued(prev))
update_rq_clock(rq);
+ prev->sched_class->update_curr(rq);

next = pick_next_task(rq, prev);
clear_tsk_need_resched(prev);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7a08d59..7320293 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1102,16 +1102,6 @@ struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
return RETRY_TASK;
}

- /*
- * When prev is DL, we may throttle it in put_prev_task().
- * So, we update time before we check for dl_nr_running.
- */
- if (prev->sched_class == &dl_sched_class)
- update_curr_dl(rq);
-
- if (unlikely(!dl_rq->dl_nr_running))
- return NULL;
-
put_prev_task(rq, prev);

dl_se = pick_next_dl_entity(rq, dl_rq);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8781a38..31223c0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1474,13 +1474,6 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
return RETRY_TASK;
}

- /*
- * We may dequeue prev's rt_rq in put_prev_task().
- * So, we update time before rt_nr_running check.
- */
- if (prev->sched_class == &rt_sched_class)
- update_curr_rt(rq);
-
if (!rt_rq->rt_queued)
return NULL;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/