Re: 3.17-rc1 oops during network interface configuration

From: Or Gerlitz
Date: Wed Sep 10 2014 - 03:48:33 EST



On 9/9/2014 10:30 PM, Chuck Lever wrote:
This crash happens when booting v3.17-rcN on any of my IB-enabled
systems. I have both ConnectX-2 and mthca systems, all are affected.

I bisected this to:

commit e0f31d8498676fda36289603a054d0d490aa2679
Author: Govindarajulu Varadarajan <_govind@xxxxxxx>
AuthorDate: Mon Jun 23 16:07:58 2014 +0530
Commit: David S. Miller <davem@xxxxxxxxxxxxx>
CommitDate: Mon Jun 23 14:32:19 2014 -0700

flow_keys: Record IP layer protocol in skb_flow_dissect()

skb_flow_dissect() dissects only transport header type in ip_proto. It dose not
give any information about IPv4 or IPv6.
This patch adds new member, n_proto, to struct flow_keys. Which records the
IP layer type. i.e IPv4 or IPv6.
This can be used in netdev->ndo_rx_flow_steer driver function to dissect flow.
Adding new member to flow_keys increases the struct size by around 4 bytes.
This causes BUILD_BUG_ON(sizeof(qcb->data) < sz); to fail in
qdisc_cb_private_validate()
So increase data size by 4

Signed-off-by: Govindarajulu Varadarajan <_govind@xxxxxxx>
Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>


This commit includes a hunk that increases the size of struct qdisc_skb_cb
by at least 4 bytes:

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 624f985..a3cfb8e 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -231,7 +231,7 @@ struct qdisc_skb_cb {
unsigned int pkt_len;
u16 slave_dev_queue_mapping;
u16 _pad;
- unsigned char data[20];
+ unsigned char data[24];
};
static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)


IPoIB defines the following structure in drivers/infiniband/ulp/ipoib/ipoib.h:

struct ipoib_cb {
struct qdisc_skb_cb qdisc_cb;
u8 hwaddr[INFINIBAND_ALEN];
};
IPoIB keeps this in the sk_buff:cb field, which is exactly 48 bytes.
After commit e0f31d84, the size of struct ipoib_cb on x86_64 becomes
52 bytes.

Thus IPoIB overruns sk_buff:cb, and trashes the sk_buff::_skb_refdst
field, which contains a pointer. By the time we get into
__dev_queue_xmit() and try to use the result of skb_dst(), that pointer
is garbage, and we oops.

Obviously, cb[] could be increased to 56 bytes to accommodate struct
ipoib_cb. I tried this, and it is effective in preventing the oops on
one of my systems.

But I suspect there is an historical reason I’m not aware of that it
has remained 48 bytes for years.

Hi Chuck, thanks for bisecting this out. Indeed, as of this kernel 3.2 commit 936d7de "IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses" we are using the skb->cb field to enable proper work under GRO and avoid another historical quirk we had there... so I think we can definetly consider commit e0f31d849 to introduce a severe regression... Govindarajulu, Dave - what's your thinking here? any quick idea on how to fix?

Also, I was thinking we have the mechanics in the kernel, e.g commit a0417fa3a18a ("net: Make qdisc_skb_cb upper size bound explicit.") to catch such over-flows?

Or.

BUG: unable to handle kernel paging request at ffff88090000007e
IP: __dev_queue_xmit+0x519
Call Trace:
? __dev_queue_xmit+0x49
dev_queue_xmit+0x10
neigh_connected_output
? ip_finish_output
ip_finish_output
? ip_finish_output
? netif_rx_ni
ip_mc_output
ip_local_out_sk
ip_send_skb
udp_send_skb
udp_sendmsg
? ip_reply_glue_bits
? __lock_is_held
inet_sendmsg
? inet_sendmsg
sock_sendmsg
? might_fault
? might_fault
? move_addr_to_kernel.part.38
SYSC_sendto
? sysret_check
? trace_hardirqs_on_caller
? trace_hardirqs_on_thunk
SyS_sendto
system_call_fastpath

Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
drm_kms_helper: panic occurred, switching back to text console

A screenshot of this kernel oops can be found here:
https://drive.google.com/file/d/0B1YQOreL3_FxVDB5UTNwekF6LVU/

gdb translates the crash address into the following (not sure this makes sense
since offset 0x519 is past the end of __dev_queue_xmit()):

(gdb) list *(__dev_queue_xmit+0x519)
0xffffffff8136bc89 is in netdev_adjacent_rename_links (net/core/dev.c:5167).
5162 void netdev_adjacent_rename_links(struct net_device *dev, char *oldname)
5163 {
5164 struct netdev_adjacent *iter;
5165
5166 list_for_each_entry(iter, &dev->adj_list.upper, list) {
5167 netdev_adjacent_sysfs_del(iter->dev, oldname,
5168 &iter->dev->adj_list.lower);
5169 netdev_adjacent_sysfs_add(iter->dev, dev,
5170 &iter->dev->adj_list.lower);
5171 }

And the address __dev_queue_xmit+0x49 is translated by gdb into:

(gdb) list *(__dev_queue_xmit+0x49)
0xffffffff8136b7b9 is in __dev_queue_xmit (./arch/x86/include/asm/preempt.h:75).
70 * The various preempt_count add/sub methods
71 */
72
73 static __always_inline void __preempt_count_add(int val)
74 {
75 raw_cpu_add_4(__preempt_count, val);
76 }
77
78 static __always_inline void __preempt_count_sub(int val)
79 {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/