On 16.05.2013 23:53, Peter Hurley wrote:

>>> And the tty layer can't really _prevent_ the tty driver from mishandling
>>> the port kref.
>>
>> Especially since it seemed to have worked before tty_ports got
>> introduced.
>
> Well, at the time tty_port was introduced to RFCOMM, there was nothing
> to tear-down in tty_port. Now that tty_port owns the flip buffers and
> must do proper tear-down, the problem has surfaced.
>
>> But I can't add much more to this discussion, as I'm rather a novice
>> in regard to the tty subsystem. I don't even know much about the task
>> sharing between tty, tty_port and tty_ldisc, except the stuff I found
>> out because I got hit by that bug and therefore have read some of the
>> sources.
>
> Ok. Could you paste the BUG() and steps to reproduce?
>
> I have a plan to fix it but I'd like to review what you have
> first.

As described before, it ends up with memory corruption because freed
memory is used, so if a BUG() happens, it doesn't help much. E.g. with
kernel 3.9.2 I have never seen a BUG(), just a rebooting machine
(sometimes minutes after the real bug happened).

To reproduce it, call rfcomm connect /dev/rfcommN and, once the
connection to the remote device is established, power down the remote
device and wait 20s (the timeout until a connection drop is discovered).

I expect this behaviour depends on the remote device. Does the device
close the RFCOMM session cleanly? Perhaps an out-of-range test would be
better so that the connection drops.

I have experienced this crash on Mageia 3 kernel 3.8.13 on my laptop. I
was using SLIP over an RFCOMM connection between my laptop and an ARM
board. I was using ssh sessions on my laptop over Ethernet to control
Bluetooth on the ARM board. If my KDE session dies, then the ssh sessions
die, which kills any slattach and rfcomm programs on my ARM target.

Furthermore I would suggest using commit ecbbfd4, because of the
above-mentioned problem. With that you might be lucky and see a BUG like this:
May 16 00:06:18 laptopahvpn kernel: [ 51.238969] ------------[ cut
here ]------------
May 16 00:06:18 laptopahvpn kernel: [ 51.241754] kernel BUG at
kernel/workqueue.c:609!
May 16 00:06:18 laptopahvpn kernel: [ 5.603591] error attempted to
write to tty [0x (null)] = NULL
May 16 00:06:18 laptopahvpn kernel: [ 51.244131] invalid opcode: 0000
[#1] SMP
May 16 00:06:18 laptopahvpn kernel: [ 51.249491] Modules linked in:
sch_sfq cdc_acm msr nfs lockd sunrpc rfcomm bnep iptable_nat nf_nat_ipv4
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_recent xt_conntrack
nf_conntrack iptable_filter xt_LOG xt_limit ip6table_filter ip6_tables
ipv6 btusb bluetooth snd_hda_codec_hdmi coretemp kvm_intel
snd_hda_codec_realtek arc4 kvm crc32c_intel iwldvm ghash_clmulni_intel
mac80211 aesni_intel aes_x86_64 ablk_helper cryptd samsung_laptop xts
lrw gf128mul iwlwifi microcode cfg80211 xhci_hcd rfkill snd_hda_intel
snd_hda_codec snd_hwdep snd_pcm ehci_hcd snd_page_alloc snd_timer snd
usbcore soundcore lpc_ich usb_common mfd_core joydev
May 16 00:06:18 laptopahvpn kernel: [ 51.261073] CPU 1
May 16 00:06:18 laptopahvpn kernel: [ 51.261106] Pid: 2449, comm:
rfcomm Not tainted 3.7.0-rc2-00023-gecbbfd4-dirty #208 SAMSUNG
ELECTRONICS CO., LTD. 900X3C/900X3D/900X4C/900X4D/SAMSUNG_NP1234567890
May 16 00:06:18 laptopahvpn kernel: [ 51.266958] RIP:
0010:[<ffffffff810492fe>] [<ffffffff810492fe>] get_work_gcwq+0x5e/0x60
May 16 00:06:18 laptopahvpn kernel: [ 51.270064] RSP:
0018:ffff88020f253da0 EFLAGS: 00010016
May 16 00:06:18 laptopahvpn kernel: [ 51.273155] RAX: ffffffff81931380
RBX: ffff880214fee400 RCX: 0000000000000024
May 16 00:06:18 laptopahvpn kernel: [ 51.276270] RDX: 007fffc4010a7f73
RSI: 0000000000000000 RDI: ffff880214fee400
May 16 00:06:18 laptopahvpn kernel: [ 51.279333] RBP: 0000000000000000
R08: 000000000000000a R09: 000000000000181c
May 16 00:06:18 laptopahvpn kernel: [ 51.282319] R10: 0000000000000000
R11: 000000000000181b R12: 0000000000000000
May 16 00:06:18 laptopahvpn kernel: [ 51.285286] R13: 0000000000000004
R14: ffff880210863000 R15: 0000000000000000
May 16 00:06:18 laptopahvpn kernel: [ 51.288265] FS:
00007f8bd6e94700(0000) GS:ffff88021f280000(0000) knlGS:0000000000000000
May 16 00:06:18 laptopahvpn kernel: [ 51.291283] CS: 0010 DS: 0000
ES: 0000 CR0: 0000000080050033
May 16 00:06:18 laptopahvpn kernel: [ 51.294328] CR2: 00007fc249111e60
CR3: 000000020f1d3000 CR4: 00000000001407e0
May 16 00:06:18 laptopahvpn kernel: [ 51.297415] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
May 16 00:06:18 laptopahvpn kernel: [ 51.300506] DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 16 00:06:18 laptopahvpn kernel: [ 51.303555] Process rfcomm (pid:
2449, threadinfo ffff88020f252000, task ffff880210bbee80)
May 16 00:06:18 laptopahvpn kernel: [ 51.306638] Stack:
May 16 00:06:18 laptopahvpn kernel: [ 51.309704] ffffffff8104a471
0000000000014040 0000000000000296 ffff880210863000
May 16 00:06:18 laptopahvpn kernel: [ 51.312850] 0000000000000000
0000000000000001 ffffffff81258188 0000000000000000
May 16 00:06:18 laptopahvpn kernel: [ 51.315998] ffffffff812591b4
0000000000013fc0 ffff880215278700 0000000000000000
May 16 00:06:18 laptopahvpn kernel: [ 51.319139] Call Trace:
May 16 00:06:18 laptopahvpn kernel: [ 51.322236] [<ffffffff8104a471>]
? __cancel_work_timer+0x31/0xa0
May 16 00:06:18 laptopahvpn kernel: [ 51.325398] [<ffffffff81258188>]
? tty_ldisc_halt+0x18/0x20
May 16 00:06:18 laptopahvpn kernel: [ 51.328551] [<ffffffff812591b4>]
? tty_ldisc_release+0x34/0x110
May 16 00:06:18 laptopahvpn kernel: [ 51.331719] [<ffffffff81251dbc>]
? tty_release+0x4ac/0x520
May 16 00:06:18 laptopahvpn kernel: [ 51.334873] [<ffffffff810f2161>]
? __fput+0xe1/0x230
May 16 00:06:18 laptopahvpn kernel: [ 51.338030] [<ffffffff8104d75f>]
? task_work_run+0x8f/0xd0
May 16 00:06:18 laptopahvpn kernel: [ 51.341208] [<ffffffff81002919>]
? do_notify_resume+0x69/0xc0
May 16 00:06:18 laptopahvpn kernel: [ 51.344383] [<ffffffff8104d649>]
? task_work_add+0x49/0x60
May 16 00:06:18 laptopahvpn kernel: [ 51.347578] [<ffffffff81422e1a>]
? int_signal+0x12/0x17
May 16 00:06:18 laptopahvpn kernel: [ 51.350777] Code: d5 a0 d3 85 81
c3 0f 1f 80 00 00 00 00 31 c0 66 0f 1f 44 00 00 f3 c3 66 0f 1f 44 00 00
30 c0 48 8b 00 48 8b 00 c3 83 fa 04 74 ea <0f> 0b e8 9b ff ff ff ba 05
00 00 00 48 85 c0 74 03 8b 50 04 89
May 16 00:06:18 laptopahvpn kernel: [ 51.358380] RIP
[<ffffffff810492fe>] get_work_gcwq+0x5e/0x60
May 16 00:06:18 laptopahvpn kernel: [ 51.362070] RSP <ffff88020f253da0>
May 16 00:06:18 laptopahvpn kernel: [ 51.365766] ---[ end trace
f2ccc5bea5182396 ]---

But just fixing the problem by rewriting rfcomm/tty.c, without any
explanation of the expected lifetime of tty_port, doesn't help much.
As shown, the switch to tty_port has some pitfalls, and even people with
a deeper insight into the new tty layer have fallen into them.

E.g. the fact that tty_port is self-destructing suggests that the
problem isn't in rfcomm, but in tty_release() (that's why I placed the
wrong workaround there).

So without at least some small clarification about the expected lifetime
of tty_port, it's likely someone else will fall into the same pit (which
unfortunately isn't easy to see, and a BUG() doesn't have to happen).

include/linux/tty.h says just

"The tty port has a different lifetime to the tty so must be kept apart."

As it isn't specified that tty_port has to live as long as the tty, I
would (again) conclude it could have a shorter lifetime than the tty.
Maybe someone can clarify that statement there.
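
To make the lifetime question concrete, here is a minimal userspace toy model in C. It is not kernel code and does not use the real tty_port API; every toy_* name is hypothetical, and the rule it encodes (an open tty holds its own reference on the port) is only one possible reading of what "lifetime" would have to mean, not a statement of what the tty layer guarantees. It replays the reproduction scenario: the driver drops its reference while the tty is still open, and the tty is released afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_port {
    int  refs;       /* stand-in for the port kref */
    bool buf_valid;  /* stand-in for the flip buffers owned by the port */
};

struct toy_tty {
    struct toy_port *port;
    bool pinned;     /* did the tty take its own reference at open? */
};

static void toy_port_get(struct toy_port *p) { p->refs++; }

static void toy_port_put(struct toy_port *p)
{
    assert(p->refs > 0);
    if (--p->refs == 0)
        p->buf_valid = false;   /* "destructor": buffers are gone from here on */
}

static void toy_tty_open(struct toy_tty *t, struct toy_port *p, bool pin)
{
    t->port = p;
    t->pinned = pin;
    if (pin)
        toy_port_get(p);        /* assumed rule: an open tty pins its port */
}

/* Returns true if release found the port's buffers still valid. */
static bool toy_tty_release(struct toy_tty *t)
{
    bool ok = t->port->buf_valid;  /* teardown still looks at the buffers */
    if (t->pinned)
        toy_port_put(t->port);
    t->port = NULL;
    return ok;
}

/* Driver drops its reference while the tty is open, then the tty is released. */
bool toy_disconnect_then_release(bool pin)
{
    struct toy_port port = { .refs = 1, .buf_valid = true }; /* driver's ref */
    struct toy_tty tty = { 0 };

    toy_tty_open(&tty, &port, pin);
    toy_port_put(&port);           /* remote device vanished */
    return toy_tty_release(&tty);
}
```

Calling toy_disconnect_then_release(true) models a port that outlives the tty; with false, the release step runs against buffers that are already gone, which is the analogue of the use-after-free described above.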

I assume I would be able to fix the problem in rfcomm myself if someone
offered me an explanation of the expected lifetime of tty_port and some
confirmation that the call to tty_ldisc_release() in tty_release() isn't
the real problem.

E.g. why isn't that call to tty_ldisc_release() in tty_port_destructor()
or in tty_port_destroy()? If it were there, the problem (and one
pitfall) would be gone too. struct tty_port seems to have a pointer to
the tty (even two, tty and itty), so calling tty_ldisc_release() in
tty_port_destroy() looks possible.
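
The ordering being asked about can be sketched in the same toy style. This is a userspace illustration with hypothetical toy_* names, not the kernel functions of similar name, and it ignores locking and calling context, which may well be the reason the real code is structured differently. It only shows the ordering effect: if the port's destroy step first halts the line discipline's pending work through its back-pointer to the tty, a later release finds nothing left that could touch freed buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_tty;

/* Toy model only: none of these names exist in the kernel. */
struct toy_port {
    bool buf_valid;        /* stand-in for the flip buffers */
    struct toy_tty *tty;   /* back-pointer from the port to its tty */
};

struct toy_tty {
    struct toy_port *port;
    bool work_pending;     /* buffer work the ldisc has to cancel */
    int  bad_accesses;     /* how often the work touched dead buffers */
};

/* Cancel pending work; doing so touches the port's buffers. */
static void toy_ldisc_release(struct toy_tty *t)
{
    if (!t->work_pending)
        return;
    if (t->port == NULL || !t->port->buf_valid)
        t->bad_accesses++;          /* analogue of a use-after-free */
    t->work_pending = false;
}

static void toy_port_destroy(struct toy_port *p, bool halt_first)
{
    assert(p->buf_valid);           /* catch a double destroy in the model */
    if (halt_first && p->tty != NULL)
        toy_ldisc_release(p->tty);  /* the ordering proposed in the text */
    p->buf_valid = false;
}

/* The port is destroyed before the tty is released; returns bad accesses. */
int toy_destroy_then_release(bool halt_first)
{
    struct toy_port port = { .buf_valid = true, .tty = NULL };
    struct toy_tty tty = { .port = &port, .work_pending = true,
                           .bad_accesses = 0 };
    port.tty = &tty;

    toy_port_destroy(&port, halt_first);
    toy_ldisc_release(&tty);        /* what a later release would do */
    return tty.bad_accesses;
}
```

In this model toy_destroy_then_release(true) reports no bad access and toy_destroy_then_release(false) reports one. Whether the real tty_ldisc_release() could safely be called from a port destructor is not something the sketch can answer.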