[PATCH net 00/19] rxrpc: Fix race between call connection, data transmit and call disconnect
From: David Howells
Date: Sat Jan 07 2023 - 04:54:24 EST
Here are patches to fix an oops[1] caused by a race between call
connection, initial packet transmission and call disconnection which
results in something like:
kernel BUG at net/rxrpc/peer_object.c:413!
when the syzbot test is run. The problem is that the connection procedure
is effectively split across two threads and can get expanded by taking an
interrupt, thereby adding the call to the peer error distribution list
*after* it has been disconnected (say by the rxrpc socket shutting down).
The easiest solution is to look at the fourth set of I/O thread
conversion/SACK table expansion patches that didn't get applied[2] and take
from it those patches that move call connection and disconnection into the
I/O thread. Moving these things into the I/O thread means that the
sequencing is managed by all being done in the same thread - and the race
can no longer happen.
This is preferable to introducing an extra lock as adding an extra lock
would make the I/O thread have to wait for the app thread in yet another
place.
The changes can be considered as a number of logical parts:
(1) Move all of the call state changes into the I/O thread.
(2) Make client connection ID space per-local endpoint so that the I/O
thread doesn't need locks to access it.
(3) Move actual abort generation into the I/O thread and clean it up. If
sendmsg or recvmsg want to cause an abort, they have to delegate it.
(4) Offload the setting up of the security context on a connection to the
thread of one of the apps that's starting a call. We don't want to be
doing any sort of crypto in the I/O thread.
(5) Connect calls (ie. assign them to channel slots on connections) in the
I/O thread. Calls are set up by sendmsg/kafs and passed to the I/O
thread to connect. Connections are allocated in the I/O thread after
this.
(6) Disconnect calls in the I/O thread.
I've also added a patch for an unrelated bug that cropped up during
testing, whereby a race can occur between an incoming call and socket
shutdown.
Note that whilst this fixes the original syzbot bug, another bug may get
triggered if this one is fixed:
INFO: rcu detected stall in corrupted
rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P5792 } 2657 jiffies s: 2825 root: 0x0/T
rcu: blocking rcu_node structures (internal RCU debug):
It doesn't look this should be anything to do with rxrpc, though, as I've
tested an additional patch[3] that removes practically all the RCU usage
from rxrpc and it still occurs. It seems likely that it is being caused by
something in the tunnelling setup that the syzbot test does, but there's
not enough info to go on. It also seems unlikely to be anything to do with
the afs driver as the test doesn't use that.
The patches are tagged here:
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-fixes-20230107
and can also be found on the following branch:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-fixes
David
Link: https://syzkaller.appspot.com/bug?extid=c22650d2844392afdcfd [1]
Link: https://lore.kernel.org/r/167034231605.1105287.1693064952174322878.stgit@xxxxxxxxxxxxxxxxxxxxxx/ [2]
Link: https://lore.kernel.org/r/1278570.1673042093@xxxxxxxxxxxxxxxxxxxxxx/ [3]
---
David Howells (19):
rxrpc: Stash the network namespace pointer in rxrpc_local
rxrpc: Make the local endpoint hold a ref on a connected call
rxrpc: Separate call retransmission from other conn events
rxrpc: Only set/transmit aborts in the I/O thread
rxrpc: Only disconnect calls in the I/O thread
rxrpc: Implement a mechanism to send an event notification to a connection
rxrpc: Clean up connection abort
rxrpc: Tidy up abort generation infrastructure
rxrpc: Make the set of connection IDs per local endpoint
rxrpc: Offload the completion of service conn security to the I/O thread
rxrpc: Set up a connection bundle from a call, not rxrpc_conn_parameters
rxrpc: Split out the call state changing functions into their own file
rxrpc: Wrap accesses to get call state to put the barrier in one place
rxrpc: Move call state changes from sendmsg to I/O thread
rxrpc: Move call state changes from recvmsg to I/O thread
rxrpc: Remove call->state_lock
rxrpc: Move the client conn cache management to the I/O thread
rxrpc: Move client call connection to the I/O thread
rxrpc: Fix incoming call setup race
Documentation/networking/rxrpc.rst | 4 +-
fs/afs/cmservice.c | 6 +-
fs/afs/rxrpc.c | 24 +-
include/net/af_rxrpc.h | 3 +-
include/trace/events/rxrpc.h | 160 +++++--
net/rxrpc/Makefile | 1 +
net/rxrpc/af_rxrpc.c | 27 +-
net/rxrpc/ar-internal.h | 212 ++++++---
net/rxrpc/call_accept.c | 57 ++-
net/rxrpc/call_event.c | 86 +++-
net/rxrpc/call_object.c | 116 +++--
net/rxrpc/call_state.c | 69 +++
net/rxrpc/conn_client.c | 709 ++++++++---------------------
net/rxrpc/conn_event.c | 382 ++++++----------
net/rxrpc/conn_object.c | 67 ++-
net/rxrpc/conn_service.c | 1 -
net/rxrpc/input.c | 175 +++----
net/rxrpc/insecure.c | 20 +-
net/rxrpc/io_thread.c | 204 +++++----
net/rxrpc/local_object.c | 35 +-
net/rxrpc/net_ns.c | 17 -
net/rxrpc/output.c | 60 ++-
net/rxrpc/peer_object.c | 23 +-
net/rxrpc/proc.c | 17 +-
net/rxrpc/recvmsg.c | 256 +++--------
net/rxrpc/rxkad.c | 356 ++++++---------
net/rxrpc/rxperf.c | 17 +-
net/rxrpc/security.c | 53 +--
net/rxrpc/sendmsg.c | 195 ++++----
29 files changed, 1592 insertions(+), 1760 deletions(-)
create mode 100644 net/rxrpc/call_state.c