Re: [PATCH] net: Do not break out of sk_stream_wait_memory() with TIF_NOTIFY_SIGNAL

From: Pavel Begunkov
Date: Tue Mar 19 2024 - 09:58:40 EST


On 3/19/24 10:50, Sascha Hauer wrote:
On Mon, Mar 18, 2024 at 01:19:19PM +0000, Pavel Begunkov wrote:
On 3/18/24 12:10, Sascha Hauer wrote:
On Fri, Mar 15, 2024 at 05:02:05PM +0000, Pavel Begunkov wrote:
On 3/15/24 10:01, Sascha Hauer wrote:
It can happen that a socket sends the remaining data at close() time.
With io_uring and KTLS it can happen that sk_stream_wait_memory() bails
out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the
current task. This flag has been set in io_req_normal_work_add() by
calling task_work_add().

The entire idea of task_work is to interrupt syscalls and let io_uring
do its job, otherwise it wouldn't free resources it might be holding,
and even potentially forever block the syscall.

I'm not that sure about connect / close (are they not restartable?),
but it doesn't seem to be a good idea for sk_stream_wait_memory(),
which is the normal TCP blocking send path. I'm thinking of some kinds
of cases with a local TCP socket pair, the tx queue is full as well
and the rx queue of the other end, and io_uring has to run to receive
the data.

There is another case, let's say the IO is done via io-wq
(io_uring's worker thread pool) and hits the waiting. Now the
request can't get cancelled, which is done by interrupting the
task with TIF_NOTIFY_SIGNAL. User requested request cancellations
is one thing, but we'd need to check if io_uring can ever be closed
in this case.


If interruptions are not welcome you can use different io_uring flags,
see IORING_SETUP_COOP_TASKRUN and/or IORING_SETUP_DEFER_TASKRUN.

I tried with different combinations of these flags. For example
IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN
makes the issue less likely, but nevertheless it still happens.

However, reading the documentation of these flags, they shall provide
hints to the kernel for optimizations, but it should work without these
flags, right?

That's true, and I guess there are other cases as well, like
io-wq and perhaps even a stray fput.


Maybe I'm missing something, why not restart your syscall?

The problem comes with TLS. Normally with synchronous encryption all
data on a socket is written during write(). When asynchronous
encryption comes into play, then not all data is written during write(),
but instead the remaining data is written at close() time.

Was it considered to do the final cleanup in workqueue
and only then finalising the release?

No, but I don't really understand what you mean here. Could you
elaborate?

The suggestion is instead of executing the release and that final
flush off of the context you're in, namely userspace task,
you can spin up a kernel task (they're not getting any signals)
and execute it from there.

void deferred_release_fn(struct work_struct *work)
{
do_release();
...
}

struct work_struct work;
INIT_WORK(&work, deferred_release_fn);
queue_work(system_unbound_wq, &work);


There is a catch. Even though close() is not obliged to close
the file / socket immediately, but it still not nice when you
drop the final ref but port and other bits are not released
until some time after. So, you might want to wait for that
deferred release to complete before returning to the
userspace.

I'm assuming it's fine to run it by a kernel task since
IIRC fput might delay release to it anyway, but it's better
to ask net maintainers. In theory it shouldn't need
mm,fs,etc that user task would hold.

--
Pavel Begunkov