Re: [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()

From: Stefan Metzmacher
Date: Tue Apr 01 2025 - 17:21:32 EST


On 01.04.25 at 17:45, Stanislav Fomichev wrote:
On 04/01, Breno Leitao wrote:
On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
On 01.04.25 at 15:37, Stefan Metzmacher wrote:
On 01.04.25 at 10:19, Stefan Metzmacher wrote:
On 31.03.25 at 23:04, Stanislav Fomichev wrote:
On 03/31, Stefan Metzmacher wrote:
The motivation for this is to remove the SOL_SOCKET limitation
from io_uring_cmd_getsockopt().

The reason for this limitation is that io_uring_cmd_getsockopt()
passes a kernel pointer as optlen to do_sock_getsockopt()
and can't reach the ops->getsockopt() path.

The first idea would be to change the optval and optlen arguments
to the protocol specific hooks also to sockptr_t, as that
is already used for setsockopt() and also by do_sock_getsockopt()
sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().

But as Linus doesn't like 'sockptr_t', I used a different approach.

@Linus, would that optlen_t approach fit better for you?

[..]

Instead of passing the optlen as user or kernel pointer,
we only ever pass a kernel pointer and do the
translation from/to userspace in do_sock_getsockopt().
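
A minimal user-space sketch of that idea, not the actual patch: all
sketch_* names are hypothetical, the get_user()/put_user() stand-ins are
plain copies, and leaving the length untouched on error is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Stand-ins for get_user()/put_user(). In this user-space sketch they are
 * plain copies; in the kernel they would cross the user/kernel boundary. */
static int sketch_get_user(int *dst, const int *usrc)
{
	memcpy(dst, usrc, sizeof(*dst));
	return 0;
}

static int sketch_put_user(int val, int *udst)
{
	memcpy(udst, &val, sizeof(val));
	return 0;
}

/* Hypothetical protocol hook: the length it sees is always kernel memory. */
typedef int (*sketch_getsockopt_fn)(int optname, void *optval, int *koptlen);

static int sketch_proto_getsockopt(int optname, void *optval, int *koptlen)
{
	int value = 42;	/* made-up option value */

	(void)optname;
	if (*koptlen < (int)sizeof(value))
		return -EINVAL;
	memcpy(optval, &value, sizeof(value));
	*koptlen = sizeof(value);
	return 0;
}

/* The wrapper translates optlen once on the way in and once on the way
 * out, so the hook never needs to know where the length came from. */
static int sketch_do_sock_getsockopt(sketch_getsockopt_fn hook, int optname,
				     void *optval, int *optlen,
				     int optlen_is_kernel)
{
	int klen, ret;

	assert(hook && optlen);
	if (optlen_is_kernel)
		klen = *optlen;
	else if (sketch_get_user(&klen, optlen))
		return -EFAULT;

	ret = hook(optname, optval, &klen);
	if (ret < 0)
		return ret;	/* assumption: length is left untouched on error */

	if (optlen_is_kernel)
		*optlen = klen;
	else if (sketch_put_user(klen, optlen))
		return -EFAULT;
	return 0;
}
```

With this shape the io_uring caller would pass optlen_is_kernel = 1 and the
syscall path 0, while the protocol hook stays identical for both.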

At this point why not just fully embrace iov_iter? You have the size
now + the user (or kernel) pointer. Might as well do
s/sockptr_t/iov_iter/ conversion?

I think that would only be possible if we introduce
proto[_ops].getsockopt_iter() and then convert the implementations
step by step. Doing it all in one go has a lot of potential to break
the uapi. I could try to convert things like socket, ip and tcp myself, but
the rest needs to be converted by the maintainers of the specific protocols,
as it needs to be tested. There are crazy things happening in the existing
implementations, e.g. some getsockopt() implementations use optval as both an
input and an output buffer.

I first tried to convert both optval and optlen of getsockopt to sockptr_t,
and that showed that touching the optval part starts to get complex very soon,
see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
(note it didn't convert everything; I gave up after hitting
sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
more also do both copy_from_user and copy_to_user on optval)

I also came across one implementation that returned -ERANGE because *optlen was
too short and put the required length into *optlen, which means the returned
*optlen is larger than the optval buffer given from userspace.
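
To make the -ERANGE pattern concrete, here is a small user-space sketch.
It is not the implementation referred to above; the function name and the
value are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical handler mimicking the pattern: when the buffer is too
 * short, it reports the required size through *optlen. */
static int sketch_getsockopt_erange(void *optval, int *optlen)
{
	static const char value[] = "sixteen-byte-val";	/* 17 bytes with NUL */
	int need = (int)sizeof(value);

	assert(optval && optlen);
	if (*optlen < need) {
		*optlen = need;	/* now larger than the caller's buffer */
		return -ERANGE;
	}
	memcpy(optval, value, need);
	*optlen = need;
	return 0;
}
```

A caller passing a 4-byte buffer gets -ERANGE back with *optlen set to 17,
so the returned length no longer describes the buffer. An interface that
derives the length from a bounded iterator has no obvious place for that.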

Because of all these strange things I tried to make a minimal change
to get rid of the io_uring limitation: I only converted optlen and
left optval as is, so that the patchset has a low risk of causing
regressions.

But as an alternative, we could introduce a prototype like this:

         int (*getsockopt_iter)(struct socket *sock, int level, int optname,
                                struct iov_iter *optval_iter);

It returns a non-negative value, which can be placed into *optlen,
or a negative value as an error, in which case *optlen is left unchanged.
optval_iter will get direction ITER_DEST, so it can only be written to.

Implementations could then opt in to the new interface and
allow do_sock_getsockopt() to work for the io_uring case as well,
while all others would still get -EOPNOTSUPP.
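
A rough user-space sketch of how such an opt-in dispatch could look on the
io_uring-style path. The sketch_* types are hypothetical stand-ins; in
particular struct sketch_dest_iter only imitates a write-only iov_iter and
is not the kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Minimal write-only stand-in for an ITER_DEST iov_iter (hypothetical). */
struct sketch_dest_iter {
	char *buf;
	size_t remaining;
};

/* Copy into the destination, truncating to the space left.
 * Returns the number of bytes copied. */
static size_t sketch_copy_to_iter(const void *src, size_t len,
				  struct sketch_dest_iter *it)
{
	size_t n = len < it->remaining ? len : it->remaining;

	memcpy(it->buf, src, n);
	it->buf += n;
	it->remaining -= n;
	return n;
}

struct sketch_proto_ops {
	/* Optional: NULL means the protocol has not opted in. */
	int (*getsockopt_iter)(int level, int optname,
			       struct sketch_dest_iter *optval_iter);
};

/* Example opt-in handler returning a made-up int value. */
static int sketch_int_getsockopt_iter(int level, int optname,
				      struct sketch_dest_iter *optval_iter)
{
	int value = 1500;

	(void)level;
	(void)optname;
	if (optval_iter->remaining < sizeof(value))
		return -EINVAL;
	return (int)sketch_copy_to_iter(&value, sizeof(value), optval_iter);
}

/* Dispatcher: a non-negative result becomes *optlen, errors leave it alone. */
static int sketch_getsockopt_dispatch(const struct sketch_proto_ops *ops,
				      int level, int optname,
				      void *optval, int *optlen)
{
	struct sketch_dest_iter it;
	int ret;

	assert(ops && optlen);
	if (!ops->getsockopt_iter)
		return -EOPNOTSUPP;
	if (*optlen < 0)
		return -EINVAL;

	it.buf = optval;
	it.remaining = (size_t)*optlen;
	ret = ops->getsockopt_iter(level, optname, &it);
	if (ret < 0)
		return ret;
	*optlen = ret;
	return 0;
}
```

A protocol that leaves getsockopt_iter unset simply gets -EOPNOTSUPP here,
which matches the step-by-step conversion idea.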

So what should be the way to go?

Ok, I've added the infrastructure for getsockopt_iter, see below,
but the first part I wanted to convert was
tcp_ao_copy_mkts_to_user() and that also reads from userspace before
writing.

So we could go with the optlen_t approach, or we need
logic for ITER_BOTH, or we pass two iov_iters, one with ITER_SRC and one
with ITER_DEST...

So who wants to decide?

I just noticed that it's even possible in some cases
to pass a short buffer as optval but a larger value in optlen:
hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.

This makes it really hard to believe that trying to use iov_iter for this
is a good idea :-(

That was my finding as well a while ago, when I was planning to get the
__user pointers converted to iov_iter. There are some weird ways of
using optlen and optval, which makes them non-trivial to convert to
iov_iter.

Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
of useful socket opts. See if there are any obvious problems with them
and if not, try converting. The rest we can cover separately when/if
needed.

That's what I tried, but it fails with
tcp_getsockopt ->
   do_tcp_getsockopt ->
     tcp_ao_get_mkts ->
        tcp_ao_copy_mkts_to_user ->
           copy_struct_from_sockptr
     tcp_ao_get_sock_info ->
        copy_struct_from_sockptr

That's not possible with an ITER_DEST iov_iter.
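
To illustrate the problem, a user-space sketch under stated assumptions:
the sketch_* iterator is a toy model and not the kernel's iov_iter,
SKETCH_ITER_BOTH is hypothetical, and returning -EINVAL for a read from a
destination-only iterator is my choice for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum {
	SKETCH_ITER_SRC  = 1,	/* may be read from */
	SKETCH_ITER_DEST = 2,	/* may be written to */
	SKETCH_ITER_BOTH = SKETCH_ITER_SRC | SKETCH_ITER_DEST,
};

struct sketch_iter {
	int dir;
	void *buf;
	size_t len;
};

/* Reading is only allowed when the iterator carries the SRC direction. */
static int sketch_copy_from_iter(void *dst, size_t n,
				 const struct sketch_iter *it)
{
	if (!(it->dir & SKETCH_ITER_SRC) || n > it->len)
		return -EINVAL;
	memcpy(dst, it->buf, n);
	return 0;
}

/* Writing is only allowed when the iterator carries the DEST direction. */
static int sketch_copy_to_iter(const void *src, size_t n,
			       struct sketch_iter *it)
{
	if (!(it->dir & SKETCH_ITER_DEST) || n > it->len)
		return -EINVAL;
	memcpy(it->buf, src, n);
	return 0;
}

/* Hypothetical in/out option: the caller pre-fills a request in optval
 * and the handler overwrites the same buffer with the reply. */
struct sketch_req {
	int nkeys;	/* in: entries wanted, out: entries available */
};

static int sketch_inout_getsockopt(struct sketch_iter *optval)
{
	struct sketch_req req;
	int ret;

	ret = sketch_copy_from_iter(&req, sizeof(req), optval);
	if (ret < 0)
		return ret;	/* a dest-only iterator fails right here */
	req.nkeys = 3;	/* made-up reply */
	return sketch_copy_to_iter(&req, sizeof(req), optval);
}
```

With a SKETCH_ITER_DEST iterator the handler fails before it writes
anything; only the hypothetical SKETCH_ITER_BOTH variant gets through.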

metze