Re: [PATCH/RFC] Re: recvmmsg() timeout behavior strangeness [RESEND]

From: Arnaldo Carvalho de Melo
Date: Wed May 28 2014 - 11:07:50 EST


Em Wed, May 28, 2014 at 02:20:10PM +0200, Michael Kerrisk (man-pages) escreveu:
> On 05/27/2014 10:30 PM, Arnaldo Carvalho de Melo wrote:
> > attached goes the updated patch, and this is the
> > diff to the last combined one:
> >
> > diff --git a/net/socket.c b/net/socket.c
> > index 310a50971769..379be43879db 100644
> > --- a/net/socket.c
> > +++ b/net/socket.c
> > @@ -2478,8 +2478,7 @@ SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
> >
> > datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys);
> >
> > - if (datagrams > 0 &&
> > - copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
> > + if (copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
> > datagrams = -EFAULT;
> >
> > return datagrams;
> >
> > ------------------------------------------
> >
> > This is a quick thing just to show where the problem lies, need to think
> > how to report an -EFAULT at this point properly, i.e. look at

Ok, so I can live with the way things were before this fix, i.e. if the
user specifies a timeout, then if it fails when copying to remaining
time to userspace (copy_to_user call above), then we return -EFAULT.

I.e. there would be no change in behaviour, but then perhaps we should
go with the interface that is in place when we received some datagrams
and then some error happens, see comment in the existing code, below:

> > __sys_recvmmsg for something related (returning the number of
> > successfully copied datagrams to userspace while storing the error for
> > subsequent reporting):
> >
> > if (err == 0)
> > return datagrams;
> >
> > if (datagrams != 0) {
> > /*
> > * We may return less entries than requested (vlen) if
> > * the sock is non block and there aren't enough
> > * datagrams...
> > */
> > if (err != -EAGAIN) {
> > /*
> > * ... or if recvmsg returns an error after we
> > * received some datagrams, where we record the
> > * error to return on the next call or if the
> > * app asks about it using getsockopt(SO_ERROR).
> > */
> > sock->sk->sk_err = -err;
> > }
> >
> > return datagrams;
> > }
> >
> > I.e. userspace would have to use getsockopt(SO_ERROR)... need to think
> > more about it, sidetracked now, will be back to this.
> >
> > Anyway, attached goes the current combined patch.

> So, I applied against net-next as you suggested offlist.
> Builds and generally tests fine. Some observations:

> * In the case that the call is interrupted by a signal handler and no
> datagrams have been received, the call fails with EINTR, as expected.

Ok

> * The call always updates 'timeout', both in the success case and in the
> EINTR case. (That seems fine.)

Agreed that it is how it should behave.

> But, another question...
>
> In the case that the call is interrupted by a signal handler and some
> datagrams have already been received, then the call succeeds, and
> returns the number of datagrams received, and 'timeout' is updated with
> the remaining time. Maybe that's the right behavior, but I just want to

Note that what the comment in the existing code says should apply here,
namely that the next recv (m or mmsg) syscall on this socket will return
what is in sock->sk->sk_err, that is the signal:

sys_recvmmsg()
sock->ops->recvmsg() (e.g. inet_recvmsg)
sk->prot->recvmsg() (e.g., udp_recvmsg)
skb_recv_datagram()
wait_for_more_packets()
sock_intr_errno()
*err = -EINTR
sock->sk->sk_err = err

Next recv will end up calling skb_recv_datagram and that does:

struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
int *peeked, int *off, int *err, long *timeop)
{
struct sk_buff *skb, *last;
long timeo;
/*
* Caller is allowed not to check sk->sk_err before
* skb_recv_datagram()
*/
int error = sock_error(sk);

if (error)
goto no_packet;
<SNIP>
out:
if (timeop)
*timeop = timeo;
return NULL;

no_packet:
*err = error;
goto out;
}

So, yes, the user _can_ process the packets already copied to userspace,
i.e. no packet loss, and then, on the next call, will receive the signal
notification.

So, the user can just try the next call and see the signal, and it is
also possible to notice that the timeout didn't expire and less than
vlen packets were received, so something went wrong and calling
getsockopt(SO_ERROR) will clarify things.

This is not some new error reporting facility, it predates recvmmsg,
that merely uses it for consistency.

How to properly report the -EFAULT when copying the remaining timeout to
userspace is the special case here, with my patches it will:

. copy n (less than vlen) packets to userspace successfully
. return -EFAULT, not n, just as before the patches being cooked now.

> check. There is at least one other possibility:
>
> * Fetch no datagrams (i.e., the datagrams are left to receive in a
> future call), and the call fails with EINTR, and 'timeout' is updated.

Humm, then we would have to go back to the protocol layer and re-add the
packets to queues, etc, etc, not feasible, I'd say, too much state was
lost already, we would have to have some sort of commit interface
(perhaps using peek, but that gets crazy quickly with multithreaded
apps), guess its a super intrusive path to follow, not worth it, I
think.

> Maybe that possibility is hard to implement (not sure). But my main point
> is to make the current behavior clear, note the alternative, and ask:
> is the current behavior the best choice. (I'm not saying it's not, but I
> do want the choice to be a conscious one.)

Well, my main pet peeve here is how to report that we managed to copy
the datagrams but failed to copy the remaining timeout and then don't
report how many datagrams were successfully copied.

I'm inclined to say that failing to copy the timeout is something so
unlikely, even more since we managed to copy it from userspace to kernel
space at that point in time, that we should keep the current behaviour
and report that something terribly wrong happened, i.e. -EFAULT when
copying the timeout.

Thanks a lot for testing all this, was a pity you were not around when
we first designed and implemented this syscall.

And also it would be really nice if the people in the CC list commented
on this last round of discussion about fixing the timeout behavior.

- Arnaldo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/