Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn

From: Linus Torvalds
Date: Fri Jan 13 2017 - 14:34:02 EST


On Fri, Jan 13, 2017 at 3:18 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> in case when pipe->nrbufs == pipe->buffers and idx == pipe->curbuf. IOW,
> we have a full pipe and want to empty it entirely; the fun question is,
> of course, telling that case from having nothing to free with the same
> full pipe...

That's just because you're looking at all the wrong fields.

Doing this:

int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);

and then comparing the current index to the unused index is just
complete garbage. Exactly because the "nrbufs % pipe->buffers"
ambiguity when nrbufs is either 0 or full.

So that comparison is actively crap, exactly *because* it has already
thrown all the actual data away.

You must *never* compare indexes in a circular buffer like this. It's
simply wrong.

Your patch might work, but it just keeps on doing the wrong thing and
then works around the fact that it does the wrong thing by adding more
disgusting hacks.

What am I missing?

Why is "pipe_advance()" written in that incomprehensible form? Why
don't we do the pipe_buf_release() as we advance through it, instead
of doing it at the end?

Also, the line

buf->len = size;

in that pipe_advance() function looks buggy. "size" is how much we
*remove* from buf->len, shouldn't we update buf->len by subtracting
size?

This function looks so broken that I must be missing something. Why
doesn't pipe_advance() just look like the following:

static void pipe_advance(struct iov_iter *i, size_t size)
{
struct pipe_inode_info *pipe = i->pipe;
struct pipe_buffer *buf;
int idx = i->idx;

if (unlikely(i->count < size))
size = i->count;
if (!size)
return;

i->count -= size;

while (pipe->nrbufs) {
buf = &pipe->bufs[idx];

/* We are left with a partial buffer.. */
if (size < buf->len) {
buf->len -= size;
buf->offset += size;
i->idx = idx;
i->iov_offset = buf->offset;
return;
}

size -= buf->len;
pipe_buf_release(pipe, buf);
pipe->nrbufs--;
idx = next_idx(idx, pipe);
}

/* We have exhausted all pipe buffers, reset everything */
pipe->curbuf = 0;
i->idx = 0;
i->iov_offset = 0;
i->count = 0;

/* Did we try to advace past that? */
WARN_ON_ONCE(size);
}

which looks a lot more straightforward, freeing the buffers as it goes
along, and doesn't screw around comparing indexes?

But as I said, I'm probably missing something, because the current
"pipe_advance()" function looks so incomprehensible to me. So I'm
assuming there's some reason for the craziness.

Linus