Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

From: Michael S. Tsirkin
Date: Thu Aug 20 2009 - 09:41:07 EST


On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
> On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> > On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
> > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > > On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
> > > > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > > >
> > > > > Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
> > > > > VHOST_SET_OWNER, VHOST_RESET_OWNER
> > > >
> > > > SET/RESET OWNER is still needed: otherwise if you share a descriptor
> > > > with another process, it can corrupt your memory.
> > >
> > > How? The point of using user threads is that you only ever access the
> > > address space of the thread that called the ioctl.
> >
> > Think about this example with processes A and B sharing an fd:
> > A does SET_USED_ADDRESS
> > B does SET_USED_ADDRESS
> > A does VHOST_NET_SPLICE
> > See how stuff gets written into a random place in memory of A?
>
> Yes, I didn't think of that. It doesn't seem like a big problem
> though, because it's a clear misuse of the API (I guess your
> current code returns an error for one of the SET_USED_ADDRESS
> ioctls), so I would see it as a classic garbage-in garbage-out
> case.
>
> It may even work in the case that the sharing of the fd resulted
> from a fork, where the address contains the same buffer in both
> processes. I can't think of a reason why you would want to use
> it like that though.

It doesn't matter that I don't want this: allowing one process to
corrupt another's memory is a security issue. Once you get an fd, you
want to be able to use it without worrying that a bug in another
process will crash yours.
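
To make this concrete, here is a minimal sketch of how the owner check
can work (illustrative code, not the patch itself): VHOST_SET_OWNER
records the caller's mm, and every other ioctl is rejected unless it
comes from that same mm, so B's SET_USED_ADDRESS in the example above
fails instead of redirecting A's used ring.

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/errno.h>

struct vhost_dev {
	struct mm_struct *mm;	/* owner's address space, NULL until set */
	/* ... */
};

/* VHOST_SET_OWNER: claim the device for the calling process. */
static long vhost_dev_set_owner(struct vhost_dev *dev)
{
	if (dev->mm)
		return -EBUSY;		/* already owned by someone */
	dev->mm = get_task_mm(current);
	return 0;
}

/* Called at the start of every other ioctl. */
static long vhost_dev_check_owner(struct vhost_dev *dev)
{
	return dev->mm == current->mm ? 0 : -EPERM;
}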

> > > Why would I wake up the threads spuriously? Do you mean for
> > > stopping the transmission or something else? I guess a pthread_kill
> > > would be enough for shutting it down.
> >
> > If you kill and restart them you lose the priority and other
> > parameters, but maybe.
>
> If you want to restart it, just send a nonfatal signal (SIGUSR1,
> SIGRTMIN, ...) instead of a SIGKILL.
>
> > > > More importantly, you lose control of CPU locality. Simply put, a
> > > > natural threading model in virtualization is one thread per guest vcpu.
> > > > Asking applications to add multiple helper threads just so they can
> > > > block forever is wrong, IMO, as userspace has no idea which CPU
> > > > they should be on, what priority to use, etc.
> > >
> > > But the kernel doesn't know this either; you get the same problem in
> > > another way. If you have multiple guests running at different priorities,
> > > the kernel will use those priorities to do the more important transfers
> > > first, while with a global workqueue every guest gets the same priority.
> >
> > We could create more threads if this becomes a problem. I just think it
> > should be transparent to userspace. Possibly it's useful to look at the
> > packet header as well to decide on priority: this is something userspace
> > can't do.
>
> Being transparent to user space would be nice, I agree. Letting user space
> choose would also be nice, e.g. if you want to distribute eight available
> hardware queue pairs to three guests in a non-obvious way. The
> implementation depends to some degree on how we want to do multiqueue
> RX/TX in virtio-net in the long run. For best cache locality and NUMA
> behaviour, we might want to have one virtqueue per guest CPU and control
> them independently from the host.
>
> Priorities of the packets are dealt with in the packet scheduler for
> external interfaces, I guess that is sufficient. I'm not sure if we
> need to honor the same priorities for guest-to-guest communication,
> my feeling is that we don't need to.
>
> > > You say that the natural model is to have one thread per guest
> > > CPU,
> >
> > Sorry, I was not clear. I think userspace should create a thread
> > per guest.
> > We can create as many as we need for networking but I think this should
> > be behind the scenes, so userspace shouldn't bother with host CPUs, it
> > will just get it wrong. Think of CPU hotplug, interrupts migrating
> > between CPUs, etc ...
>
> Yes, I hope we can avoid letting qemu know about host CPUs.

So ... if we need per-host-CPU threads, we'll have to create them
ourselves then.
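
If it comes to that, the sketch below shows roughly what I have in
mind (names invented for the example, and CPU hotplug handling
omitted): one worker kthread per online host CPU, pinned with
kthread_bind(), so userspace never needs to know the host topology.

#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/err.h>

/* Illustrative only: spawn one bound worker per online host CPU. */
static int spawn_per_cpu_workers(int (*fn)(void *data), void *data)
{
	struct task_struct *t;
	int cpu;

	for_each_online_cpu(cpu) {
		t = kthread_create(fn, data, "vhost/%d", cpu);
		if (IS_ERR(t))
			return PTR_ERR(t);
		kthread_bind(t, cpu);	/* pin to this host CPU */
		wake_up_process(t);
	}
	return 0;
}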

> I'm not sure we can avoid it completely, because something needs
> to set physical IRQ affinity and such for the virtual devices
> if you want to get the best locality.
>
> > > but you have a thread per host CPU instead. If the numbers
> > > are different, you probably lose either way.
> >
> > The trick I used is to keep as much as possible local:
> > TX is done on the CPU that runs the guest,
> > RX is done on the CPU that runs the NIC interrupt.
> > A smart SMP guest sees which CPU gets the interrupts
> > from the NIC and schedules its RX there, so it shouldn't matter
> > if the numbers of CPUs are different.
>
> Yes, that sounds good.
>
> > > It gets worse if you try to apply NUMA policies.
> >
> > I believe the best we could do is avoid switching CPUs
> > until we know the actual destination.
>
> My point is that the RX data in the guest address space
> should be on the same NUMA node that gets the interrupt.

Exactly. And since I know this, I just do the right thing
instead of expecting userspace to do this with taskset and
stuff, which no one seems to get right anyway.
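
For what it's worth, keeping RX local is a one-liner once it lives in
the kernel; something along these lines (placeholder names, not the
patch's API): queue the RX work on whatever CPU the socket's
data-ready callback runs on, i.e. the CPU that took the NIC interrupt,
so the copies into guest memory stay on that node.

#include <linux/workqueue.h>
#include <linux/smp.h>

static struct workqueue_struct *vhost_wq;	/* placeholder */
static struct work_struct rx_work;		/* placeholder */

/* Called from the socket's data-ready callback, i.e. on the CPU that
 * handled the NIC interrupt: keep the RX processing right there. */
static void vhost_kick_rx(void)
{
	queue_work_on(raw_smp_processor_id(), vhost_wq, &rx_work);
}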

> > > > And note that there might be more than one error. I guess that's
> > > > another problem with trying to layer on top of vfs.
> > >
> > > Why is that different from any other system call?
> >
> > With other system calls nothing happens while you process the error.
> > Here, the guest (other queues) and the network keep running (unless
> > there is a thread per queue, maybe we can block a queue, but we both
> > agreed above we don't want that).
>
> Well, I would expect error conditions to be fatal for the connections
> normally,

Not necessarily. E.g. one can imagine that userspace wants to handle
access to specific address ranges (a framebuffer?) in the slow path,
and let the kernel handle the rest. A natural way is to make such a
range generate an access error.
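
As a sketch of what I mean (types and field names invented for the
example): a memory region can be flagged so that in-kernel translation
of addresses inside it always fails, and that failure surfaces as a
non-fatal access error which userspace then services.

#include <linux/types.h>

/* Illustrative only, not the patch's table format. */
struct region {
	__u64 guest_addr;	/* guest-physical start */
	__u64 size;
	__u64 user_addr;	/* where the owner process mapped it */
	int   userspace_only;	/* 1 = always bounce to the slow path */
};

/* Return the userspace address for a guest address, 0 on access error. */
static __u64 translate(const struct region *r, __u64 addr)
{
	if (addr < r->guest_addr || addr - r->guest_addr >= r->size)
		return 0;		/* not covered: access error */
	if (r->userspace_only)
		return 0;		/* forced error -> slow path */
	return r->user_addr + (addr - r->guest_addr);
}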

> so blocking the queue is totally fine here IMHO. The ioctl
> would never return while a guest is running and connected to a
> working NIC.
>
> > > We just return when
> > > we hit the first error condition.
> >
> > If you assume losing the code for the second error condition is OK, why
> > is the first one so important? That's why I used a counter (eventfd)
> > per virtqueue: on error, userspace can scan the ring and poll the socket
> > to discover what's wrong, and the counter ensures we can detect that an
> > error happened while we were not looking.
>
> I guess we were talking about different kinds of errors here, and I'm
> still not sure which one you are talking about.
>
> Arnd <><

Non-fatal errors. E.g. translation errors should probably be
non-fatal. I can also imagine working around guest bugs in
userspace.
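
For reference, the userspace side of that error eventfd can be as
small as this (the fd name is hypothetical): a zero-timeout poll plus
a read, where the read returns and clears the accumulated error count,
so errors raised while we were busy elsewhere are still noticed.

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Returns the number of errors signalled since the last call; a
 * non-zero result means: rescan the ring and poll the socket. */
static uint64_t drain_error_counter(int error_eventfd)
{
	struct pollfd pfd = { .fd = error_eventfd, .events = POLLIN };
	uint64_t count = 0;

	if (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN))
		read(error_eventfd, &count, sizeof(count));	/* also resets it */

	return count;
}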

--
MST