I think you're missing my point. Using ioctls *requires* a thread per vcpu in userspace. This is unnecessary, since you could simply provide a char-device based read/write interface. You could then multiplex events across the vcpus and poll.
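To make the multiplexing idea concrete, here is a minimal sketch of one thread waiting on several per-vcpu descriptors with poll(). It assumes each vcpu is exposed as a readable file descriptor; the names wait_for_event and MAX_VCPUS are made up for illustration and are not part of any existing interface.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_VCPUS 16   /* hypothetical upper bound, for illustration only */

/*
 * Wait until one of the descriptors becomes readable.
 * Returns the index of the first readable descriptor,
 * or -1 on timeout or error.
 */
int wait_for_event(const int *fds, int nfds, int timeout_ms)
{
    struct pollfd pfds[MAX_VCPUS];
    int i;

    assert(nfds > 0 && nfds <= MAX_VCPUS);

    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    /* A single blocking call covers every descriptor. */
    if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
        return -1;

    for (i = 0; i < nfds; i++)
        if (pfds[i].revents & POLLIN)
            return i;

    return -1;
}
```

With pipes standing in for vcpu descriptors, writing a byte to the second pipe makes wait_for_event() return 1, while an idle set with a zero timeout returns -1.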
Yes, ioctl()s require userspace threads, but that's okay, because they're free for us, since we need a kernel thread for each vcpu.
On the other hand, a single device model thread polling the vcpus is guaranteed to be on the wrong physical cpu half of the time (assuming 2 cpus and 2 vcpus), which means sending IPIs and suspending a vcpu in order to run.
And your previously proposed solution of having one big lock would do the same thing, except that it would require additional round trips to the kernel :-)
Moreover, you could get clever and use mmap() to expose a ring queue if you're really concerned about SMP.
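As a rough sketch of what such a ring queue could look like: a single-producer, single-consumer ring whose struct would live in a page shared between kernel and userspace via mmap(). Only the queue logic is shown. The layout, the names (event_ring, ring_push, ring_pop) and the use of C11 atomics are assumptions for illustration, not an existing interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 64   /* hypothetical size; must be a power of two */

static_assert((RING_SLOTS & (RING_SLOTS - 1)) == 0,
              "RING_SLOTS must be a power of two");

/* Would be placed in the shared (mmap'ed) page. */
struct event_ring {
    _Atomic uint32_t head;        /* next slot the producer writes */
    _Atomic uint32_t tail;        /* next slot the consumer reads  */
    uint64_t slots[RING_SLOTS];   /* opaque event words            */
};

/* Producer side. Returns false if the ring is full. */
bool ring_push(struct event_ring *r, uint64_t ev)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return false;

    r->slots[head % RING_SLOTS] = ev;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
bool ring_pop(struct event_ring *r, uint64_t *ev)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;

    *ev = r->slots[tail % RING_SLOTS];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Neither side takes a lock or enters the kernel to move an event, which is the point for SMP. A wakeup mechanism (for example a pollable fd) would still be needed for the case where the consumer sleeps on an empty ring.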
Really though, it comes down to one simple thing: blocking ioctl()s are a really ugly interface.
If nothing else, you have to be able to run timers in userspace and interrupt the kernel execution (to signal DMA completion, for instance). Even in the UP case, this gets ugly quickly.
The timers aren't pretty (we use signals), yes. But avoiding the extra thread is critical for performance IMO.
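For readers unfamiliar with the signal approach, a minimal sketch of the mechanism: a timer signal whose handler is installed without SA_RESTART makes a blocking system call return early with EINTR. Here a blocking read() stands in for the blocking ioctl(), and the function name blocking_call_with_timer is made up for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t timer_fired;

static void on_alarm(int sig)
{
    (void)sig;
    timer_fired = 1;
}

/*
 * Arm a one-shot timer, then block in read() on fd.
 * Returns 1 if the timer signal interrupted the call,
 * 0 if read() completed on its own, -1 on any other error.
 */
int blocking_call_with_timer(int fd, long usec)
{
    struct sigaction sa;
    struct itimerval it;
    char buf;
    ssize_t n;
    int saved_errno;

    assert(usec > 0);   /* a zero value would disarm the timer */

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    /* sa_flags stays 0: without SA_RESTART, read() fails with EINTR. */
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        return -1;

    memset(&it, 0, sizeof(it));
    it.it_value.tv_sec = usec / 1000000;
    it.it_value.tv_usec = usec % 1000000;
    timer_fired = 0;
    if (setitimer(ITIMER_REAL, &it, NULL) < 0)
        return -1;

    /* Stand-in for the blocking ioctl(). If the signal arrives before
     * this call is entered, it is lost and the call keeps blocking. */
    n = read(fd, &buf, 1);
    saved_errno = errno;

    /* Disarm the timer in case read() returned first. */
    memset(&it, 0, sizeof(it));
    setitimer(ITIMER_REAL, &it, NULL);

    if (n < 0 && saved_errno == EINTR && timer_fired)
        return 1;
    return n >= 0 ? 0 : -1;
}
```

The race noted in the comment is part of why this approach isn't pretty: the signal only helps if it lands while the thread is actually inside the blocking call.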
We've had a lot of problems in QEMU with timers and kqemu. Forcing the guest to return to userspace so that periodic timers can run (which may simply be the VGA refresh, something the guest doesn't care about) is at best a hack.
Being able to poll an FD would make this so much nicer...
I've posted some patches on qemu-devel attempting to deal with these issues (look for threads on optimizing char device performance). None of them are very pretty.