Re: [PATCH 03/18] xen/pvcalls: initialize the module and register the xenbus backend

From: Stefano Stabellini
Date: Fri May 19 2017 - 18:33:59 EST


On Thu, 18 May 2017, Stefano Stabellini wrote:
> On Wed, 17 May 2017, Juergen Gross wrote:
> > On 16/05/17 21:58, Stefano Stabellini wrote:
> > > On Tue, 16 May 2017, Juergen Gross wrote:
> > >> On 15/05/17 22:35, Stefano Stabellini wrote:
> > >>> The pvcalls backend has one ioworker per cpu: the ioworkers are
> > >>> implemented as a cpu bound workqueue, and will deal with the actual
> > >>> socket and data ring reads/writes.
> > >>>
> > >>> ioworkers are global: we only have one set for all the frontends. They
> > >>> process requests on their wqs list in order, once they are done with a
> > >>> request, they'll remove it from the list. A spinlock is used for
> > >>> protecting the list. Each ioworker is bound to a different cpu to
> > >>> maximize throughput.
> > >>>
> > >>> Signed-off-by: Stefano Stabellini <stefano@xxxxxxxxxxx>
> > >>> CC: boris.ostrovsky@xxxxxxxxxx
> > >>> CC: jgross@xxxxxxxx
> > >>> ---
> > >>> drivers/xen/pvcalls-back.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++
> > >>> 1 file changed, 64 insertions(+)
> > >>>
> > >>> diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
> > >>> index 2dbf7d8..46a889a 100644
> > >>> --- a/drivers/xen/pvcalls-back.c
> > >>> +++ b/drivers/xen/pvcalls-back.c
> > >>> @@ -25,6 +25,26 @@
> > >>> #include <xen/xenbus.h>
> > >>> #include <xen/interface/io/pvcalls.h>
> > >>>
> > >>> +struct pvcalls_ioworker {
> > >>> + struct work_struct register_work;
> > >>> + atomic_t io;
> > >>> + struct list_head wqs;
> > >>> + spinlock_t lock;
> > >>> + int num;
> > >>> +};
> > >>> +
> > >>> +struct pvcalls_back_global {
> > >>> + struct pvcalls_ioworker *ioworkers;
> > >>> + int nr_ioworkers;
> > >>> + struct workqueue_struct *wq;
> > >>> + struct list_head privs;
> > >>> + struct rw_semaphore privs_lock;
> > >>> +} pvcalls_back_global;
> > >>> +
> > >>> +static void pvcalls_back_ioworker(struct work_struct *work)
> > >>> +{
> > >>> +}
> > >>> +
> > >>> static int pvcalls_back_probe(struct xenbus_device *dev,
> > >>> const struct xenbus_device_id *id)
> > >>> {
> > >>> @@ -59,3 +79,47 @@ static int pvcalls_back_uevent(struct xenbus_device *xdev,
> > >>> .uevent = pvcalls_back_uevent,
> > >>> .otherend_changed = pvcalls_back_changed,
> > >>> };
> > >>> +
> > >>> +static int __init pvcalls_back_init(void)
> > >>> +{
> > >>> + int ret, i, cpu;
> > >>> +
> > >>> + if (!xen_domain())
> > >>> + return -ENODEV;
> > >>> +
> > >>> + ret = xenbus_register_backend(&pvcalls_back_driver);
> > >>> + if (ret < 0)
> > >>> + return ret;
> > >>> +
> > >>> + init_rwsem(&pvcalls_back_global.privs_lock);
> > >>> + INIT_LIST_HEAD(&pvcalls_back_global.privs);
> > >>> + pvcalls_back_global.wq = alloc_workqueue("pvcalls_io", 0, 0);
> > >>> + if (!pvcalls_back_global.wq)
> > >>> + goto error;
> > >>> + pvcalls_back_global.nr_ioworkers = num_online_cpus();
> > >>
> > >> Really? Recently I cam across a system with 640 dom0 cpus. I don't think
> > >> we want 640 workers initialized when loading the backend module. I'd
> > >> prefer one or a few workers per connected frontend.
> > >
> > > I think we want to keep the ioworker allocation to be based on the
> > > number of vcpus: we do not want more ioworkers than vcpus because it is
> > > a waste of resources and leads to worse performance. Also, given that
> > > they do memcpy's, I also think it is a good idea to bind them to vcpus
> > > (and pin vcpus to pcpus) to get best performance.
> >
> > This will cause a lot of pain for the cpu offline case. Please don't try
> > to work against the hypervisor scheduler by designing a backend based on
> > a vcpu pin policy. This might result in best performance for your
> > special workload, but generally it is a bad idea!
>
> You are right. Of course, vcpu pinning is not a fundamental requirement
> for this backend. I wrote about vcpu pinning only to help with the
> explanation.
>
> However, pvcalls is a memcpy based protocol and to perform memcpys
> efficiently is very important to keep caches hot. The target is to hit
> the same cacheline when reading and writing, which makes a huge
> difference; it depends on processor and architecture but it is easily
> 20%. To get caching benefits, we need to do memcpys for the same socket
> on the same vcpu (and on the same pcpu as well, that's why I mentioned
> vcpu pinning, but we'll trust the Xen scheduler to do the right thing
> when there is no contention).
>
> This is why in this backend, regardless of the workqueue
> design/allocation we use, I think we have to stick to two basic
> principles:
>
> - each socket is bound to one vcpu
> - sockets are distributed evenly across vcpus

[...]

> It might be possible to create an ioworker for each socket (instead of
> an ioworker for each vcpu) if we wanted to, as long as we bind it to a
> vcpu and distribute them evenly across vcpus.

I don't have access anymore to the large machine I used for the
benchmarks a few months back, but even on my current small testbox
I can see a (small) performance penalty when I don't bind ioworkers to
specific vcpus.

However, using an ioworker per socket, rather than per vcpu, makes the
code much smaller and nicer! Also it makes it much easier to change the
policy between binding, or not binding, ioworkers to specific vcpu.