Re: [RFC PATCH 0/4] cgroup aware workqueues
From: Michael Rapoport
Date: Fri May 27 2016 - 05:22:36 EST
> Tejun Heo <htejun@xxxxxxxxx> wrote on 03/31/2016 08:14:35 PM:
>
> Hello, Michael.
>
> On Thu, Mar 31, 2016 at 08:17:13AM +0200, Michael Rapoport wrote:
> > > There really shouldn't be any difference when using unbound
> > > workqueues. workqueue becomes a convenience thing which manages
> > > worker pools and there shouldn't be any difference between workqueue
> > > workers and kthreads in terms of behavior.
> >
> > I agree that there really shouldn't be any performance difference, but
> > the tests I've run show otherwise. I have no idea why and I hadn't time
> > yet to investigate it.
>
> I'd be happy to help digging into what's going on. If kvm wants full
> control over the worker thread, kvm can use workqueue as a pure
> threadpool. Schedule a work item to grab a worker thread with the
> matching attributes and keep using it as it'd a kthread. While that
> wouldn't be able to take advantage of work item flushing and so on,
> it'd still be a simpler way to manage worker threads and the extra
> stuff like cgroup membership handling doesn't have to be duplicated.
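
For reference, this is roughly how I read the "pure threadpool" idea; it's
only a sketch with made-up names (grab_ctx, grabbed_worker_fn), queued on
system_unbound_wq for illustration, not code from any of the patches:

#include <linux/workqueue.h>
#include <linux/llist.h>
#include <linux/wait.h>

/* One work item that, once running, does not return until teardown, so
 * the unbound kworker executing it effectively becomes a dedicated
 * thread carrying the workqueue's attributes. */
struct grab_ctx {
	struct work_struct	work;	/* item that takes over a kworker */
	struct llist_head	todo;	/* privately managed work */
	wait_queue_head_t	wait;
	bool			stop;
};

static void grabbed_worker_fn(struct work_struct *work)
{
	struct grab_ctx *ctx = container_of(work, struct grab_ctx, work);

	for (;;) {
		wait_event(ctx->wait,
			   !llist_empty(&ctx->todo) || READ_ONCE(ctx->stop));
		if (READ_ONCE(ctx->stop))
			return;
		/* pop entries from ctx->todo and run them here */
	}
}

static void grab_worker(struct grab_ctx *ctx)
{
	INIT_WORK(&ctx->work, grabbed_worker_fn);
	init_llist_head(&ctx->todo);
	init_waitqueue_head(&ctx->wait);
	ctx->stop = false;
	queue_work(system_unbound_wq, &ctx->work);
}
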
>
> > > > opportunity for optimization, at least for some workloads...
> > >
> > > What sort of optimizations are we talking about?
> >
> > Well, if we take Evlis (1) as for the theoretical base, there could be
> > benefit of doing I/O scheduling inside the vhost.
>
> Yeah, if that actually is beneficial, take full control of the
> kworker thread.
It took me a while, but at last I had time to run some benchmarks.
I've compared guest-to-guest netperf results for three variants of the
vhost implementation:
(1) vanilla 4.4 (baseline)
(2) 4.4 + unbound workqueues based on Bandan's patches [1]
(3) 4.4 + "grabbed" worker thread. This is my POC implementation that
actually follows your proposal to take full control over the worker
thread; a rough sketch of the idea follows below.
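
To make (3) a bit more concrete, here's a rough sketch of the idea as it
might look inside drivers/vhost/vhost.c. This is not the actual POC code:
the grab_work field in struct vhost_dev is made up for illustration, and
cgroup attachment and teardown (which vanilla handles around
kthread_create()/kthread_stop()) are left out.

static void vhost_grab_work_fn(struct work_struct *work)
{
	struct vhost_dev *dev = container_of(work, struct vhost_dev,
					     grab_work);

	/* From here on this unbound kworker effectively is the vhost
	 * worker thread; reuse the existing vhost_worker() loop. */
	vhost_worker(dev);
}

static int vhost_dev_grab_worker(struct vhost_dev *dev)
{
	/* Called from vhost_dev_set_owner() in place of
	 * kthread_create(vhost_worker, dev, "vhost-%d", current->pid). */
	INIT_WORK(&dev->grab_work, vhost_grab_work_fn);
	return queue_work(system_unbound_wq, &dev->grab_work) ? 0 : -EBUSY;
}
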
I've run two guests without any CPU pinning and without any actual
interaction with cgroups.
Here are the results (in MBits/sec):
 size, bytes |    64 |     256 |    1024 |    4096 |   16384
-------------+-------+---------+---------+---------+---------
         (1) | 496.8 | 1346.31 | 6058.49 | 13736.2 | 13541.4
         (2) | 493.3 | 1604.03 | 5723.68 | 10181.4 | 15572.4
         (3) | 489.7 | 1437.86 | 6251.12 | 12774.2 | 12867.9
From what I see, which approach performs best depends on the packet
size; no single variant wins across the board.
Moreover, I'd expect that when vhost completely takes over the worker
thread there would be no difference vs. the current state.
Tejun, can you help explain these results?
[1] http://thread.gmane.org/gmane.linux.network/286858