[GIT PULL] workqueue changes for v3.10-rc1

From: Tejun Heo
Date: Mon Apr 29 2013 - 20:00:31 EST

Hello, Linus.

There was a lot of activity on the workqueue side this time. The
changes achieve the following.

* WQ_UNBOUND workqueues - the workqueues which are not per-cpu - are
updated to be able to interface with multiple backend worker pools.
This involved a lot of churn but the end result seems actually
neater as unbound workqueues are now a lot closer to per-cpu ones.

* The ability to interface with multiple backend worker pools is used
to implement unbound workqueues with custom attributes. Currently
the supported attributes are the nice level and CPU affinity. It
may be expanded to include cgroup association in future. The
attributes can be specified either by calling
apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
the workqueue in question is exported through sysfs.

The backend worker pools are keyed by the actual attributes and
shared by any workqueues which share the same attributes. When
attributes of a workqueue are changed, the workqueue binds to the
worker pool with the specified attributes while leaving the work
items which are already executing in its previous worker pools alone.

This allows converting custom worker pool implementations which want
worker attribute tuning to use workqueues. The writeback pool is
already converted in the block tree and a couple of others are
likely to follow, including btrfs io workers.
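
To illustrate how pools keyed by attributes end up shared, here is a
minimal user-space sketch. It is a toy model and not the kernel
implementation: struct toy_attrs, struct toy_pool and toy_get_pool()
are hypothetical names made up for this example, and the CPU mask is
simplified to a single unsigned long.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model, not kernel code; every name here is hypothetical. */
struct toy_attrs {
	int nice;		/* nice level of the workers */
	unsigned long cpumask;	/* allowed CPUs, one bit per CPU */
};

struct toy_pool {
	struct toy_attrs attrs;	/* the key: pools are looked up by attrs */
	int refcnt;		/* number of workqueues bound to this pool */
};

#define TOY_MAX_POOLS 16
static struct toy_pool toy_pools[TOY_MAX_POOLS];
static size_t toy_nr_pools;

static int toy_attrs_equal(const struct toy_attrs *a,
			   const struct toy_attrs *b)
{
	return a->nice == b->nice && a->cpumask == b->cpumask;
}

/* Return the pool matching @attrs, creating it on first use. */
static struct toy_pool *toy_get_pool(const struct toy_attrs *attrs)
{
	struct toy_pool *pool;
	size_t i;

	for (i = 0; i < toy_nr_pools; i++) {
		if (toy_attrs_equal(&toy_pools[i].attrs, attrs)) {
			toy_pools[i].refcnt++;	/* share the existing pool */
			return &toy_pools[i];
		}
	}

	assert(toy_nr_pools < TOY_MAX_POOLS);
	pool = &toy_pools[toy_nr_pools++];
	pool->attrs = *attrs;
	pool->refcnt = 1;
	return pool;
}
```

In this model, two workqueues asking for the same nice level and CPU
mask get the same pool back, while changing either attribute yields a
different pool.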

* WQ_UNBOUND's ability to bind to multiple worker pools is also used
to make it NUMA-aware. Before this change, there was no association
between a work item's issuer and the specific worker assigned to
execute it, so using an unbound workqueue led to unnecessary
cross-node bouncing. autonuma couldn't help either, as it requires
tasks to have implicit node affinity and workers are assigned randomly.

After these changes, an unbound workqueue now binds to multiple
NUMA-affine worker pools so that queued work items are executed in
the same node. This is turned on by default but can be disabled
system-wide or for individual workqueues.

Crypto was requesting NUMA affinity as encrypting data across
different nodes can add noticeable overhead, while doing it per-cpu
was too limiting for certain cases: IO throughput could be
bottlenecked by one CPU being fully occupied while others have
idle cycles.
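
The per-node pool selection described above can be sketched roughly as
follows. Again this is a toy user-space model under stated
assumptions: the topology (8 CPUs split evenly over 2 nodes), struct
toy_numa_wq and toy_select_pool() are all hypothetical and only meant
to show the idea of picking a pool by the issuing CPU's node, with a
fallback pool when NUMA affinity is disabled for the workqueue.

```c
#include <assert.h>

#define TOY_NR_CPUS	8
#define TOY_NR_NODES	2

/* Hypothetical topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static int toy_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	return cpu / (TOY_NR_CPUS / TOY_NR_NODES);
}

struct toy_numa_wq {
	int numa_enabled;		/* per-workqueue NUMA switch */
	int node_pool[TOY_NR_NODES];	/* id of the pool serving each node */
	int dfl_pool;			/* fallback pool spanning all nodes */
};

/* Pick the pool for a work item queued from @cpu. */
static int toy_select_pool(const struct toy_numa_wq *wq, int cpu)
{
	if (!wq->numa_enabled)
		return wq->dfl_pool;
	return wq->node_pool[toy_cpu_to_node(cpu)];
}
```

With NUMA affinity enabled, a work item queued from CPU 6 in this
model lands on node 1's pool; with it disabled, everything goes to the
fallback pool.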

While the new features required a lot of changes including
restructuring locking, they didn't complicate the execution paths much.
The unbound workqueue handling is now closer to per-cpu ones and the
new features are implemented by simply associating a workqueue with
different sets of backend worker pools without changing queue,
execution or flush paths.

As such, even though the amount of change is very high, I feel
relatively safe in that it isn't likely to cause subtle issues with
basic correctness of work item execution and handling. If something
is wrong, it's likely to show up as being associated with worker pools
with the wrong attributes or OOPS while workqueue attributes are being
changed or during CPU hotplug.

While this creates more backend worker pools, it doesn't add too many
more workers unless, of course, there are many workqueues with unique
combinations of attributes. Assuming everything else is the same,
NUMA awareness costs an extra worker pool per NUMA node with online
CPUs.

There are also a couple of things which are being routed outside the
workqueue tree.

* block tree pulled in workqueue for-3.10 so that writeback worker
pool can be converted to unbound workqueue with sysfs control
exposed. This simplifies the code, makes writeback workers
NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

* The conversion to workqueue means that there's no longer a 1:1
association with a specific worker, which makes writeback folks
unhappy as they want to be able to tell which filesystem caused a
problem from backtrace on systems with many filesystems mounted.
This is resolved by allowing work items to set a debug info string
which is printed when the task is dumped. As this change involves
unifying implementations of dump_stack() and friends in arch code,
it's being routed through Andrew's -mm tree.


The following changes since commit 07961ac7c0ee8b546658717034fe692fd12eefa9:

Linux 3.9-rc5 (2013-03-31 15:12:43 -0700)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.10

for you to fetch changes up to cece95dfe5aa56ba99e51b4746230ff0b8542abd:

workqueue: use kmem_cache_free() instead of kfree() (2013-04-09 11:33:40 -0700)

