Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality

From: Tejun Heo
Date: Fri May 19 2023 - 18:35:35 EST


Hello, Linus.

On Thu, May 18, 2023 at 05:41:29PM -0700, Linus Torvalds wrote:
...
> That commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity
> read workqueue") really talks about startup costs. They are about
> things like "page in the executable", which is all almost 100%
> serialized with no parallelism at all. Even read-ahead ends up being
> serial, in that it's likely one single contiguous IO.

I should have explained my thought process better here. I don't think the
regressions behind the fsverity and other similar recent reports on
heterogeneous ARM CPUs are caused directly by workqueue. Please take a look
at the following message from Brian Norris.

https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@xxxxxxxxxx/T/#u

While it's difficult to tell anything definitive from the message, the
reported performance ordering is

4.19 > pinning worker to a CPU >> fixating CPU freq > kthread_worker

where the difference between 4.19 and pinning to one CPU is pretty small.
So, it does line up with other reports in that the source of the higher
latencies and lower performance is work items getting sprayed across CPUs.
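
(For reference, the kthread_worker variant funnels every work item through
one dedicated kthread instead of the shared worker pools. A minimal sketch
of that pattern follows; the names are made up for illustration and this is
not the actual test code.)

  #include <linux/module.h>
  #include <linux/kthread.h>

  static struct kthread_worker *example_worker;
  static struct kthread_work example_work;

  static void example_work_fn(struct kthread_work *work)
  {
          /* process one queued request */
  }

  static int __init example_init(void)
  {
          /* one dedicated kthread executes every queued work item */
          example_worker = kthread_create_worker(0, "example_worker");
          if (IS_ERR(example_worker))
                  return PTR_ERR(example_worker);

          kthread_init_work(&example_work, example_work_fn);
          kthread_queue_work(example_worker, &example_work);
          return 0;
  }

  static void __exit example_exit(void)
  {
          /* flushes pending work and stops the kthread */
          kthread_destroy_worker(example_worker);
  }

  module_init(example_init);
  module_exit(example_exit);
  MODULE_LICENSE("GPL");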

However, the two kernel versions tested are 4.19 and 5.15. I audited the
commits in between and didn't spot anything which would materially change
unbound workers' affinities or how unbound work items are assigned to them.

Also, f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity read
workqueue") reports a 30-fold increase in scheduler latency, which I take to
be the time from a work item being queued to the start of its execution.
That's unlikely to be just from crossing a cache boundary. There must be
other interactions (e.g. with some powersaving state transitions).
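
(For context, that commit is essentially a one-flag change to how fsverity
allocates its read workqueue, roughly as below; the exact max_active
argument may differ.)

  /* before: unbound, work items run on the shared unbound worker pools */
  wq = alloc_workqueue("fsverity_read_queue",
                       WQ_UNBOUND | WQ_HIGHPRI, num_online_cpus());

  /* after: per-cpu, work items run on the CPU they were queued from */
  wq = alloc_workqueue("fsverity_read_queue",
                       WQ_HIGHPRI, num_online_cpus());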

That said, workqueue, by spraying work items across cache boundaries, does
create conditions in which this sort of problem can be significantly
amplified. workqueue isn't the root cause of these problems and can't fix
them; however,

* The current workqueue behavior is silly on machines with multiple L3
caches such as recent AMD CPUs w/ chiplets and heterogeneous ARM ones.
Improving workqueue locality is likely to lessen the severity of the
recently reported latency problems, possibly to the extent that it won't
matter anymore.

* The fact that the remedy people are going for is switching to per-cpu
workqueues is bad. That is going to hurt other use cases, often severely,
and their only recourse would be reverting to unbound workqueues.

So, my expectation with the posted patchset is that most of the severe
Chrome OS problems will hopefully go away. Even if that's not sufficient,
unbound workqueues now provide enough flexibility to work around these
problems without resorting to per-cpu workqueues, thus avoiding negatively
affecting other use cases.
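
(As a sketch of that flexibility: an unbound workqueue allocated with
WQ_SYSFS can have its affinity scope tuned from userspace under
/sys/devices/virtual/workqueue/, and kernel code can apply workqueue_attrs
directly. The affn_scope field and WQ_AFFN_CACHE value below use the names
from this series.)

  #include <linux/workqueue.h>

  static struct workqueue_struct *example_wq;

  static int example_setup_wq(void)
  {
          struct workqueue_attrs *attrs;
          int ret;

          /* unbound and tunable from userspace via sysfs */
          example_wq = alloc_workqueue("example_wq",
                                       WQ_UNBOUND | WQ_SYSFS, 0);
          if (!example_wq)
                  return -ENOMEM;

          /* or confine workers to the issuing CPU's L3 domain in-kernel */
          attrs = alloc_workqueue_attrs();
          if (!attrs)
                  return -ENOMEM;
          attrs->affn_scope = WQ_AFFN_CACHE;
          ret = apply_workqueue_attrs(example_wq, attrs);
          free_workqueue_attrs(attrs);
          return ret;
  }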

The performance benchmarks are not directed at the reported latency
problems. The reported magnitude seems very hardware-specific to me and I
have no way of reproducing it. The benchmarks are more to justify switching
the default boundaries from NUMA to cache.

> Yes, latency tends to be harder to benchmark than throughput, but I
> really think latency trumps throughput 95% of the time. And all your
> benchmark loads looked like throughput loads to me: they just weren't
> using *all* the CPU capacity you had.
>
> Yes, writeback can have lovely throughput behavior and saturate the IO
> because you have lots of parallelism. But reads are often 100% serial
> for one thread, and often you don't *have* more than one thread.
>
> So I think your "not enough work to saturate" is still ludicrously
> over-the-top. You should not aim for "not enough work to saturate 24
> threads". You should aim for "basically completely single-threaded".
> Judging by your "CPU utilization of 60-70%", I think your "LOW" is off
> by at least an order of magnitude.


More Experiments
================

I ran several more test sets. An extra workqueue configuration, CPU+STRICT,
is added, which is very similar to using CPU_INTENSIVE per-cpu workqueues.
Also, CPU util is now relative to a single CPU, i.e. 100% indicates using
one full CPU rather than all CPUs, because otherwise the numbers were too
small. The tests are run against dm-crypt on a tmpfs-backed loop device.
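
(CPU+STRICT confines each worker to the queueing CPU, so it should behave
almost identically to a plain per-cpu workqueue created as below;
WQ_CPU_INTENSIVE only exempts the work items from the per-cpu concurrency
accounting.)

  /* per-cpu; execution always stays on the queueing CPU */
  wq = alloc_workqueue("example_wq", WQ_CPU_INTENSIVE, 0);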


4. LLOW

A similar benchmark to before, but --bs is now 512 bytes, and --iodepth and
--numjobs are dialed down to 4 and 1, respectively. Compared to LOW, both
the IO size and concurrency are 64 times lower.

taskset 0x8 fio --filename=/dev/dm-1 --direct=1 --rw=randrw --bs=512 \
--ioengine=libaio --iodepth=4 --runtime=60 --numjobs=1 \
--time_based --group_reporting --name=iops-test-job --verify=sha512

Note that fio is pinned to one CPU because otherwise I was getting random
bi-modal behaviors depending on what the scheduler was doing, which made
comparisons difficult.

                 Bandwidth (MiBps)   CPU util (%)   vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM               55.08 ±0.29     290.80 ±0.64          0.00
CACHE                64.42 ±0.47     291.51 ±0.30        +16.96
CACHE+STRICT         65.18 ±1.14     303.74 ±1.79        +18.34
CPU+STRICT           32.86 ±0.34     159.05 ±0.37        -48.99
SYSTEM+LOCAL         56.76 ±0.30     286.78 ±0.50         +3.05
CACHE+LOCAL          57.74 ±0.11     291.65 ±0.80         +4.83

The polarities of +LOCAL's relative BWs flipped, showing gains over SYSTEM.
However, the fact that they're significantly worse than CACHE didn't change.

This is 4 in-flight 512-byte IOs, entirely plausible in systems of any
size. Even at this low level of concurrency, the downside of using a
per-cpu workqueue (CPU+STRICT) is clear.


5. SYNC

Let's push it further. It's now single-threaded synchronous read/write(2)s
with 512-byte blocks. If we're going to be able to extract meaningful gains
from localizing to the issuing CPU, this should show it.

taskset 0x8 fio --filename=/dev/dm-1 --direct=1 --rw=randrw --bs=512 \
--ioengine=io_uring --iodepth=1 --runtime=60 --numjobs=1 \
--time_based --group_reporting --name=iops-test-job --verify=sha512

                 Bandwidth (MiBps)   CPU util (%)   vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM               18.64 ±0.19      88.41 ±0.47          0.00
CACHE                21.46 ±0.11      91.39 ±0.26        +15.13
CACHE+STRICT         20.80 ±0.23      90.68 ±0.15        +11.59
CPU+STRICT           21.12 ±0.61      95.29 ±0.51         -1.58
SYSTEM+LOCAL         20.80 ±0.20      90.12 ±0.34        +11.59
CACHE+LOCAL          21.46 ±0.09      93.18 ±0.04        +15.13

Now the +LOCALs are doing quite a bit better than SYSTEM but still can't
beat CACHE. Interestingly, CPU+STRICT performs noticeably worse than the
others while occupying the CPU for longer. I haven't thought too much about
why this would be; maybe the benefit of using resources on other CPUs
outweighs the overhead of doing so?


Summary
=======

The overall conclusion doesn't change much. The +LOCALs fare better at
lower concurrency but still can't beat CACHE. I tried other combinations
and ran against an SSD too; nothing deviates significantly from the
overall pattern.

I'm sure we can concoct a workload which uses significantly less than one
full CPU (so that utilizing other CPUs' resources doesn't bring enough gain)
and can stay within L1/L2 to maximize the benefit of not having to go
through L3, but it looks like that's going to take some stretching. At
least on the CPU I'm testing, it looks like letting work items loose within
each L3 domain is the right thing to do.

And, ARM folks, AFAICS, workqueue is unlikely to be the root cause of the
significant regressions observed after switching to recent kernels.
However, this patchset should hopefully alleviate the symptoms
significantly or, failing that, at least make working around them a lot
easier. So, please test, and please don't switch CPU-heavy unbound
workqueues to per-cpu. That's guaranteed to hurt other heavier-weight
usages.

Thanks.

--
tejun