RE: [PATCH v9 0/4] shut down devices asynchronously
From: Michael Kelley
Date: Thu Oct 17 2024 - 23:26:18 EST
From: Michael Kelley <mhklinux@xxxxxxxxxxx> Sent: Thursday, October 10, 2024 9:22 PM
>
> From: Stuart Hayes <stuart.w.hayes@xxxxxxxxx> Sent: Wednesday, October 9, 2024 10:58 AM
> >
> > This adds the ability for the kernel to shut down devices asynchronously.
> >
> > Only devices with drivers that enable it are shut down asynchronously.
> >
> > This can dramatically reduce system shutdown/reboot time on systems that
> > have multiple devices that take many seconds to shut down (like certain
> > NVMe drives). On one system tested, the shutdown time went from 11 minutes
> > without this patch to 55 seconds with the patch.
>
> I've been testing this series against a 6.11.0 kernel in an Azure VM, which
> is running as a guest on the Microsoft Hyper-V hypervisor. The VM has 16 vCPUs,
> 128 Gbytes of memory, and two physical NVMe controllers that are mapped
> into the VM so that it can access them directly.
>
[snip]
>
> But here's the kicker: The overall process of shutting down the devices
> took *longer* with the patch set than without. Here's the same output
> from a 6.11.0 kernel without the patch set:
>
[snip]
>
> Any thoughts on what might be causing this? I haven't gone into the
> details of your algorithms for parallelizing, but is there any extra
> overhead that could be adding to the time? Or maybe this is
> something unique to Hyper-V guests. The overall difference is only
> a few 10's of milliseconds, so not that big of a deal. But maybe it's
> an indicator that something unexpected is happening that we should
> understand.
>
> I'll keep thinking about the issue and see if I can get any more insight.

I've debugged this enough to know what is happening. I see the
same behavior in two different environments:

1) A Hyper-V VM with 8 vCPUs running on my local laptop. It has
no NVMe devices, so there's no opportunity for parallelism with this
patch set.

2) A Raspberry Pi 5 with 4 CPUs. Linux is running on the bare metal.
It has one NVMe device via the Pi 5 M.2 HAT.

In both cases, the loop in device_shutdown() goes through a lot of
devices: 492 in my Hyper-V VM, and 577 on the Pi 5. With the code
in this patch set, all the devices get added to the async_wq in
kernel/async.c. So async_wq quickly gets heavily populated.
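
For reference, here is roughly the pattern I'm describing, as I
understand the patch. This is a minimal sketch only: the callback
name is illustrative rather than the patch's actual function, and
the locking and list manipulation that device_shutdown() does are
omitted.

#include <linux/async.h>
#include <linux/device.h>

/* Illustrative name; stands in for the patch's per-device shutdown work. */
static void shutdown_one_device_async(void *data, async_cookie_t cookie)
{
	struct device *dev = data;

	/* ... lock the device, call the bus/driver ->shutdown(), unlock ... */
}

/*
 * Scheduling side of the sketch: every device in devices_kset gets one
 * async entry, and each entry becomes a work item on async_wq in
 * kernel/async.c, so the workqueue fills up almost immediately.
 */
static void queue_device_shutdowns(void)
{
	struct device *dev;

	list_for_each_entry_reverse(dev, &devices_kset->list, kobj.entry)
		async_schedule_dev(shutdown_one_device_async, dev);
}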
In the process, the workqueue code spins up additional worker threads
to handle the load. On the Hyper-V VM, 210 to 230 new kernel
threads are created during device_shutdown(), depending on the
timing. On the Pi 5, 253 are created. The max for this workqueue is
WQ_DFL_ACTIVE (256).
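
For context, the 256 figure is just WQ_DFL_ACTIVE: async_wq is an
unbound workqueue created with a max_active of 0, which the workqueue
core treats as the default. Something along these lines (paraphrased
from kernel/async.c on recent kernels, not quoted verbatim):

#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *async_wq;

static int __init async_init(void)
{
	/*
	 * WQ_UNBOUND with max_active = 0, which means WQ_DFL_ACTIVE (256).
	 * That is the ceiling on concurrently executing async entries, and
	 * it lines up with the ~210-250 extra kworker threads observed
	 * during device_shutdown().
	 */
	async_wq = alloc_workqueue("async", WQ_UNBOUND, 0);

	return async_wq ? 0 : -ENOMEM;
}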
Creating all these new worker threads, and scheduling and
synchronizing them, takes 30 to 40 milliseconds of additional time
compared to the original code. On the Hyper-V VM, device_shutdown()
completes in about 5 milliseconds with the original code, and roughly
40 milliseconds with the new async code. The Pi 5 results are more
variable, but also show roughly 30 to 40 milliseconds of additional
time.
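
In case anyone wants to reproduce the comparison, the measurement
doesn't need anything fancy; a throwaway debug hack along these
lines is enough. The placement and format are illustrative, and
nr_threads is only a rough proxy for threads created, since threads
can also exit during the window.

#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/sched/stat.h>

void device_shutdown(void)
{
	ktime_t start = ktime_get();
	int threads_before = nr_threads;

	/* ... existing loop that shuts down (or queues) each device ... */

	pr_info("device_shutdown: %lld ms, %d new threads\n",
		ktime_to_ms(ktime_sub(ktime_get(), start)),
		nr_threads - threads_before);
}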
Is all this extra work a problem? I don't know. Clearly, there's a
big benefit in parallelizing the NVMe shutdown, and in those
situations the extra 30 to 40 milliseconds can be ignored. But I
wonder if there are other situations where the extra elapsed time
and CPU consumption might be a problem.

I also wonder about scalability. Does everything still work if
device_shutdown() is processing 5000 devices? Is there a possibility
of deadlock if async_wq can only have 256 active items out of
5000 queued ones?
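
To make the deadlock question concrete, the pattern I'm picturing is
something like this (contrived, and the names are mine, not the
patch's): a parent's work item occupies an active slot in async_wq
while it waits for the child's cookie. If all 256 active slots were
held by waiters whose children are still sitting in the queue,
nothing could make progress. I don't know whether the ordering in
the patch rules that out, hence the question.

#include <linux/async.h>

/* Cookie returned when the child's shutdown was queued. */
static async_cookie_t child_cookie;

static void child_shutdown(void *data, async_cookie_t cookie)
{
	/* shut down the child/consumer device */
}

static void parent_shutdown(void *data, async_cookie_t cookie)
{
	/*
	 * Blocks inside an async_wq work item until every entry queued
	 * before (child_cookie + 1) has completed, i.e. until the child
	 * is done. While blocked, this item still holds an active slot.
	 */
	async_synchronize_cookie(child_cookie + 1);

	/* now shut down the parent/supplier device */
}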
Michael Kelley
>
> >
> > Changes from V8:
> >
> > Deal with shutdown hangs resulting when a parent/supplier device is
> > later in the devices_kset list than its children/consumers:
> > * Ignore sync_state_only devlinks for shutdown dependencies
> > * Ignore shutdown_after for devices that don't want async shutdown
> > * Add a sanity check to revert to sync shutdown for any device that
> > would otherwise wait for a child/consumer shutdown that hasn't
> > already been scheduled
> >
> > Changes from V7:
> >
> > Do not expose driver async_shutdown_enable in sysfs.
> > Wrapped a long line.
> >
> > Changes from V6:
> >
> > Removed a sysfs attribute that allowed the async device shutdown to be
> > "on" (with driver opt-out), "safe" (driver opt-in), or "off"... what was
> > previously "safe" is now the only behavior, so drivers now only need to
> > have the option to enable or disable async shutdown.
> >
> > Changes from V5:
> >
> > Separated into multiple patches to make review easier.
> > Reworked some code to make it more readable
> > Made devices wait for consumers to shut down, not just children
> > (suggested by David Jeffery)
> >
> > Changes from V4:
> >
> > Change code to use cookies for synchronization rather than async domains
> > Allow async shutdown to be disabled via sysfs, and allow driver opt-in or
> > opt-out of async shutdown (when not disabled), with ability to control
> > driver opt-in/opt-out via sysfs
> >
> > Changes from V3:
> >
> > Bug fix (used "parent" not "dev->parent" in device_shutdown)
> >
> > Changes from V2:
> >
> > Removed recursive functions to schedule children to be shutdown before
> > parents, since existing device_shutdown loop will already do this
> >
> > Changes from V1:
> >
> > Rewritten using kernel async code (suggested by Lukas Wunner)
> >
> >
> > Stuart Hayes (4):
> > driver core: don't always lock parent in shutdown
> > driver core: separate function to shutdown one device
> > driver core: shut down devices asynchronously
> > nvme-pci: Make driver prefer asynchronous shutdown
> >
> > drivers/base/base.h | 4 +
> > drivers/base/core.c | 137 +++++++++++++++++++++++++++-------
> > drivers/nvme/host/pci.c | 1 +
> > include/linux/device/driver.h | 2 +
> > 4 files changed, 118 insertions(+), 26 deletions(-)
> >
> > --
> > 2.39.3
> >
>