Re: [PATCH v5] PCI: vmd: Add the module param to adjust MSI mode
From: Xinghui Li
Date: Mon May 08 2023 - 09:00:26 EST
On Sat, May 6, 2023 at 12:08 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> > I am fine with either of these two names for the param. Changing from
> > enable_msi_remaping to disable_msi_bypass was aimed at addressing
> > your comment about dealing with devices that do not support bypass.
> > In the vmd driver, the bypass feature is enabled by adding the flag
> > "VMD_FEAT_CAN_BYPASS_MSI_REMAP", so I think disabling bypass seems
> > more appropriate. This patch aims to provide a convenient way to
> > remove that flag in a specific case.
>
> Users don't care about the name of VMD_FEAT_CAN_BYPASS_MSI_REMAP. I
> don't think that's a very good name either (in my opinion
> "VMD_FEAT_MSI_REMAP_DISABLE" would be more descriptive, and
> VMCONFIG_MSI_REMAP is even worse since setting it *disables* MSI
> remapping), but in any case these are internal to the driver.
>
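Just to make the mechanism concrete: what the parameter boils down to is
masking that feature bit before probe consumes it. A rough sketch, with
the probe details simplified and everything around the flag being my
assumption rather than the exact vmd.c code:

	static bool disable_msi_bypass;
	module_param(disable_msi_bypass, bool, 0444);
	MODULE_PARM_DESC(disable_msi_bypass,
			 "Never bypass VMD MSI-X remapping, even when the hardware supports it");

	/* In vmd_probe(), before 'features' is consumed: */
	unsigned long features = (unsigned long)id->driver_data;

	/* Drop the bypass capability so child MSIs stay remapped by VMD. */
	if (disable_msi_bypass)
		features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;
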
> > On Sat, Apr 29, 2023 at 2:40 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > The "disable_msi_bypass" parameter name also leads to some complicated
> > > logic. IIUC, "disable_msi_bypass=0" means "do not disable MSI remap
> > > bypassing" or, in other words, "do not remap MSIs." This is only
> > > supported by some VMDs. Using "disable_msi_bypass=0" to *enable* the
> > > bypass feature is confusing.
> >
> > However, as you said, it does lead to some complicated logic. So, I
> > also believe that these two approaches have their own pros and cons.
> >
> > > I still don't understand what causes the performance problem here. I
> > > guess you see higher performance when the VMD remaps child MSIs? So
> > > adding the VMD MSI-X domain interrupt handler and squashing all the
> > > child MSI vectors into the VMD MSI vector space makes things better?
> > > That seems backwards. Is this because of cache effects or something?
> >
> > > What does "excessive pressure on the PCIe node" mean? I assume the
> > > PCIe node means the VMD? It receives the same number of child
> > > interrupts in either case.
> >
> > What I mean is that there will be performance issues when a PCIe domain
> > is fully loaded with 4 Gen4 NVMe devices, like this:
> > +-[10002:00]-+-01.0-[01]----00.0 device0
> > | +-03.0-[02]----00.0 device1
> > | +-05.0-[03]----00.0 device2
> > | \-07.0-[04]----00.0 device3
> >
> > According to the perf/irqtop tools, we found the OS does receive the
> > same count of interrupts in both modes. However, when the above
> > situation occurs, we observed a significant increase in CPU idle
> > time. Besides, the data and performance when using the VMD bypass
> > feature were identical to when VMD was disabled. And after the number
> > of devices mounted on a domain is reduced, the IOPS of the individual
> > devices recovers somewhat. We therefore speculate that VMD can play a
> > role in balancing and buffering interrupt loads, and that in this
> > situation VMD ought not to be bypassed for MSI handling.
>
> The proposed text was:
>
> Use this when multiple NVMe devices are mounted on the same PCIe
> node with a high volume of 4K random I/O. It mitigates excessive
> pressure on the PCIe node caused by numerous interrupts from NVMe
> drives, resulting in improved I/O performance. Such as:
>
> The NVMe configuration and workload you mentioned works better with
> MSI-X remapping. But I don't know *why*, and I don't know that NVMe
> is the only device affected. So it's hard to write useful guidance
> for users, other than "sometimes it helps."
>
> Straw man proposal:
>
> msi_remap=0
>
> Disable VMD MSI-X remapping, which improves interrupt performance
> because child device interrupts avoid the VMD MSI-X domain
> interrupt handler. Not all VMD devices support this, but for
> those that do, this is the default.
>
> msi_remap=1
>
> Remap child MSI-X interrupts into VMD MSI-X interrupts. This
> limits the number of MSI-X vectors available to the whole child
> device domain to the number of VMD MSI-X interrupts.
>
> This may be required in some virtualization scenarios.
>
> This may improve performance in some I/O topologies, e.g., several
> NVMe devices doing lots of random I/O, but we don't know why.
>
> I hate the idea of "we don't know why." If you *do* know why, it
> would be much better to outline the mechanism because that would help
> users know when to use this. But if we don't know why, admitting that
> straight out is better than hand-waving about excessive pressure, etc.
>
I completely agree with you. I discovered this issue by bisecting, and
the conclusions above were deduced from the observed data and
experiments. However, I have not yet been able to determine the precise
cause of this issue.
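
If we go with the msi_remap= naming above, I imagine the plumbing would
look roughly like this (just a sketch; where the check lives and the
warning text are my assumptions, not the final patch):

	/* -1 = auto: bypass remapping whenever the VMD supports it */
	static int msi_remap = -1;
	module_param(msi_remap, int, 0444);
	MODULE_PARM_DESC(msi_remap,
			 "0: disable MSI-X remapping (default when supported), "
			 "1: remap child MSI-X interrupts into VMD vectors");

	/* When deciding whether to bypass, e.g. in vmd_enable_domain(): */
	bool bypass = features & VMD_FEAT_CAN_BYPASS_MSI_REMAP;

	if (msi_remap == 1)
		bypass = false;		/* user forces remapping */
	else if (msi_remap == 0 && !bypass)
		pci_warn(vmd->dev,
			 "MSI-X remapping cannot be bypassed on this VMD; ignoring msi_remap=0\n");

That would also give a natural place to warn when the hardware cannot
bypass remapping, which is the case the disable_msi_bypass naming was
trying to handle.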
> The same applies to the virtualization caveat. The text above is not
> actionable -- how do users know whether their particular
> virtualization scenario requires this option?
>
I will add a note to make it clear that bypass mode does not work when
the VMD is passed through to a guest; only remapping mode can be used
there.
Thanks