Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group

From: Megha Dey
Date: Mon Aug 12 2019 - 13:26:00 EST


On Wed, 2019-08-07 at 15:56 +0200, Thomas Gleixner wrote:
> Megha,
>
> On Tue, 6 Aug 2019, Megha Dey wrote:
> >
> > On Sat, 2019-06-29 at 09:59 +0200, Thomas Gleixner wrote:
> > >
> > > On Fri, 21 Jun 2019, Megha Dey wrote:
> > Totally agreed. The request to add a dynamic MSI-X infrastructure
> > came
> > from some driver teams internally and currently they do not have
> > bandwidth to come up with relevant test cases. <sigh>
> Hahahaha.
>
> >
> > But we hope that this patch set could serve as a precursor to the
> > interrupt message store (IMS) patch set, and we can use this patch
> > set
> > as the baseline for the IMS patches.
> If IMS needs the same functionality, then we need to think about it
> slightly differently because IMS is not necessarily tied to PCI.
> Â
> IMS has some similarity to the ARM GIC ITS stuff IIRC, which already
> provides these things outside of PCI. Marc?
>
> We probably need some generic infrastructure for this so PCI and
> everything
> else can use it.

Yeah agreed, I will look at the ARM GIC ITS to see how we can
consolidate this code.

>
> >
> > >
> > > >
> > > > + /*
> > > > + Â* Save the pointer to the first msi_desc
> > > > entry of
> > > > every
> > > > + Â* MSI-X group. This pointer is used by other
> > > > functions
> > > > + Â* as the starting point to iterate through
> > > > each
> > > > of the
> > > > + Â* entries in that particular group.
> > > > + Â*/
> > > > + if (!i)
> > > > + dev->dev.grp_first_desc =
> > > > list_last_entry
> > > > + (dev_to_msi_list(&dev->dev), struct
> > > > msi_desc, list);
> > > How is that supposed to work? The pointer gets overwritten on
> > > every
> > > invocation of that interface. I assume this is merily an
> > > intermediate
> > > storage for setup. Shudder.
> > >
> > Yes, you are right.
> >
> > The grp_first_desc is simply a temporary storage to store the
> > first msi_desc entry of every group, which can be used by other
> > functions to iterate through the entries belonging to that group
> > only,
> > using the for_each_pci_msi_entry/ for_each_msi_entry_from macro. It
> > is
> > not the cleanest of solutions, I agree.
> Yeah, it's too ugly to exist.
>

will get rid of it.

> >
> > With your proposal of supporting a separate group list, I don't
> > think
> > there will be a need to use this kind of temporary storage
> > variable.
> Exactly.
>
> >
> > >
> > > >
> > > > - for_each_pci_msi_entry(entry, dev) {
> > > > + for_each_pci_msi_entry_from(entry, dev) {
> > > Â > +/* Iterate through MSI entries of device dev starting from a
> > > given desc */
> > > Â > +#define for_each_msi_entry_from(desc,
> > > dev)ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ\
> > > Â > +ÂÂÂÂÂÂÂdesc =
> > > (*dev).grp_first_desc;ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ\
> > > Â > +ÂÂÂÂÂÂÂlist_for_each_entry_from((desc),
> > > dev_to_msi_list((dev)),
> > > list)ÂÂ\
> > >
> > > So this hides the whole group stuff behind a hideous iterator.
> > >
> > > for_each_pci_msi_entry_from() ? from what? from the device? Sane
> > > iterators
> > > which have a _from naming, have also a from argument.
> > >
> > This was meant to be "iterate over all the entries belonging to a
> > group", sorry if that was not clear.Â
> >
> > The current 'for_each_pci_msi_entry' macro iterates through all the
> > msi_desc entries belonging to a particular device. Since we have a
> > piecewise allocation of the MSI-X vectors with this change, we
> > would
> > want to iterate only through the newly added entries, i.e the
> > entries
> > allocated to the current group.
> I understand that, but please make macros and function names so they
> are
> halfways self explaining and intuitive.
>

yeah, point taken.

> Â> In V2, I will introduce a new macro,
> 'for_each_pci_msi_entry_group',
> >
> > which will only iterate through the msi_desc entries belonging to a
> > particular group.
> for_each_pci_msi_entry_group()
>
> is ambiguous. It could mean to iterate over the groups.Â
>
> for_each_pci_msi_entry_in_group()
>
> avoids that.

Yes, agreed.
> Â
> >
> > >
> > > >
> > > > - ret = msix_setup_entries(dev, base, entries, nvec,
> > > > affd);
> > > > + ret = msix_setup_entries(dev, dev->base, entries,
> > > > nvec,
> > > > affd, group);
> > > > Â if (ret)
> > > > Â return ret;
> > > Any error exit in this function will leave MSIx disabled. That
> > > means
> > > if
> > > this is a subsequent group allocation which fails for whatever
> > > reason, this
> > > will render all existing and possibly already in use interrupts
> > > unusable.
> > >
> > Hmmm yeah, I hadn't thought about this!
> >
> > So according to the code, we must 'Ensure MSI-X is disabled while
> > it is
> > set up'. MSI-X would be disabled until the setup of the new vectors
> > is
> > complete, even if we do not take the error exit right?
> >
> > Earlier this was not a problem since we disable the MSI-X, setup
> > all
> > the vectors at once, and then enable the MSI-X once and for all.Â
> >
> > I am not sure how to avoid disabling of MSI-X here.
> The problem with your code is that is keeps it disabled in case of an
> error, which makes all existing users (groups) starve.
>
> But, yes there is also the question what happens during the time when
> interrupts are raised on already configured devices exactly during
> the time
> where MSI-X is disabled temporarily to setup a new group. I fear that
> will
> end up with lost interrupts and/or spurious interrupts via the legacy
> INT[ABCD]. That really needs to be investigated _before_ we go there.
>

Hmm, yeah that is my concern, I do not know what the behavior would be.
Any experiments that I can run to know more?

> >
> > >
> > > >
> > > > Âstatic int __pci_enable_msix_range(struct pci_dev *dev,
> > > > Â ÂÂÂstruct msix_entry *entries,
> > > > int
> > > > minvec,
> > > > - ÂÂÂint maxvec, struct
> > > > irq_affinity
> > > > *affd)
> > > > + ÂÂÂint maxvec, struct
> > > > irq_affinity
> > > > *affd,
> > > > + ÂÂÂbool one_shot, int group)
> > > > Â{
> > > > Â int rc, nvec = maxvec;
> > > > Â
> > > > Â if (maxvec < minvec)
> > > > Â return -ERANGE;
> > > > Â
> > > > - if (WARN_ON_ONCE(dev->msix_enabled))
> > > > - return -EINVAL;
> > > So any misbehaving PCI driver can now call into this without
> > > being
> > > caught.
> > >
> > I do not understand what misbehaving PCI driver means :(
> The one which calls into that interface _AFTER_ msix is enabled. We
> catch
> that right now and reject it.
>

Right.
> >
> > Basically this statement is what denies multiple MSI-X vector
> > allocations, and I wanted to remove it so that we could do just
> > that.
> >
> > Please let me know how I could change this.
> There are several ways to do that, but it needs to be made
> conditionally on
> things like 'device has group mode support' ...
> Â

Ok.
> >
> > >
> > > If you want to support group based allocations, then the PCI/MSI
> > > facility
> > > has to be refactored from ground up.
> > >
> > > Â 1) Introduce the concept of groups by adding a group list head
> > > to
> > > struct
> > > ÂÂÂÂÂpci_dev. Ideally you create a new struct pci_dev_msi or
> > > whatever
> > > where
> > > ÂÂÂÂÂall this muck goes into.
> > >
> > I think we can use the existing list_head 'msi_list' in the struct
> > device for this, instead of having a new list_head for the group.
> > So
> > now instead of msi_list being a list of all the msi_desc entries,
> > it
> > will have a list of the different groups associated with the
> > device.
> >
> > IMHO, since IMS is non PCI compliant, having this group_list_head
> > would
> > be better off in struct device than struct pci_dev, which would
> > enable
> > code reuse.
> Sure, but then we really need to look at the IMS requirements in
> order not
> to rewrite this whole thing over and over.
>

Yeah, since IMS is non PCI compliant, if we make all these changes at
the struct device level unlike the struct pci_dev level, I think we can
reuse the same code for all MSI/ MSI like interrupts.

Having said that I will send out an RFC of the initial IMS code, to
avoid any sort of code duplication.
> >
> > >
> > > Â 2) Change the existing code to treat the current allocation
> > > mode as
> > > a
> > > ÂÂÂÂÂgroup allocation. Keep the entries in a new struct
> > > msi_entry_group and
> > > ÂÂÂÂÂhave a group id, list head and the entries in there.
> > >
> > I am thinking of something like this, please let me know if this is
> > what you are suggesting:
> >
> > 1. Introduce a new msi_entry_group struct:
> > struct msi_entry_grp {
> > Â int group_id; // monotonically increasing group_id
> > Â int num_vecs; // number of msi_desc entries per group
> > Â struct list_head group_list; // Added to msi_list in struct
> > device
> > Â struct list_head entry_list; // list of msi_desc entries for this
> > grp
> > }
> Looks about right.

Ok!Â
> >
> > 2. Add a new 'for_each_pci_msi_entry_group' macro. This macro
> > should
> > only iterate through the msi_desc entries belonging to a group.
> See above.
> Â

will update this is V2.

> >
> > 3. The existing for_each_pci_msi_entry, needs to be modified so
> > that it
> > is backward compatible. This macro should still be able to iterate
> > through all the entries in all the groups.Â
> I'm not sure. It might be just the thing which iterates over group 0,
> which
> is the default for all devices which do not use/support group mode,
> but
> let's see.
> Â

Yeah, I am currently trying to get this approach to work i.e. reworking
the for_each_pci_msi_entry macro to be group aware (not gotten it to
work thus far, though).

> Thanks,
>
> tglx