It's not just userspace. Some drivers also call irq_set_affinity{_hint}()
with a non-overlapping affinity mask.
We could just fail these requests with an error, but some drivers rely on
this behavior. Consider the uncore driver I mentioned above, which WARNs
when the affinity setting fails. It tries to set the affinity to the
cpumask of the cluster associated with the device, but with D06's
ITS config, below, there may be no overlap.
Does this PMU use the ITS? That's a pretty odd setup.
So this is a case where the device has an implicit affinity that
isn't that of the ITS. Huhu...
What you are advocating for is a strict adherence to the provided
mask, and it doesn't seem to be what other architectures are providing.
I consider the userspace-provided affinity as a hint more than anything
else, as in this case the kernel does know better (routing the interrupt
to a foreign node might be costly, or even impossible, see the TX1
erratum).
Right
From what I remember of the earlier discussion, you saw an issue on
a system with two sockets and a single ITS, with the node mask limited
to the first socket. Is that correct?
A bit more complicated: 2 sockets, 2 NUMA nodes per socket, and ITS
config as follows:
D06ES
1x ITS with proximity node #0
root@(none)$ dmesg | grep ITS
[ 0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
D06CS
2x ITS with proximity node #0, #2
estuary:/$ dmesg | grep ITS
[    0.000000] SRAT: PXM 0 -> ITS 0 -> Node 0
[    0.000000] SRAT: PXM 2 -> ITS 1 -> Node 2
It complicates things.
We could add extra intelligence to record whether a node has an ITS
associated with it. If it does not, we would fall back on the requested
affinity only (for the no-overlap case). It gets a bit messier then.
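To make the fallback described above concrete, here is a rough sketch in
plain C. It is not the ITS driver's code: cpumask_t is a toy 64-bit
bitmask standing in for struct cpumask, and effective_affinity() and its
req_node_has_its parameter are hypothetical names made up for
illustration. The last branch (ignore the request when the requested
node does have its own ITS) is my assumption about what "hint" would
mean here, not something settled in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t cpumask_t;

/*
 * Hypothetical helper, for illustration only.
 *
 * requested:        mask passed in by userspace or a driver
 * its_node_cpus:    CPUs of the NUMA node the ITS is affine to
 * req_node_has_its: whether the node(s) covered by 'requested' have
 *                   an ITS of their own (the extra state to record)
 */
static cpumask_t effective_affinity(cpumask_t requested,
				    cpumask_t its_node_cpus,
				    bool req_node_has_its)
{
	cpumask_t overlap = requested & its_node_cpus;

	/* Usual case: keep the interrupt on the ITS's own node. */
	if (overlap)
		return overlap;

	/* No overlap and no local ITS: honour the request as given. */
	if (!req_node_has_its)
		return requested;

	/* Assumption: otherwise treat the request as a hint and ignore it. */
	return its_node_cpus;
}
```

With node 0 as CPUs 0-7 (0xFF) and a request for CPUs 8-11 (0xF00) on a
node without an ITS, this returns 0xF00, which is the case the uncore
driver hits on D06ES.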
It looks like part of the problem is that we can't reliably describe
an ITS affine to multiple NUMA nodes... If we could describe that, then
the above situation wouldn't occur (we'd say that ITS-0 covers both
nodes 0 and 1). But I can't find a way to express this with SRAT and
_PXM. Also, SRAT describes the affinity of the ITS with memory, and not
with the CPUs... It is all a bit fsck'd. :-(
I guess I'll apply your change for now with a comment explaining the
situation.