Re: [PATCH V5 1/4] ACPI: Support Generic Initiator only domains
From: Dan Williams
Date: Wed Nov 13 2019 - 18:20:16 EST
On Wed, Nov 13, 2019 at 9:49 AM Jonathan Cameron
<jonathan.cameron@xxxxxxxxxx> wrote:
>
> On Wed, 13 Nov 2019 21:57:24 +0800
> Tao Xu <tao3.xu@xxxxxxxxx> wrote:
>
> > On 11/13/2019 5:47 PM, Jonathan Cameron wrote:
> > > On Tue, 12 Nov 2019 09:55:17 -0800
> > > Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > >
> > >> [ add Tao Xu ]
> > >>
> > >> On Fri, Oct 4, 2019 at 4:45 AM Jonathan Cameron
> > >> <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >>>
> > >>> Generic Initiators are a new ACPI concept that allows for the
> > >>> description of proximity domains that contain a device which
> > >>> performs memory access (such as a network card) but neither
> > >>> host CPU nor Memory.
> > >>>
> > >>> This patch has the parsing code and provides the infrastructure
> > >>> for an architecture to associate these new domains with their
> > >>> nearest memory processing node.
> > >>
> > >> Thanks for this Jonathan. May I ask how this was tested? Tao has been
> > >> working on qemu support for HMAT [1]. I have not checked if it already
> > >> supports generic initiator entries, but it would be helpful to include
> > >> an example of how the kernel sees these configurations in practice.
> > >>
> > >> [1]: http://patchwork.ozlabs.org/cover/1096737/
> > >
> > > Tested against qemu with SRAT and SLIT table overrides from an
> > > initrd to actually create the node and give it distances
> > > (those all turn up correctly in the normal places). DSDT override
> > > used to move an emulated network card into the GI numa node. That
> > > currently requires the PCI patch referred to in the cover letter.
> > > On arm64 tested both on qemu and real hardware (overrides on tables
> > > even for real hardware as I can't persuade our BIOS team to implement
> > > Generic Initiators until an OS is actually using them.)
> > >
> > > Main real requirement is memory allocations then occur from one of
> > > the nodes at the minimal distance when you are do a devm_ allocation
> > > from a device assigned. Also need to be able to query the distances
> > > to allow load balancing etc. All that works as expected.
> > >
> > > It only has a fairly tangential connection to HMAT in that HMAT
> > > can provide information on GI nodes. Given HMAT code is quite happy
> > > with memoryless nodes anyway it should work. QEMU doesn't currently
> > > have support to create GI SRAT entries let alone HMAT using them.
> > >
> > > Whilst I could look at adding such support to QEMU, it's not
> > > exactly high priority to emulate something we can test easily
> > > by overriding the tables before the kernel reads them.
> > >
> > > I'll look at how hard it is to build an HMAT tables for my test
> > > configs based on the ones I used to test your HMAT patches a while
> > > back. Should be easy if tedious.
> > >
> > > Jonathan
> > >
> > Indeed, HMAT can support Generic Initiator, but as far as I know, QEMU
> > only can emulate a node with cpu and memory, or memory-only. Even if we
> > assign a node with cpu only, qemu will raise error. Considering
> > compatibility, there are lots of work to do for QEMU if we change NUMA
> > or SRAT table.
> >
>
> I faked up a quick HMAT table.
>
> Used a configuration with 3x CPU and memory nodes, 1x memory only node
> and 1x GI node. Two test cases, one where the GI initiator is further than
> the CPU containing nodes from the memory only node (realistic case for
> existing hardware). That behaves as expected and there are no
> /sys/node/bus/nodeX/access0 entries for the GI node
> + appropriate ones for the memory only node as normal.
>
> The other case is more interesting we have the memory only node nearer
> to the GI node than to any of the CPUs. In that case for x86 at least
> the HMAT code is happy to put an access0 directory GI in the GI node
> with empty access0/initiators and the memory node under access0/targets
>
> The memory only node is node4 and the GI node node3.
>
> So relevant dirs under /sys/bus/nodes/devices
>
> node3/access0/initators/ Empty
> node3/access0/targets/node4
This makes sense node3 is an initiator, no other nodes can initiate to it.
> node4/access0/initators/[node3 read_bandwidth write_bandwith etc]
> node4/access0/targets/ Empty
>
> So the result current (I think - the HMAT interface still confuses
> me :) is that a GI node is treated like a CPU node. This might mean
> there is no useful information available if you want to figure out
> which CPU containing node is nearest to Memory when the GI node is
> nearer still.
>
> Is this a problem? I'm not sure...
>
> If we don't want to include GI nodes then we can possibly
> use the node_state(N_CPU, x) method to check before considering
> them, or I guess parse SRAT to extract that info directly.
>
> I tried this and it seems to work so can add patch doing this
> next version if we think this is the 'right' thing to do.
>
> So what do you think 'should' happen?
I think this might be our first case for adding an "access1" instance
by default. I.e. in the case when access0 is not a cpu, then access1
is there to at least show the "local" cpu and let userspace see the
performance difference of cpu vs a specific-initiator access.