Hi Laurent,
Bear with me while I work through the commit message:
Laurent Dufour <ldufour@xxxxxxxxxxxxx> writes:
> After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
> updated by the hypervisor in the case the NUMA topology of the LPAR's
> memory is updated.
Yes, the RTAS functions ibm,update-nodes and ibm,update-properties,
which the OS invokes after resuming, may bring in updated properties
under the ibm,dynamic-reconfiguration-memory node, including the
ibm,associativity-lookup-arrays property.
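For my own reference, the way a changed property lands in the live tree
looks roughly like this -- a hypothetical, simplified sketch (the real
flow lives in arch/powerpc/platforms/pseries/mobility.c and is driven
by those two RTAS sequences; error unwinding elided):

#include <linux/of.h>
#include <linux/slab.h>
#include <linux/string.h>

static int apply_updated_property(struct device_node *dn, const char *name,
				  const void *val, int len)
{
	struct property *prop;

	prop = kzalloc(sizeof(*prop), GFP_KERNEL);
	if (!prop)
		return -ENOMEM;

	prop->name = kstrdup(name, GFP_KERNEL);
	prop->value = kmemdup(val, len, GFP_KERNEL);
	prop->length = len;

	/* Swaps in the new value and fires the OF reconfig notifiers. */
	return of_update_property(dn, prop);
}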
> This is caught by the kernel,
"Caught" makes me think this is an error condition, as in catching an
exception. I guess "handled" better conveys your meaning?
> but the memory's node is updated because
> there is no way to move a memory block between nodes.
"The memory's node" refers the ibm,dynamic-reconfiguration-memory DT
node, yes? Or is it referring to Linux's NUMA nodes? ("move a memory
block between nodes" in your statement here refers to Linux's NUMA
nodes, that much is clear to me.)
I am failing to follow the cause->effect relationship stated. True,
changing a block's node assignment while it's in use isn't safe. I don't
see why that implies that "the memory's node is updated"? In fact this
seems contradictory.
This statement makes more sense to me if I change it to "the memory's
node is _not_ updated" -- is this what you intended?
> If later a memory block is added or removed, drmem_update_dt() is called
> and it is overwriting the DT node to match the added or removed LMB.
I understand this, but I will expand on it:
dlpar_memory()
-> dlpar_memory_add_by_count()
   -> dlpar_add_lmb()
      -> update_lmb_associativity_index()
         ... lmb->aa_index = <value>
   -> drmem_update_dt()
update_lmb_associativity_index() retrieves the firmware description of
the new block, and sets the aa_index of the matching entry in the
drmem_info array to the value matching the firmware description.
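If it helps to make that concrete, the lookup amounts to something like
this (hypothetical sketch only; the real implementation is
update_lmb_associativity_index() and its helper in
arch/powerpc/platforms/pseries/hotplug-memory.c):

#include <linux/errno.h>
#include <linux/of.h>
#include <linux/string.h>

/* Find the row of ibm,associativity-lookup-arrays that matches the new
 * block's ibm,associativity; the row number becomes lmb->aa_index. */
static int find_matching_row(const __be32 *assoc, const __be32 *lookup)
{
	u32 num_rows = be32_to_cpu(lookup[0]);
	u32 row_len = be32_to_cpu(lookup[1]);
	u32 assoc_len = be32_to_cpu(assoc[0]);
	const __be32 *rows = &lookup[2];
	const __be32 *tail;
	u32 i;

	if (assoc_len < row_len)
		return -EINVAL;

	/* The rows hold the trailing row_len associativity domains. */
	tail = &assoc[1 + assoc_len - row_len];

	for (i = 0; i < num_rows; i++) {
		if (!memcmp(tail, &rows[i * row_len],
			    row_len * sizeof(__be32)))
			return i;	/* becomes lmb->aa_index */
	}
	return -ENOENT;	/* no row matches this block's placement */
}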
Then, drmem_update_dt() walks the drmem_info array and synthesizes a new
/ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory-v2 property based
on the recently updated information in that array.
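And as I read it, every field written out comes from the in-kernel
array, so any stale aa_index is propagated verbatim. A condensed,
hypothetical rendering (the real drmem_update_dt_v2() also coalesces
runs of LMBs into single v2 entries):

#include <asm/drmem.h>

static void regenerate_dynamic_memory_v2(__be32 *prop /* output buffer */)
{
	struct drmem_lmb *lmb;

	for_each_drmem_lmb(lmb) {
		/* Whatever the array holds -- including an aa_index
		 * computed against the source system's lookup arrays --
		 * is written straight back into the device tree. */
		*prop++ = cpu_to_be32(lmb->aa_index);
		/* ... base_addr, drc_index, flags likewise ... */
	}
}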
> But the LMB's associativity node has not been updated after the DT
> node update and thus the node is overwritten by the Linux's topology
> instead of the hypervisor one.
So, an example of the problem is:
1. VM migrates. On resume, ibm,associativity-lookup-arrays is changed
via ibm,update-properties. Entries in the drmem_info array remain
unchanged, with aa_index values that correspond to the source
system's ibm,associativity-lookup-arrays property, now inaccessible.
2. A memory block is added. We look up the new block's entry in the
drmem_info array, and set the aa_index to the value matching the
current ibm,associativity-lookup-arrays.
3. Then, the ibm,dynamic-memory-v2 property is completely
   regenerated from the drmem_info array, which reflects a mixture of
   information from the source and destination systems.
Do I understand correctly?
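To restate the hazard in step 1 in code form (hypothetical helper;
simplified layout): aa_index is only meaningful relative to the lookup
arrays it was computed against.

static const __be32 *lmb_assoc_row(const __be32 *lookup, u32 aa_index)
{
	u32 num_rows = be32_to_cpu(lookup[0]);
	u32 row_len = be32_to_cpu(lookup[1]);

	if (aa_index >= num_rows)
		return NULL;	/* a stale index may even run off the end */

	/* After migration the same index can silently select a row with
	 * a completely different meaning. */
	return &lookup[2 + aa_index * row_len];
}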
> Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
> updated to force an update of the LMB's associativity. However, ignore the
> call to that hook when the update has been triggered by drmem_update_dt().
> Because, in that case, the LMB tree has been used to set the DT property
> and thus it doesn't need to be updated back. Since drmem_update_dt() is
> called under the protection of the device_hotplug_lock and the hook is
> called in the same context, use a simple boolean variable to detect that
> call.
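Just to make sure I follow, the shape of the proposal is roughly this
(hypothetical names; do_update_dt() stands in for the property-writing
guts, and both sites run under device_hotplug_lock, which is the only
reason a bare bool would be safe):

static bool in_drmem_update;	/* protected by device_hotplug_lock */

int drmem_update_dt(void)
{
	int rc;

	in_drmem_update = true;
	rc = do_update_dt();	/* fires the DT-update hook below */
	in_drmem_update = false;
	return rc;
}

/* Hook invoked whenever ibm,dynamic-reconfiguration-memory changes. */
void drmem_dt_updated(void)
{
	if (in_drmem_update)
		return;		/* we wrote the property; LMBs already match */

	/* Otherwise the hypervisor changed it (e.g. after LPM):
	 * refresh each LMB's aa_index from the new property. */
}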
This strikes me as almost a revert of e978a3ccaa71 ("powerpc/pseries:
remove obsolete memory hotplug DT notifier code").
I'd rather avoid smuggling information through global state when it
ought to be passed in function parameters, if it should be passed
around at all. Despite having (IMO) relatively simple
responsibilities, this code
is difficult to change and review; adding this property makes it
worse. If the structure of the code is pushing us toward this kind of
compromise, then the code probably needs more fundamental changes.
I'm probably forgetting something -- can anyone remind me why we need an
array of these:
struct drmem_lmb {
	u64 base_addr;	/* start address of the LMB */
	u32 drc_index;	/* DR connector index assigned by firmware */
	u32 aa_index;	/* row in ibm,associativity-lookup-arrays */
	u32 flags;	/* DRCONF_MEM_* state bits */
};
which is just a less efficient representation of what's already in the
device tree? If we got rid of it, would this problem disappear?
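For the sake of argument, consuming the property in place might look
something like this (entry layout per the ibm,dynamic-memory-v2 format;
the walker and its callback are made up):

#include <linux/of.h>

/* Hypothetical: walk ibm,dynamic-memory-v2 directly instead of keeping
 * the drmem_info mirror. Each entry describes a run of seq_lmbs LMBs:
 *   u32 seq_lmbs; u64 base_addr; u32 drc_index; u32 aa_index; u32 flags;
 */
static void walk_dynamic_memory_v2(struct device_node *dn,
				   void (*fn)(u32 seq_lmbs, u64 base,
					      u32 drc, u32 aa, u32 flags))
{
	const __be32 *p;
	u32 i, entries;

	p = of_get_property(dn, "ibm,dynamic-memory-v2", NULL);
	if (!p)
		return;

	entries = be32_to_cpu(*p++);
	for (i = 0; i < entries; i++) {
		u32 seq_lmbs = be32_to_cpu(*p++);
		u64 base = of_read_number(p, 2);
		u32 drc, aa, flags;

		p += 2;
		drc = be32_to_cpu(*p++);
		aa = be32_to_cpu(*p++);
		flags = be32_to_cpu(*p++);

		/* Hand each run to the caller; no cached state to go
		 * stale across a migration. */
		fn(seq_lmbs, base, drc, aa, flags);
	}
}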