Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

From: Huang, Ying
Date: Fri Feb 23 2024 - 04:13:28 EST


Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Thu, Feb 22, 2024 at 03:10:11PM +0800, Huang, Ying wrote:
>> Hi, Gregory,
>>
>> Thanks a lot for working on this!
>>
>
> It's worth doing :]
>
>> > + new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
>> > + if (!new_bw)
>> > + return -ENOMEM;
>>
>> We only use "node_bw_table" in this function with "default_iwt_lock"
>> held. So, we don't need to copy-on-write? Just change in place?
>>
>
> I'd originally planned to add a sysfs entry for the data, which would
> have added RCU to this, but I realized it's just duplicating the
> node/accessX/initiator information, so I'll rip this out and just do
> in-place changes.
>
>> > + new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
>> > + if (!new_iw) {
>> > + kfree(new_bw);
>> > + return -ENOMEM;
>> > + }
>> > +
>> > + mutex_lock(&default_iwt_lock);
>> > + old_bw = node_bw_table;
>> > + old_iw = rcu_dereference_protected(default_iw_table,
>> > + lockdep_is_held(&default_iwt_lock));
>> > +
>> > + if (old_bw)
>> > + memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
>> > + new_bw[node] = min(coords->read_bandwidth, coords->write_bandwidth);
>>
>> We need to compress two bandwidth data into one. The possible choice
>> could be,
>>
>> - min(coords->read_bandwidth, coords->write_bandwidth), that is, your code
>>
>> - coords->read_bandwidth + coords->write_bandwidth
>>
>> I don't know which one is better. Do you have some use cases to help to
>> determine which one is better?
>
> More generally: are either of the read_bandwidth or write_bandwidth
> values even worth trusting as-is? Should they be combined? Averaged?
> Should we take the minimum? The maximum? Should they be floored to
> some reasonably round number? These are the comments I'm hoping to
> garner :].
>
> I've also considered adding a weighted_interleave/read_write_ratio
> sysfs entry that tells the system how to treat the incoming numbers.
> This would obviously require us to cache more information.
>
> I have limited access to hardware, but here is one datum from an Intel
> platform w/ a sample CXL memory expander.
>
> # DRAM on node0
> cat /sys/bus/node/devices/node0/access0/initiators/*bandwidth
> 262100 < read
> 176100 < write
>
> Notice the ~86 GB/s difference between read and write, and the trailing
> 100! That doesn't look good for a clean reduction :[
>
> # CXL 1.1 device on node2
> cat /sys/bus/node/devices/node2/access0/initiators/*bandwidth
> 60000 < read
> 60000 < write
>
> These are pretty uneven distributions, and we may want to consider
> forcing the numbers to be a little more round; otherwise we're doomed
> to just end up with whatever the ~/100 value is. Or we need to come up
> with some reduction that gets us down to reasonably small interleave
> values.
>
> In this scenario, we end up with:
>>>> 60000+176100
> 236100
>>>> 60000/236100
> 0.25412960609911056
>>>> 176100/236100
> 0.7458703939008895
>
> Which turns into 25:74 if you just round down, or 25:75 if you round up.
>
> The problem is that any heuristic you come up with for rounding out the
> bandwidth values is bound to have degenerate results. What happens if I
> add another CXL memory expander? What happens with 2DPC? etc.
>
> I wanted to collect some thoughts on this. I'm not sure what the best
> "general" approach would be, and we may need some more data from people
> with access to more varied hardware.
>
> Maybe worth submitting to LSF/MM for a quick discussion, but I think
> we'll need some help figuring this one out.
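
FWIW, here is how your example numbers come out under both reductions
(a quick userspace sketch, not patch code):

#include <stdio.h>

int main(void)
{
	/* bandwidths in MB/s from the example above */
	unsigned long dram_rd = 262100, dram_wr = 176100;
	unsigned long cxl_rd = 60000, cxl_wr = 60000;

	/* option 1: min(read, write) per node */
	unsigned long min_bw[2] = {
		dram_rd < dram_wr ? dram_rd : dram_wr,
		cxl_rd < cxl_wr ? cxl_rd : cxl_wr,
	};
	/* option 2: read + write per node */
	unsigned long sum_bw[2] = { dram_rd + dram_wr, cxl_rd + cxl_wr };

	unsigned long ttl_min = min_bw[0] + min_bw[1];
	unsigned long ttl_sum = sum_bw[0] + sum_bw[1];

	printf("min(): %lu:%lu\n", 100 * min_bw[0] / ttl_min,
	       100 * min_bw[1] / ttl_min);
	printf("sum(): %lu:%lu\n", 100 * sum_bw[0] / ttl_sum,
	       100 * sum_bw[1] / ttl_sum);
	return 0;
}

That prints 74:25 with min() and 78:21 with sum(), so the choice does
move the split a bit, but neither gives particularly round numbers.
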
>
>> > +
>> > + /* New recalculate the bandwidth distribution given the new info */
>> > + for (i = 0; i < nr_node_ids; i++)
>> > + ttl_bw += new_bw[i];
>> > +
>> > + /* If node is not set or has < 1% of total bw, use minimum value of 1 */
>> > + for (i = 0; i < nr_node_ids; i++) {
>> > + if (new_bw[i])
>> > + new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);

IIUC, the sum of the interleave weights of all nodes will be about 100.
If there are more than 100 nodes in the system, this doesn't work
properly. How about using some fixed number like "16" for the DRAM
node instead?
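
Something along these lines, that is, scale every node against the
fastest node instead of against the total, so the weights stay small
regardless of the node count (just a sketch; the helper name and the
"16" anchor are arbitrary):

/* Sketch: anchor the fastest (DRAM) node at 16, scale the rest */
static void calc_default_weights(const unsigned long *bw, u8 *iw,
				 int nr_nodes)
{
	unsigned long max_bw = 1;
	int i;

	for (i = 0; i < nr_nodes; i++)
		if (bw[i] > max_bw)
			max_bw = bw[i];

	for (i = 0; i < nr_nodes; i++) {
		if (!bw[i]) {
			iw[i] = 1;	/* no data: minimum weight */
			continue;
		}
		iw[i] = 16 * bw[i] / max_bw;
		if (!iw[i])
			iw[i] = 1;	/* very slow nodes still get 1 */
	}
}
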

>> > + else
>> > + new_iw[i] = 1;
>>
>> If we lack performance data for some node, it will use "1" as the
>> default weight. That doesn't look like the best solution to me. How
>> about using the average available bandwidth to calculate the default
>> weight? Or using the memory bandwidth of node 0 if performance data
>> is missing?
>>
>
> If we lack performance data for a node, it's one of 3 cases:
>
> 1) The device did not provide HMAT information.
> 2) We did not integrate that driver into the system yet.
> 3) The node is not online yet (therefore the data hasn't been reported).
>
> #2 and #3 are not issues; the only real issue is #1.
>
> In this scenario, I'm not sure what to do. We must have a non-zero
> value for that device (to avoid div-by-0), but setting an arbitrarily
> large value also seems bad.

I think that it's reasonable to use the DRAM bandwidth for devices
without data. If there are only DRAM nodes and nodes without data, this
will make the interleave weight "1" for all of them.
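
That is, something like the following on top of your loop (a sketch
only; "dram_bw" is hypothetical, standing for whatever value we pick as
the reference DRAM bandwidth):

/* Sketch: treat a node with no data as if it had DRAM bandwidth */
for (i = 0; i < nr_node_ids; i++)
	if (!new_bw[i])
		new_bw[i] = dram_bw;

for (i = 0; i < nr_node_ids; i++)
	ttl_bw += new_bw[i];

for (i = 0; i < nr_node_ids; i++)
	new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
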

> My thought was that if you are using weighted interleave, you're
> probably already pretty confident that your environment is reasonably
> sane, i.e. HMAT is providing the values.
>
>> > + }
>> > + /*
>> > + * Now attempt to aggressively reduce the interleave weights by GCD
>> > + * We want smaller interleave intervals to have a better distribution
>> > + * of memory, even on smaller memory regions. If weights are divisible
>> > + * by each other, we can do some quick math to aggresively squash them.
>> > + */
>> > +reduce:
>> > + gcd_val = new_iw[i];
>>
>> "i" will be "nr_node_ids" in the first loop. Right?
>>
>
> Ah, good catch. This should be new_iw[node_being_updated]; will update.
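
For reference, I read the intent as roughly the following (a sketch
only, not the patch as posted):

/*
 * Sketch: shrink all weights by their greatest common divisor
 * (gcd() from <linux/gcd.h>) so the interleave interval stays short.
 */
gcd_val = new_iw[node];
for (i = 0; i < nr_node_ids; i++)
	gcd_val = gcd(gcd_val, new_iw[i]);

for (i = 0; i < nr_node_ids; i++)
	new_iw[i] /= gcd_val;
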
>
>> > - weight = table ? table[nid] : 1;
>> > - weight = weight ? weight : 1;
>> > + weight = table ? table[nid] : 0;
>> > + weight = weight ? weight :
>> > + (default_table ? default_table[nid] : 1);
>>
>> This becomes long. I think that we should define a helper function for this.
>
> Maybe? I didn't bother since it's not replicated elsewhere. It does
> look similar to the helper function which fetches the node weight, but
> that function itself calls rcu_read_lock() since it is called from the
> bulk allocator.
>
> I think some more thought could probably be put into this interaction;
> this was just a first pass. Certainly could be improved.
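
Something small would do, for example (a sketch only; the name and
types are made up):

/* Sketch: user-set weight wins, then the system default, then 1 */
static u8 iw_for_node(u8 *table, u8 *default_table, int nid)
{
	if (table && table[nid])
		return table[nid];
	if (default_table && default_table[nid])
		return default_table[nid];
	return 1;
}
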
>
> Thanks for the feedback, will chew on it a bit. Let me know your
> thoughts on the bandwidth examples above if you have any.

--
Best Regards,
Huang, Ying