On Sun, 2 Jul 2017, Thomas Gleixner wrote:
Thinking a bit more about that limbo mechanics.
In case that a RMID was never used on a particular package, the state check
forces an IPI on all packages unconditionally. That's suboptimal at least.
We know on which package a given RMID was used, so we could restrict the
checks to exactly these packages, but I'm not sure it's worth the
trouble. We might at least document that and explain why this is
implemented in that way.
Second thoughts on that. The allocation logic is:
+ if (list_empty(&rmid_free_lru)) {
+ ret = try_freeing_limbo_rmid();
+ if (list_empty(&rmid_free_lru))
+ return ret ? -ENOSPC : -EBUSY;
+ }
+
+ entry = list_first_entry(&rmid_free_lru,
+ struct rmid_entry, list);
+ list_del(&entry->list);
+
+ return entry->rmid;
That means, the free list is used as the primary source. One of my boxes
has 143 RMIDs. So it only takes 142 mkdir/rmdir invocations to move all
RMIDs to the limbo list. On the next mkdir invocation the allocation goes
into the limbo path and the SMP function call has to walk the list with 142
entries on ALL online domains whether they used the RMID or not!
That's bad enough already and the number of RMIDs will not become smaller;
it doubled from HSW to BDW ...
The HPC and RT folks will love you for that - NOT!
So this needs to be solved differently.
Let's have a look at the context switch path first. That's the most
sensitive part of it.
if (static_branch_likely(&rdt_mon_enable_key)) {
if (current->rmid)
newstate.rmid = current->rmid;
}
That's optimized for the !monitoring case. So we can really penalize the
per task monitoring case.
if (static_branch_likely(&rdt_mon_enable_key)) {
if (unlikely(current->rmid)) {
newstate.rmid = current->rmid;
__set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
}
}
Now in rmid_free() we can collect that information:
cpumask_clear(&tmpmask);
cpumask_clear(rmid_entry->mask);
cpus_read_lock();
for_each_online_cpu(cpu) {
if (test_and_clear_bit(rmid, per_cpu_ptr(cpu, rmid_bitmap)))
cpumask_set(cpu, tmpmask);
}
for_each_domain(d, resource) {
cpu = cpumask_any_and(d->cpu_mask, tmpmask);
if (cpu < nr_cpu_ids)
cpumask_set(cpu, rmid_entry->mask);
}
list_add(&rmid_entry->list, &limbo_list);
for_each_cpu(cpu, rmid_entry->mask)
schedule_delayed_work_on(cpu, rmid_work);
cpus_read_unlock();
The work function:
boot resched = false;
list_for_each_entry(rme, limbo_list,...) {
if (!cpumask_test_cpu(cpu, rme->mask))
continue;
if (!rmid_is_reusable(rme)) {
resched = true;
continue;
}
cpumask_clear_cpu(cpu, rme->mask);
if (!cpumask_empty(rme->mask))
continue;
/* Ready for reuse */
list_del(rme->list);
list_add(&rme->list, &free_list);
}
The alloc function then becomes:
if (list_empty(&free_list))
return list_empty(&limbo_list) ? -ENOSPC : -EBUSY;
The switch_to() covers the task rmids. The per cpu default rmids can be
marked at the point where they are installed on a CPU in the per cpu
rmid_bitmap. The free path is the same for per task and per cpu.
Another thing which needs some thought it the CPU hotplug code. We need to
make sure that pending work which is scheduled on an outgoing CPU is moved
in the offline callback to a still online CPU of the same domain and not
moved to some random CPU by the workqueue hotplug code.
There is another subtle issue. Assume a RMID is freed. The limbo stuff is
scheduled on all domains which have online CPUs.
Now the last CPU of a domain goes offline before the threshold for clearing
the domain CPU bit in the rme->mask is reached.
So we have two options here:
1) Clear the bit unconditionally when the last CPU of a domain goes
offline.
2) Arm a timer which clears the bit after a grace period
#1 The RMID might become available for reuse right away because all other
domains have not used it or have cleared their bits already.
If one of the CPUs of that domain comes online again and is associated
to that reused RMID again, then the counter content might still contain
leftovers from the previous usage.
#2 Prevents #1 but has it's own issues vs. serialization and coordination
with CPU hotplug.
I'd say we go for #1 as the simplest solution, document it and if really
the need arises revisit it later.
Thanks,
tglx