On Fri, Feb 16, 2018 at 10:06:29AM -0800, Raghavendra Rao Ananta wrote:I've gone through your diff, and it gave me a hint of similar texture what we are trying to do (except for maintaining an offline event list). Nevertheless, I tried to test your patch. I created an hw event, and tried to offline the CPU in parallel, and I immediately hit a watchdog soft lockup bug! Tried the same this by first switching off the CPU (without any event created), and I hit into similar issue. I am sure we can fix it, but apart from the "why we are doing hotplug?" question, was was there specifically any issue with our patch?
No this is absolutely disguisting. You can simply keep the events in theKeeping the events in the dead CPU's context was also an idea that we had.
dead CPU's context. It's really not that hard.
However, detaching that event from the PMU when the CPU is offline would be
a pain. Consider the scenario in which an event is about to be destroyed
when the CPU is offline (yet still attached to the CPU). During it's
destruction, a cross-cpu call is made (from perf_remove_from_context()) to
the offlined CPU to detach the event from the CPU's PMU. As the CPU is
offline, that would not be possible, and again a separate logic has to be
written for cleaning up the events whose CPUs are offlined.
That is actually really simple to deal with. The real problems are with
semantics, is an event enabled when the CPU is dead? Can you
disable/enable an event on a dead CPU.
The below patch (_completely_ untested) should do most of it, but needs
help with the details. I suspect we want to allow enable/disable on
events that are on a dead CPU, and equally I think we want to account
the time an enabled event spends on a dead CPU to go towards the
'enabled' bucket.
I wanted to understand, if we no longer care about hotplugging of CPUs, then why do we still have exported symbols such as cpu_up() and cpu_down()? Moreover, we also have the hotplug interface exposed to users-space as well (through sysfs). As long as these interfaces exist, there's always a potential chance of bringing the CPU up/down. Can you please clear this thing up for me?
Also, you _still_ don't explain why you care about dead CPUs.