Re: [PATCH 1/2] acpi : cpu hot-remove returns error number whencpu_down() fails
From: Toshi Kani
Date: Tue Jul 10 2012 - 12:36:31 EST
On Tue, 2012-07-10 at 13:27 +0530, Srivatsa S. Bhat wrote:
> On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
> > Hi Toshi,
> >
> > 2012/07/10 6:15, Toshi Kani wrote:
> >> On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
> >>> On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
> >>>> Hi Srivatsa,
> >>>>
> >>>> Thank you for your reviewing.
> >>>>
> >>>> 2012/07/06 18:51, Srivatsa S. Bhat wrote:
> >>>>> On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
> >>>>>> Even if cpu_down() fails, acpi_processor_remove() continues to
> >>>>>> remove the cpu.
> >>>>>
> >>>>> Ouch!
> >>>>>
> >>>>>> But in this case, it should return error number since some process
> >>>>>> may run on
> >>>>>> the cpu. If the cpu has a running process and the cpu is turned
> >>>>>> the power off,
> >>>>>> the system cannot work well.
> >>>>>>
> >>>>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@xxxxxxxxxxxxxx>
> >>>>>>
> >>>>>> ---
> >>>>>> drivers/acpi/processor_driver.c | 18 ++++++++++++------
> >>>>>> 1 file changed, 12 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> >>>>>> ===================================================================
> >>>>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c
> >>>>>> 2012-06-25 04:53:04.000000000 +0900
> >>>>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05
> >>>>>> 21:02:58.711285382 +0900
> >>>>>> @@ -610,7 +610,7 @@ err_free_pr:
> >>>>>> static int acpi_processor_remove(struct acpi_device *device,
> >>>>>> int type)
> >>>>>> {
> >>>>>> struct acpi_processor *pr = NULL;
> >>>>>> -
> >>>>>> + int ret;
> >>>>>>
> >>>>>> if (!device || !acpi_driver_data(device))
> >>>>>> return -EINVAL;
> >>>>>> @@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
> >>>>>> goto free;
> >>>>>>
> >>>>>> if (type == ACPI_BUS_REMOVAL_EJECT) {
> >>>>>> - if (acpi_processor_handle_eject(pr))
> >>>>>> - return -EINVAL;
> >>>>>> + ret = acpi_processor_handle_eject(pr);
> >>>>>> + if (ret)
> >>>>>> + return ret;
> >>>>>> }
> >>>>>>
> >>>>>> acpi_processor_power_exit(pr, device);
> >>>>>> @@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd
> >>>>>>
> >>>>>> static int acpi_processor_handle_eject(struct acpi_processor *pr)
> >>>>>> {
> >>>>>> - if (cpu_online(pr->id))
> >>>>>> - cpu_down(pr->id);
> >>>>>> + int ret;
> >>>>>> +
> >>>>>> + if (cpu_online(pr->id)) {
> >>>>>> + ret = cpu_down(pr->id);
> >>>>>> + if (ret)
> >>>>>> + return ret;
> >>>>>> + }
> >>>>>>
> >>>>>
> >>>>> Strictly speaking, this is not thorough enough. What prevents someone
> >>>>> from onlining that same cpu again, at this point?
> >>>>> So, IMHO, you need to wrap the contents of this function inside a
> >>>>> get_online_cpus()/put_online_cpus() block, to prevent anyone else
> >>>>> from messing with CPU hotplug at the same time.
> >>>>
> >>>> If I understand your comment by mistake, please let me know.
> >>>> If the contents is wrapped a inside
> >>>> get_online_cpus()/put_online_cpus() block
> >>>> as below, cpu_down() will stop since cpu_down() calls
> >>>> cpu_hotplug_begin() and
> >>>> cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.
> >>>>
> >>>
> >>> You are right. Sorry, I overlooked that.
> >>>
> >>>> + get_online_cpus()
> >>>> + if (cpu_online(pr->id)) {
> >>>> + ret = cpu_down(pr->id);
> >>>> + if (ret)
> >>>> + return ret;
> >>>> + }
> >>>> + put_online_cpus()
> >>>>
> >>>> I think following patch can prevent it correctly. How about the patch?
> >>>>
> >>>> ---
> >>>> drivers/acpi/processor_driver.c | 12 ++++++++++++
> >>>> kernel/cpu.c | 8 +++++---
> >>>> 2 files changed, 17 insertions(+), 3 deletions(-)
> >>>>
> >>>> Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
> >>>> ===================================================================
> >>>> --- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09
> >>>> 09:59:01.280211202 +0900
> >>>> +++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09
> >>>> 11:05:34.559859236 +0900
> >>>> @@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
> >>>> {
> >>>> int ret;
> >>>>
> >>>> +retry:
> >>>> if (cpu_online(pr->id)) {
> >>>> ret = cpu_down(pr->id);
> >>>> if (ret)
> >>>> return ret;
> >>>> }
> >>>>
> >>>> + get_online_cpus();
> >>>> + /*
> >>>> + * Someone might online the cpu again at this point. So we
> >>>> check that
> >>>> + * cpu has been onlined or not. If cpu is online, we try to
> >>>> offline
> >>>> + * the cpu again.
> >>>> + */
> >>>> + if (cpu_online(pr->id)) {
> >>>
> >>> How about this:
> >>> if (unlikely(cpu_online(pr->id)) {
> >>> since the probability of this happening is quite small...
> >>>
> >>>> + put_online_cpus();
> >>>> + goto retry;
> >>>> + }
> >>>> arch_unregister_cpu(pr->id);
> >>>> acpi_unmap_lsapic(pr->id);
> >>>> + put_online_cpus();
> >>>> return ret;
> >>>> }
> >>>
> >>> This retry logic doesn't look elegant, but I don't see any better
> >>> method :-(
> >>
> >> Another possible option is to fail the request instead of retrying it.
> >
> > Good idea!! I'll update it.
Sounds good. Thanks Yasuaki for updating it.
> >> It would be quite challenging to allow on-lining and off-lining
> >> operations to run concurrently. In fact, even if we close this window,
> >> there is yet another window right after the new put_online_cpus() call.
> >
> > I think if we close the window, another window does not open.
> > acpi_unmap_lsapic() sets cpu_present mask to false before new
> > put_online_cpus()
> > is called. So even if _cpu_up() is called, the function returns -EINAVL by
> > following added code.
> >
> > @@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
> > unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
> > struct task_struct *idle;
> >
> > - if (cpu_online(cpu) || !cpu_present(cpu))
> > - return -EINVAL;
> > -
> > cpu_hotplug_begin();
> >
> > + if (cpu_online(cpu) || !cpu_present(cpu)) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> >
>
> Right. Yasuaki's patch will ensure that there are no more race conditions
> because it does the cpu_present() check after taking the cpu_hotplug.lock.
> So I think it is safe and still doable from the kernel's perspective.
>
> But the question is, "should we do it?" I think Toshi's suggestion of failing
> the hot-remove request (if we find that the cpu has been onlined again by some
> other task) sounds like a good idea for another reason: cpu hotplug is not
> initiated by the OS by itself; its requested by the user; so if something is
> onlining the cpu back again, the user better take a second look and decide
> what exactly he wants to do with that cpu - whether keep it online or
> hot-remove it out.
>
> Trying to online as well as hot-remove the same cpu simultaneously sounds like
> a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
> (ie., failing that request) would give a warning to the user and a chance to
> reflect upon what exactly he wants to do with the cpu.
>
> So, IMHO, we should protect against the race condition (between cpu_up and
> hot-remove) but choose to fail the hot-remove request, and add a comment saying
> why we chose to fail the request, even though we could have gone ahead and
> completed it.
Agreed. Thanks for the nice summary, Srivatsa!
Thanks,
-Toshi
> Regards,
> Srivatsa S. Bhat
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/