Re: [PATCH]smp: Fix send func call IPI to empty cpu mask

From: Linus Torvalds
Date: Sat Jan 26 2013 - 15:06:11 EST


On Fri, Jan 25, 2013 at 11:53 PM, Wang YanQing <udknight@xxxxxxxxx> wrote:
> I get below warning every day with 3.7,
> one or two times per day.
>
> [ 2235.186027] WARNING: at /mnt/sda7/kernel/linux/arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2f/0xb8()
> [ 2235.186030] Hardware name: Aspire 4741
> [ 2235.186032] empty IPI mask
> [ 2235.186079] [<c1015cbc>] native_send_call_func_ipi+0x4f/0x57
> [ 2235.186087] [<c1053453>] smp_call_function_many+0x191/0x1a9
> [ 2235.186097] [<c101e074>] native_flush_tlb_others+0x21/0x24
> [ 2235.186101] [<c101e0da>] flush_tlb_page+0x63/0x89
> [ 2235.186105] [<c101d360>] ptep_set_access_flags+0x20/0x26
> [ 2235.186111] [<c108fadd>] do_wp_page+0x234/0x502
> [ 2235.186121] [<c1090825>] handle_pte_fault+0x50d/0x54c
> [ 2235.186148] [<c1090934>] handle_mm_fault+0xd0/0xe2
> [ 2235.186153] [<c12dd143>] __do_page_fault+0x411/0x42d
> [ 2235.186166] [<c12dd167>] do_page_fault+0x8/0xa
> [ 2235.186170] [<c12db31a>] error_code+0x5a/0x60
>
> This patch fixes it.
>
> This patch also fixes a system hang problem:
> if data->cpumask is cleared after passing the check
>
> if (WARN_ONCE(!mask, "empty IPI mask"))
>         return;
> then the problem that 83d349f3 fixed will happen again.

Hmm. We have very consciously tried to avoid the extra copy, although
I'm not entirely sure why (it might possibly hurt on the MAXSMP
configuration).

See for example commit 723aae25d5cd ("smp_call_function_many: handle
concurrent clearing of mask") which fixed another version of this
problem.

But I do agree that it looks like the copy is required, simply because
- as you say - once we've done the "list_add_rcu()" to add it to the
queue, we can have (another) IPI to the target CPU that can now see it
and clear the mask.

So by the time we get to actually send the IPI, the mask might have
been cleared by another IPI. So I do agree that your patch seems
correct, but I really really want to run it by other people.
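
To make the window concrete, here is a minimal userspace sketch of the race and of the copy-based fix. It is not the kernel code: struct call_data, send_ipi(), concurrent_clear() and both call_many_*() helpers are hypothetical stand-ins, a 64-bit word stands in for a cpumask, and the "other CPU" is simulated by a plain function call at the point where the entry would already be visible on the queue.

```c
#include <stdint.h>

/* Hypothetical stand-in for the queued call data; bit n set means
 * CPU n still has to run the function. */
struct call_data {
	uint64_t cpumask;
};

/* Records the mask the last "IPI" was sent with (illustration only). */
static uint64_t last_ipi_mask;

static void send_ipi(uint64_t mask)
{
	last_ipi_mask = mask;
}

/* Simulates target CPUs that, while handling an earlier IPI, already
 * see the newly queued entry and clear their bits in the shared mask. */
static void concurrent_clear(struct call_data *d)
{
	d->cpumask = 0;
}

/* Racy variant: the IPI is sent from the shared mask, which other CPUs
 * may have emptied once the entry is visible on the queue. */
static void call_many_racy(struct call_data *d, uint64_t targets, int race)
{
	d->cpumask = targets;
	/* the entry would be published here (list_add_rcu() in the kernel) */
	if (race)
		concurrent_clear(d);
	send_ipi(d->cpumask);	/* may be an empty mask */
}

/* Copy variant: take a private snapshot before publishing and send the
 * IPI from the snapshot, so later clearing cannot empty it. */
static void call_many_copy(struct call_data *d, uint64_t targets, int race)
{
	uint64_t snapshot;

	d->cpumask = targets;
	snapshot = d->cpumask;	/* private copy, never shared */
	/* the entry would be published here */
	if (race)
		concurrent_clear(d);
	send_ipi(snapshot);
}
```

With the simulated race enabled, the racy variant ends up sending an empty mask while the copy variant still sends the original targets. The cost is the extra mask copy, which is the part that might matter on large-CPU configurations.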

Guys? Original patch on lkml. The other possible fix might be to take
the &call_function.lock earlier in
generic_smp_call_function_interrupt(), so that we can never clear the
bit while somebody is adding entries to the list... But I think it
very much tries to avoid that on purpose right now, with only the last
CPU responding to that IPI taking the lock.

So copying the IPI mask seems to be the reasonable approach. Comments?

Linus