Re: [RFC v2 PATCH] mm/percpu.c: fix panic triggered by BUG_ON() falsely

From: zijun_hu
Date: Thu Oct 13 2016 - 20:17:04 EST


On 2016/10/14 7:29, Tejun Heo wrote:
> On Tue, Oct 11, 2016 at 10:00:28PM +0800, zijun_hu wrote:
>> From: zijun_hu <zijun_hu@xxxxxxx>
>>
>> as shown by pcpu_build_alloc_info(), the number of units within a percpu
>> group is deduced by rounding up the number of CPUs within the group to a
>> @upa boundary. therefore, the number of CPUs is not equal to the number of
>> units unless it happens to be aligned to @upa. however, pcpu_page_first_chunk()
>> uses BUG_ON() to assert that the two numbers are equal, so the BUG_ON()
>> may trigger a panic falsely.
>>
>> in order to fix this issue, the number of CPUs is rounded up to @upa and then
>> compared with the number of units. the BUG_ON() is also replaced by a warning
>> and an error return, to keep the system alive as much as possible.
>
> I really can't decode what the actual issue is here. Can you please
> give an example of a concrete case?
>
the right relationship between the number of CPUs @nr_cpus within a percpu group
and the number of units @nr_units within the same group is:
@nr_units == roundup(@nr_cpus, @upa);
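
as a concrete illustration, here is a minimal userspace sketch of that relationship.
it is not kernel code: round_up_to() and expected_nr_units() are hypothetical names
standing in for the kernel's roundup(), and the numbers are made up for the example.
with 6 CPUs and @upa == 4, the group would hold 8 units, so comparing @nr_units
directly with the number of CPUs would fail even though nothing is wrong.

```c
#include <stddef.h>

/*
 * Round n up to the next multiple of step (step must be > 0).
 * Userspace stand-in for the kernel's roundup() macro.
 */
static unsigned int round_up_to(unsigned int n, unsigned int step)
{
	return ((n + step - 1) / step) * step;
}

/*
 * Number of units expected in a group holding nr_cpus CPUs,
 * following @nr_units == roundup(@nr_cpus, @upa).
 */
static unsigned int expected_nr_units(unsigned int nr_cpus, unsigned int upa)
{
	return round_up_to(nr_cpus, upa);
}
```

the two numbers only coincide when @nr_cpus is already a multiple of @upa, for
example whenever @upa == 1.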

my reasoning went as follows:

1) current code segments:

BUG_ON(ai->nr_groups != 1);
BUG_ON(ai->groups[0].nr_units != num_possible_cpus());

2) change to reflect the right relationship between the number of CPUs and units:

BUG_ON(ai->nr_groups != 1);
BUG_ON(ai->groups[0].nr_units != roundup(num_possible_cpus(), @upa));

3) replace BUG_ON() by a warning and an error return, since it seems BUG_ON() isn't
nice, as shown by linus's recent LKML mail:

BUG_ON(ai->nr_groups != 1);
if (ai->groups[0].nr_units != roundup(num_possible_cpus(), @upa))
	return -EINVAL;
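
a rough userspace sketch of the idea in 3), with the warning included, could look
like the code below. it is only an illustration under assumptions: struct
group_info_sketch and check_group0_units() are hypothetical names, the struct is a
simplified stand-in for the real group info, and fprintf() takes the place of a
kernel warning.

```c
#include <errno.h>
#include <stdio.h>

/* Simplified, hypothetical stand-in for one percpu group. */
struct group_info_sketch {
	unsigned int nr_units;
};

/*
 * Return 0 if the group's unit count matches the CPU count rounded
 * up to a multiple of upa; otherwise warn and return -EINVAL
 * instead of bringing the whole system down.
 */
static int check_group0_units(const struct group_info_sketch *g,
			      unsigned int nr_cpus, unsigned int upa)
{
	unsigned int expected = ((nr_cpus + upa - 1) / upa) * upa;

	if (g->nr_units != expected) {
		fprintf(stderr, "unexpected nr_units %u, expected %u\n",
			g->nr_units, expected);
		return -EINVAL;
	}
	return 0;
}
```

the caller then decides how to handle the error, which is the point of not using
BUG_ON() here.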

so 3) is my final change.
for the relationship between the two numbers, see my reply to andrew.

>> @@ -2113,21 +2120,22 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
>>
>> /* allocate pages */
>> j = 0;
>> - for (unit = 0; unit < num_possible_cpus(); unit++)
>> + for (unit = 0; unit < num_possible_cpus(); unit++) {
>> + unsigned int cpu = ai->groups[0].cpu_map[unit];
>> for (i = 0; i < unit_pages; i++) {
>> - unsigned int cpu = ai->groups[0].cpu_map[unit];
>> void *ptr;
>>
>> ptr = alloc_fn(cpu, PAGE_SIZE, PAGE_SIZE);
>> if (!ptr) {
>> pr_warn("failed to allocate %s page for cpu%u\n",
>> - psize_str, cpu);
>> + psize_str, cpu);
>
> And stop making gratuitous changes?
>

this change is just meant to make the code look nicer:
@cpu depends only on @unit, so it can be determined once in the outer loop.
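
the shape of that change, as a small userspace sketch (sum_cpu_ids() is a
hypothetical helper, and the addition merely stands in for the per-page
alloc_fn() call):

```c
#include <stddef.h>

/*
 * Visit every (unit, page) pair. cpu_map[unit] depends only on
 * unit, so it is read once per unit instead of once per page.
 */
static unsigned long sum_cpu_ids(const unsigned int *cpu_map,
				 size_t nr_units, size_t unit_pages)
{
	unsigned long sum = 0;

	for (size_t unit = 0; unit < nr_units; unit++) {
		unsigned int cpu = cpu_map[unit];	/* hoisted lookup */

		for (size_t i = 0; i < unit_pages; i++)
			sum += cpu;	/* stands in for alloc_fn(cpu, ...) */
	}
	return sum;
}
```

the behaviour is the same as with the lookup inside the inner loop; only the
place where @cpu is read changes.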

> Thanks.
>