Re: [PATCH 4/4] cpuset,mm: use rwlock to protect task->mempolicyand mems_allowed

From: Lee Schermerhorn
Date: Fri Mar 05 2010 - 08:05:14 EST


On Thu, 2010-03-04 at 09:31 -0500, Lee Schermerhorn wrote:
> On Wed, 2010-03-03 at 18:52 +0800, Miao Xie wrote:
> > if MAX_NUMNODES > BITS_PER_LONG, loading/storing task->mems_allowed or mems_allowed in
> > task->mempolicy are not atomic operations, and the kernel page allocator gets an empty
> > mems_allowed when updating task->mems_allowed or mems_allowed in task->mempolicy. So we
> > use a rwlock to protect them to fix this probelm.
> >
> > Signed-off-by: Miao Xie <miaox@xxxxxxxxxxxxxx>
> > ---
> > include/linux/cpuset.h | 104 +++++++++++++++++++++++++++++-
> > include/linux/init_task.h | 8 +++
> > include/linux/mempolicy.h | 24 ++++++--
> > include/linux/sched.h | 17 ++++-
> > kernel/cpuset.c | 113 +++++++++++++++++++++++++++------
> > kernel/exit.c | 4 +
> > kernel/fork.c | 13 ++++-
> > mm/hugetlb.c | 3 +
> > mm/mempolicy.c | 153 ++++++++++++++++++++++++++++++++++----------
> > mm/slab.c | 27 +++++++-
> > mm/slub.c | 10 +++
> > 11 files changed, 403 insertions(+), 73 deletions(-)
> >
> <snip>
> >
<snip even more>
> > @@ -1381,8 +1434,16 @@ static struct mempolicy *get_vma_policy(struct task_struct *task,
> > } else if (vma->vm_policy)
> > pol = vma->vm_policy;
> > }
> > + if (!pol) {
> > + read_mem_lock_irqsave(task, irqflags);
> > + pol = task->mempolicy;
> > + mpol_get(pol);
> > + read_mem_unlock_irqrestore(task, irqflags);
> > + }
> > +
>
> Please note that this change is in the fast path of task page
> allocations. We tried real hard when reworking the mempolicy reference
> counts not to reference count the task's mempolicy because only the task
> could change its' own task mempolicy. cpuset rebinding breaks this
> assumption, of course.
>
> I'll run some page fault overhead tests on this series to see whether
> the effect of the additional lock round trip and reference count is
> measurable and unacceptable.

Well, I wanted to run page fault overhead tests, but when I added this
series to the 3March mmotm [applied w/ offsets, no rejects], I got NULL
pointer derefs during boot, and then the remainder of the boot crept
along for a while with looooong pauses between chunks of console output.
Finally appeared to hang.

Stack traces below.

Config available on request.

Lee

-----------------------
Excerpt from console output:

Platform is 8 socket x 6 core AMD numa w/ 512GB memory.

...
Loading drivers, configuring devices: input: Power Button
as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
ACPI: Power Button [PWRF]
ipmi message handler version 39.2
hpilo 0000:00:04.2: PCI INT B -> GSI 41 (level, low) -> IRQ 41
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
BUG: unable to handle kernel NULL pointer dereference at
0000000000000001
IP: [<ffffffff8106197e>] sysctl_check_table+0x277/0x35e
PGD 3049162067 PUD 3049056067 PMD 0
input: PC Speaker as /devices/platform/pcspkr/input/input1
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/LNXSYSTM:00/modalias
CPU 10
Modules linked in: pcspkr cdrom(+) pci_hotplug hpilo ipmi_msghandler
i2c_core container button ohci_hcd uhci_hcd ehci_hcd usbcore edd ext3
mbcache jbd fan ide_pci_generic serverworks ide_core ata_generic
pata_serverworks libata cciss scsi_mod thermal processor thermal_sys
hwmon

Pid: 2429, comm: modprobe Not tainted
2.6.33-mmotm-100302-1838-mx-mempolicy #6 /ProLiant DL785 G6
RIP: 0010:[<ffffffff8106197e>] [<ffffffff8106197e>] sysctl_check_table
+0x277/0x35e
RSP: 0018:ffff8840484a3d58 EFLAGS: 00010246
RAX: 0000000000000002 RBX: ffffffffa0184340 RCX: ffff883049d090c0
RDX: ffffffff81530864 RSI: ffff88104895ca00 RDI: ffffffff816e4910
RBP: ffff8840484a3da8 R08: ffffffff810bb073 R09: 0000000000000000
R10: ffff883049d7eb40 R11: 0000000000000002 R12: 0000000000000001
R13: ffffffffa0184210 R14: 0000000000000002 R15: ffff88104895ca00
FS: 00007f8c4c7316f0(0000) GS:ffff882088240000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000001 CR3: 0000003049250000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 2429, threadinfo ffff8840484a2000, task
ffff8840493aa140)
Stack:
0000000000000000 ffffffff816ab520 0000000200000000 ffff882050001c08
<0> 0000000000000000 ffffffffa01842c0 ffff883049d090c0 ffffffff8140aba0
<0> ffffffff816a4430 ffffffffa0184210 ffff8840484a3e08 ffffffff81061a0d
Call Trace:
[<ffffffff81061a0d>] sysctl_check_table+0x306/0x35e
[<ffffffffa017e004>] ? cdrom_dummy_generic_packet+0x4/0x3c [cdrom]
[<ffffffff8101c461>] ? do_ftrace_mod_code+0xb5/0x147
[<ffffffff81061a0d>] sysctl_check_table+0x306/0x35e
[<ffffffff81049537>] ? __register_sysctl_paths+0x4a/0x297
[<ffffffff810495ec>] __register_sysctl_paths+0xff/0x297
[<ffffffff81094a1b>] ? tracepoint_module_notify+0x2c/0x30
[<ffffffff812f05cf>] ? notifier_call_chain+0x38/0x60
[<ffffffffa00b5000>] ? cdrom_init+0x0/0x6a [cdrom]
EDAC MC: Ver: 2.1.0 Mar 4 2010
[<ffffffff810497b2>] register_sysctl_paths+0x2e/0x30
[<ffffffff810497cc>] register_sysctl_table+0x18/0x1a
[<ffffffffa00b5019>] cdrom_init+0x19/0x6a [cdrom]
[<ffffffff8100020e>] do_one_initcall+0x73/0x180
[<ffffffff810714d8>] sys_init_module+0xd5/0x22e
[<ffffffff81002b9b>] system_call_fastpath+0x16/0x1b
Code: 7c 24 18 00 74 21 49 8b 7d 00 48 85 ff 74 18 e8 3d fb 12 00 85 c0
75 0f 45 85 f6 74 2e 41 ff ce 4d 8b 64 24 18 eb b9 49 83 c4 40 <49> 8b
34 24 48 85 f6 75 c5 4c 89 fe 48 8b 7d b8 e8 b7 7e fe ff
RIP [<ffffffff8106197e>] sysctl_check_table+0x277/0x35e
RSP <ffff8840484a3d58>
CR2: 0000000000000001
QLogic/NetXen Network Driver v4.0.72
netxen_nic 0000:04:00.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
netxen_nic 0000:04:00.0: setting latency timer to 64
---[ end trace 42261946992ac8eb ]---

Then, another one:

...
netxen_nic: Dual XGb SFP+ LP Board S/N CM9BBK0915 Chip rev 0x42
netxen_nic 0000:04:00.0: firmware v4.0.406 [cut-through]
IPMI System Interface driver.
ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 0xca2,
slave address 0x20, irq 0
netxen_nic 0000:04:00.0: irq 72 for MSI/MSI-X
netxen_nic 0000:04:00.0: irq 73 for MSI/MSI-X
netxen_nic 0000:04:00.0: irq 74 for MSI/MSI-X
netxen_nic 0000:04:00.0: irq 75 for MSI/MSI-X
netxen_nic 0000:04:00.0: using msi-x interrupts
BUG: unable to handle kernel NULL pointer dereference at
0000000000000001
IP: [<ffffffff8106197e>] sysctl_check_table+0x277/0x35e
PGD 0
Oops: 0000 [#2] SMP
last sysfs
file: /sys/devices/pci0000:00/0000:00:06.1/host0/target0:0:0/0:0:0:0/type
CPU 0
Modules linked in: ipmi_si(+) shpchp(+) rtc_cmos hid i2c_piix4 tpm_bios
amd64_edac_mod sg rtc_core rtc_lib serio_raw netxen_nic(+) edac_core
pcspkr cdrom(+) pci_hotplug hpilo ipmi_msghandler i2c_core container
button ohci_hcd uhci_hcd ehci_hcd usbcore edd ext3 mbcache jbd fan
ide_pci_generic serverworks ide_core ata_generic pata_serverworks libata
cciss scsi_mod thermal processor thermal_sys hwmon

Pid: 3105, comm: work_for_cpu Tainted: G D
2.6.33-mmotm-100302-1838-mx-mempolicy #6 /ProLiant DL785 G6
RIP: 0010:[<ffffffff8106197e>] [<ffffffff8106197e>] sysctl_check_table
+0x277/0x35e
RSP: 0018:ffff8870495ff9a0 EFLAGS: 00010246
RAX: 0000000000000004 RBX: ffff881048432808 RCX: ffff8810451eb9f0
RDX: ffffffff81530869 RSI: ffff88104895ca00 RDI: ffffffff816e4910
RBP: ffff8870495ff9f0 R08: 0000000000000083 R09: ffffffff81946190
R10: ffff880001a16178 R11: ffff8870495ffc70 R12: 0000000000000001
R13: ffff8810451eb858 R14: 0000000000000004 R15: ffff88104895ca00
FS: 00007f65e914e6f0(0000) GS:ffff880001a00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000001 CR3: 0000000001693000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process work_for_cpu (pid: 3105, threadinfo ffff8870495fe000, task
ffff8870493b8680)
Stack:
ffff8870495ffac0 ffff8870495ffc00 0000000400000000 00000000000412d0
<0> 0000000000000000 ffff8810451eb9d8 ffff8810451eb800 ffff8870495ffc70
<0> ffffffff816e4910 ffff881048432808 ffff8870495ffa50 ffffffff81061a0d
Call Trace:
[<ffffffff81061a0d>] sysctl_check_table+0x306/0x35e
[<ffffffff81061a0d>] sysctl_check_table+0x306/0x35e
[<ffffffff81061a0d>] sysctl_check_table+0x306/0x35e
[<ffffffff81061a0d>] sysctl_check_table+0x306/0x35e
[<ffffffff81048591>] ? sysctl_set_parent+0x29/0x38
[<ffffffff810495ec>] __register_sysctl_paths+0xff/0x297
[<ffffffff812d6b4b>] register_net_sysctl_table+0x50/0x55
[<ffffffff8127207e>] neigh_sysctl_register+0x1e4/0x21d
[<ffffffff810474bf>] ? local_bh_enable_ip+0xc1/0xc6
[<ffffffff812b6d40>] devinet_sysctl_register+0x29/0x44
[<ffffffff812b6e76>] inetdev_init+0x11b/0x158
[<ffffffff812b6f13>] inetdev_event+0x60/0x3d6
[<ffffffff81225706>] ? device_add+0x46e/0x541
[<ffffffff812f05cf>] notifier_call_chain+0x38/0x60
[<ffffffff8105e338>] raw_notifier_call_chain+0x14/0x16
[<ffffffff8126e142>] register_netdevice+0x346/0x3c2
[<ffffffff8126e1fd>] register_netdev+0x3f/0x4d
[<ffffffffa01a7b0e>] netxen_nic_probe+0x8b9/0xac4 [netxen_nic]
[<ffffffff81056910>] ? do_work_for_cpu+0x0/0x2a
[<ffffffff811a54b5>] local_pci_probe+0x17/0x1b
ipmi: Found new BMC (man_id: 0x00000b, prod_id: 0x0000, dev_id: 0x11)
IPMI kcs interface initialized
ipmi_si: Trying SPMI-specified kcs state machine at i/o address 0xca2,
slave address 0x0, irq 0
ipmi_si: duplicate interface
[<ffffffff81056928>] do_work_for_cpu+0x18/0x2a
[<ffffffff81056910>] ? do_work_for_cpu+0x0/0x2a
[<ffffffff81059d54>] kthread+0x82/0x8a
[<ffffffff81003994>] kernel_thread_helper+0x4/0x10
[<ffffffff812ed5e9>] ? restore_args+0x0/0x30
[<ffffffff81059cd2>] ? kthread+0x0/0x8a
[<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10
Code: 7c 24 18 00 74 21 49 8b 7d 00 48 85 ff 74 18 e8 3d fb 12 00 85 c0
75 0f 45 85 f6 74 2e 41 ff ce 4d 8b 64 24 18 eb b9 49 83 c4 40 <49> 8b
34 24 48 85 f6 75 c5 4c 89 fe 48 8b 7d b8 e8 b7 7e fe ff
RIP [<ffffffff8106197e>] sysctl_check_table+0x277/0x35e
RSP <ffff8870495ff9a0>
CR2: 0000000000000001
---[ end trace 42261946992ac8ec ]---

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/