Re: [Experimental][PATCH] putback_lru_page rework

From: Lee Schermerhorn
Date: Thu Jun 19 2008 - 10:46:05 EST


On Thu, 2008-06-19 at 09:22 +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Jun 2008 14:21:06 -0400
> Lee Schermerhorn <Lee.Schermerhorn@xxxxxx> wrote:
>
> > On Wed, 2008-06-18 at 18:40 +0900, KAMEZAWA Hiroyuki wrote:
> > > Lee-san, how about this ?
> > > Tested on x86-64 and tried Nisimura-san's test et al. It works well now.
> >
> > I have been testing with my workload on both ia64 and x86_64 and it
> > seems to be working well. I'll let the tests run for a day or so.
> >
> thank you.
> <snip>

Update:

On x86_64 [32GB, 4x dual-core Opteron], my workload has been running for
~20 hours 40 minutes. Still running.

On ia64 [32GB, 16 cpus, 4 nodes], the system started hitting softlockups
after ~7 hours. The stack trace [below] points at the zone lru_lock taken
in __page_cache_release(), called from put_page()--either heavy contention
or a failure to unlock. Note that in the previous run, with the earlier
patches to putback_lru_page() and unmap_and_move(), the same load ran for
~18 hours before I shut it down to try these patches.
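
For orientation, the lock in question is the per-zone LRU lock taken when
the last reference to an LRU page is dropped. A rough paraphrase of the
2.6.26-era put_page()/__page_cache_release() path (from memory, not the
exact tree under test):

static void __page_cache_release(struct page *page)
{
	if (PageLRU(page)) {
		unsigned long flags;
		struct zone *zone = page_zone(page);

		/* the spin_lock_irqsave() the trace is spinning in */
		spin_lock_irqsave(&zone->lru_lock, flags);
		VM_BUG_ON(!PageLRU(page));
		__ClearPageLRU(page);
		del_page_from_lru(zone, page);
		spin_unlock_irqrestore(&zone->lru_lock, flags);
	}
	free_hot_page(page);
}

If every cpu is funneling exiting/execing tasks through that one lock,
heavy contention alone could be enough to trip the softlockup detector.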

I'm going to try again with the collected patches posted by Kosaki-san
[for which, thanks!]. If the lockup occurs again, I'll deconfigure the
unevictable lru feature and see whether it still reproduces without it;
it may be unrelated to the unevictable lru patches.

>
> > > @@ -240,6 +232,9 @@ static int __munlock_pte_handler(pte_t *
> > > struct page *page;
> > > pte_t pte;
> > >
> > > + /*
> > > + * page is never be unmapped by page-reclaim. we lock this page now.
> > > + */
> >
> > I don't understand what you're trying to say here. That is, what the
> > point of this comment is...
> >
> We access the page table without taking the pte_lock. But this vma is
> mlocked and the migration race is handled, so we don't need to be too
> nervous about accessing the pte. I'll consider more meaningful wording.

OK, so you just want to note that we're accessing the pte w/o locking
and that this is safe because the vma has been VM_LOCKED and all pages
should be mlocked?

I'll note that the vma is NOT VM_LOCKED during the pte walk:
munlock_vma_pages_range() clears the flag so that try_to_unlock(), called
from munlock_vma_page(), won't simply re-mlock the page. However, we hold
the mmap sem for write, so faults are held off--there's no need to worry
about a COW fault occurring between the time VM_LOCKED is cleared and the
time the page is munlocked. If such a fault could occur, it would open a
window in which a non-mlocked page is mapped into this vma, and page
reclaim could potentially unmap that page. This shouldn't be an issue as
long as we never downgrade the semaphore to read during munlock.
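
To make that ordering concrete, here is a minimal sketch. It is not the
actual patch: the real series walks the ptes via __munlock_pte_handler()
(quoted above), and the follow_page() loop below merely stands in for that
walk; munlock_vma_page() and try_to_unlock() are names from the series
under discussion.

/*
 * Sketch only: illustrates the ordering described above.  Caller
 * holds mmap_sem for write, so no fault can map a new, non-mlocked
 * page into the vma while VM_LOCKED is clear.
 */
static void munlock_vma_pages_range_sketch(struct vm_area_struct *vma,
					   unsigned long start,
					   unsigned long end)
{
	unsigned long addr;

	/*
	 * Clear VM_LOCKED first so that try_to_unlock(), called from
	 * munlock_vma_page(), doesn't just re-mlock each page.
	 */
	vma->vm_flags &= ~VM_LOCKED;

	/*
	 * Walk the range without taking the pte_lock: the pages are
	 * still mlocked (so reclaim hasn't unmapped them), and the
	 * write-held mmap_sem keeps COW faults out of the window.
	 */
	for (addr = start; addr < end; addr += PAGE_SIZE) {
		struct page *page = follow_page(vma, addr, FOLL_GET);

		if (!page)
			continue;
		lock_page(page);
		munlock_vma_page(page);	/* from the unevictable-LRU series */
		unlock_page(page);
		put_page(page);
	}
}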

Lee

----------
softlockup stack trace for "usex" workload on ia64:

BUG: soft lockup - CPU#13 stuck for 61s! [usex:124359]
Modules linked in: ipv6 sunrpc dm_mirror dm_log dm_multipath scsi_dh dm_mod pci_slot fan dock thermal sg sr_mod processor button container ehci_hcd ohci_hcd uhci_hcd usbcore

Pid: 124359, CPU 13, comm: usex
psr : 00001010085a6010 ifs : 8000000000000000 ip : [<a00000010000a1a0>] Tainted: G D (2.6.26-rc5-mm3-kame-rework+mcl_inherit)
ip is at ia64_spinlock_contention+0x20/0x60
unat: 0000000000000000 pfs : 0000000000000081 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : a65955959a96e969
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001001264a0 b6 : a0000001006f0350 b7 : a00000010000b940
f6 : 0ffff8000000000000000 f7 : 1003ecf3cf3cf3cf3cf3d
f8 : 1003e0000000000000001 f9 : 1003e0000000000000015
f10 : 1003e000003a82aaab1fb f11 : 1003e0000000000000000
r1 : a000000100c03650 r2 : 000000000000038a r3 : 0000000000000001
r8 : 00000010085a6010 r9 : 0000000000080028 r10 : 000000000000000b
r11 : 0000000000000a80 r12 : e0000741aaac7d50 r13 : e0000741aaac0000
r14 : 0000000000000000 r15 : a000400741329148 r16 : e000074000060100
r17 : e000076000078e98 r18 : 0000000000000015 r19 : 0000000000000018
r20 : 0000000000000003 r21 : 0000000000000002 r22 : e000076000078e88
r23 : e000076000078e80 r24 : 0000000000000001 r25 : 0240000000080028
r26 : ffffffffffff04d8 r27 : 00000010085a6010 r28 : 7fe3382473f8b380
r29 : 9c00000000000000 r30 : 0000000000000001 r31 : e000074000061400

Call Trace:
[<a000000100015e00>] show_stack+0x80/0xa0
sp=e0000741aaac79b0 bsp=e0000741aaac1528
[<a000000100016700>] show_regs+0x880/0x8c0
sp=e0000741aaac7b80 bsp=e0000741aaac14d0
[<a0000001000fbbe0>] softlockup_tick+0x2e0/0x340
sp=e0000741aaac7b80 bsp=e0000741aaac1480
[<a0000001000a9400>] run_local_timers+0x40/0x60
sp=e0000741aaac7b80 bsp=e0000741aaac1468
[<a0000001000a9460>] update_process_times+0x40/0xc0
sp=e0000741aaac7b80 bsp=e0000741aaac1438
[<a00000010003ded0>] timer_interrupt+0x1b0/0x4a0
sp=e0000741aaac7b80 bsp=e0000741aaac13d0
[<a0000001000fc480>] handle_IRQ_event+0x80/0x120
sp=e0000741aaac7b80 bsp=e0000741aaac1398
[<a0000001000fc660>] __do_IRQ+0x140/0x440
sp=e0000741aaac7b80 bsp=e0000741aaac1338
[<a0000001000136d0>] ia64_handle_irq+0x3f0/0x420
sp=e0000741aaac7b80 bsp=e0000741aaac12c0
[<a00000010000c120>] ia64_native_leave_kernel+0x0/0x270
sp=e0000741aaac7b80 bsp=e0000741aaac12c0
[<a00000010000a1a0>] ia64_spinlock_contention+0x20/0x60
sp=e0000741aaac7d50 bsp=e0000741aaac12c0
[<a0000001006f0350>] _spin_lock_irqsave+0x50/0x60
sp=e0000741aaac7d50 bsp=e0000741aaac12b8

Probably zone lru_lock in __page_cache_release().

[<a0000001001264a0>] put_page+0x100/0x300
sp=e0000741aaac7d50 bsp=e0000741aaac1280
[<a000000100157170>] free_page_and_swap_cache+0x70/0xe0
sp=e0000741aaac7d50 bsp=e0000741aaac1260
[<a000000100145a10>] exit_mmap+0x3b0/0x580
sp=e0000741aaac7d50 bsp=e0000741aaac1210
[<a00000010008b420>] mmput+0x80/0x1c0
sp=e0000741aaac7e10 bsp=e0000741aaac11d8

NOTE: all cpus show similar stack traces above here. Some, however, get
here from do_exit()/exit_mm(), rather than via execve().

[<a00000010019c2c0>] flush_old_exec+0x5a0/0x1520
sp=e0000741aaac7e10 bsp=e0000741aaac10f0
[<a000000100213080>] load_elf_binary+0x7e0/0x2600
sp=e0000741aaac7e20 bsp=e0000741aaac0fb8
[<a00000010019b7a0>] search_binary_handler+0x1a0/0x520
sp=e0000741aaac7e20 bsp=e0000741aaac0f30
[<a00000010019e4e0>] do_execve+0x320/0x3e0
sp=e0000741aaac7e20 bsp=e0000741aaac0ed0
[<a000000100014d00>] sys_execve+0x60/0xc0
sp=e0000741aaac7e30 bsp=e0000741aaac0e98
[<a00000010000b690>] ia64_execve+0x30/0x140
sp=e0000741aaac7e30 bsp=e0000741aaac0e48
[<a00000010000bfa0>] ia64_ret_from_syscall+0x0/0x20
sp=e0000741aaac7e30 bsp=e0000741aaac0e48
[<a000000000010720>] __start_ivt_text+0xffffffff00010720/0x400
sp=e0000741aaac8000 bsp=e0000741aaac0e48


