Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)
From: Andrew Lutomirski
Date: Sun May 22 2011 - 08:22:52 EST
On Sat, May 21, 2011 at 10:44 AM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
> Hi Andrew.
>
> On Sat, May 21, 2011 at 10:34 PM, Andrew Lutomirski <luto@xxxxxxx> wrote:
>> On Sat, May 21, 2011 at 8:04 AM, KOSAKI Motohiro
>> <kosaki.motohiro@xxxxxxxxxxxxxx> wrote:
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 3f44b81..d1dabc9 100644
>>>> @@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>>>>
>>>> /* Check if we should syncronously wait for writeback */
>>>> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>>>> + unsigned long nr_active, old_nr_scanned;
>>>> set_reclaim_mode(priority, sc, true);
>>>> + nr_active = clear_active_flags(&page_list, NULL);
>>>> + count_vm_events(PGDEACTIVATE, nr_active);
>>>> + old_nr_scanned = sc->nr_scanned;
>>>> nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>>>> + sc->nr_scanned = old_nr_scanned;
>>>> }
>>>>
>>>> local_irq_disable();
>>>>
>>>> I just tested 2.6.38.6 with the attached patch. It survived dirty_ram
>>>> and test_mempressure without any problems other than slowness, but
>>>> when I hit ctrl-c to stop test_mempressure, I got the attached oom.
>>>
>>> Minchan,
>>>
>>> I'm confused now.
>>> If pages got SetPageActive(), should_reclaim_stall() should never return true.
>>> Can you please explain which bad scenario happened?
>>>
>>> -----------------------------------------------------------------------------------------------------
>>> static void reset_reclaim_mode(struct scan_control *sc)
>>> {
>>> sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
>>> }
>>>
>>> shrink_page_list()
>>> {
>>> (snip)
>>> activate_locked:
>>> SetPageActive(page);
>>> pgactivate++;
>>> unlock_page(page);
>>> reset_reclaim_mode(sc); /// here
>>> list_add(&page->lru, &ret_pages);
>>> }
>>> -----------------------------------------------------------------------------------------------------
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------
>>> bool should_reclaim_stall()
>>> {
>>> (snip)
>>>
>>> /* Only stall on lumpy reclaim */
>>> if (sc->reclaim_mode & RECLAIM_MODE_SINGLE) /// and here
>>> return false;
>>> -----------------------------------------------------------------------------------------------------
>>>
>>
>> I did some tracing and the oops happens from the second call to
>> shrink_page_list after should_reclaim_stall returns true and it hits
>> the same pages in the same order that the earlier call just finished
>> calling SetPageActive on. I have *not* confirmed that the two calls
>> happened from the same call to shrink_inactive_list, but something's
>> certainly wrong in there.
>>
>> This is very easy to reproduce on my laptop.
>
> I would like to confirm this problem.
> Could you show the diff between vanilla 2.6.38.6 and your current 2.6.38.6 + alpha?
> (i.e., I would like to know which patches you added on top of vanilla
> 2.6.38.6 to reproduce this problem)
> I believe you added my crap patch below. Right?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..69d317e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -311,7 +311,8 @@ static void set_reclaim_mode(int priority, struct scan_control *sc,
> */
> if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> sc->reclaim_mode |= syncmode;
> - else if (sc->order && priority < DEF_PRIORITY - 2)
> + else if ((sc->order && priority < DEF_PRIORITY - 2) ||
> + priority <= DEF_PRIORITY / 3)
> sc->reclaim_mode |= syncmode;
> else
> sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
> @@ -1349,10 +1350,6 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
> if (current_is_kswapd())
> return false;
>
> - /* Only stall on lumpy reclaim */
> - if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
> - return false;
> -
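Bah. It's this last hunk. Without it I can't reproduce the oops.
With it, should_reclaim_stall no longer checks RECLAIM_MODE_SINGLE, so
the reset_reclaim_mode call in shrink_page_list has no effect and
shrink_page_list is incorrectly called a second time on the same pages.

So we're back to the original problem...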
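
For anyone following along, here is a toy userspace model of that interaction. It is only a sketch, not the mm/vmscan.c code: the flag values, the MODEL_MODE_LUMPY_SYNC flag, the simplified should_reclaim_stall signature, and the shrink_passes helper are all assumptions made up for illustration. It just shows that once the RECLAIM_MODE_SINGLE check is gone, resetting the mode cannot stop the second pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the names SINGLE/ASYNC come from the thread. */
#define RECLAIM_MODE_SINGLE   (1u << 0)
#define RECLAIM_MODE_ASYNC    (1u << 1)
#define MODEL_MODE_LUMPY_SYNC (1u << 2)  /* stand-in for a lumpy/sync mode */

struct scan_control {
	unsigned int reclaim_mode;
};

static void reset_reclaim_mode(struct scan_control *sc)
{
	sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
}

/*
 * Simplified model: 'pressure' stands in for the priority and
 * nr_reclaimed tests; 'keep_single_check' selects whether the
 * "Only stall on lumpy reclaim" test is still present.
 */
static bool should_reclaim_stall(const struct scan_control *sc,
				 bool pressure, bool keep_single_check)
{
	if (keep_single_check && (sc->reclaim_mode & RECLAIM_MODE_SINGLE))
		return false;
	return pressure;
}

/* Number of shrink_page_list passes over one batch in this model. */
int shrink_passes(bool keep_single_check)
{
	struct scan_control sc = { .reclaim_mode = MODEL_MODE_LUMPY_SYNC };
	int passes = 1;			/* first, asynchronous pass */

	/* The first pass hit activate_locked and reset the mode. */
	reset_reclaim_mode(&sc);
	assert(sc.reclaim_mode & RECLAIM_MODE_SINGLE);

	if (should_reclaim_stall(&sc, true, keep_single_check))
		passes++;		/* second pass over the same pages */

	return passes;
}
```

In this model shrink_passes(true) gives a single pass and shrink_passes(false) gives two, which matches what I see in the traces.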
--Andy