Re: [RFC][PATCH 0/8] Critical Page Pool

From: Matthew Dobson
Date: Fri Nov 18 2005 - 14:50:51 EST


Avi Kivity wrote:
> Matthew Dobson wrote:
>
>> We have a clustering product that needs to be able to guarantee that the
>> networking system won't stop functioning in the case of OOM/low memory
>> condition. The current mempool system is inadequate because to keep the
>> whole networking stack functioning, we need more than 1 or 2 slab
>> caches to
>> be guaranteed. We need to guarantee that any request made with a
>> specific
>> flag will succeed, assuming of course that you've made your "critical
>> page
>> pool" big enough.
>>
>> The following patch series implements such a critical page pool. It
>> creates 2 userspace triggers:
>>
>> /proc/sys/vm/critical_pages: write the number of pages you want to
>> reserve
>> for the critical pool into this file
>>
>> /proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>> the system is in an emergency state and authorize the kernel to dip into
>> the critical pool to satisfy critical allocations.
>>
>> We mark critical allocations with the __GFP_CRITICAL flag, and when the
>> system is in an emergency state, we are allowed to delve into this
>> pool to
>> satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
>> normal means.
>>
>>
>>
> 1. If you have two subsystems which allocate critical pages, how do you
> protect against the condition where one subsystem allocates all the
> critical memory, causing the second to oom?

You don't. You make sure that you size the critical pool appropriately for
your workload.


> 2. There already exists a critical pool: ordinary allocations fail if
> free memory is below some limit, but special processes (kswapd) can
> allocate that memory by setting PF_MEMALLOC. Perhaps this should be
> extended, possibly with a per-process threshold.

The exception for threads with PF_MEMALLOC set is there because those
threads are essentially promising that if the kernel gives them memory,
they will use that memory to free up MORE memory. If we ignore that
promise, and (ab)use the PF_MEMALLOC flag to simply bypass the
zone_watermarks, we'll simply OOM faster, and potentially in situations
that could be avoided (ie: we steal memory that kswapd could have used to
free up more memory).

Thanks for your feedback!

-Matt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/