Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

From: Minchan Kim
Date: Wed Nov 03 2010 - 19:49:30 EST


Hello.

On Thu, Nov 4, 2010 at 7:40 AM, Mandeep Singh Baines <msb@xxxxxxxxxxxx> wrote:
> Rik van Riel (riel@xxxxxxxxxx) wrote:
>> On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:
>>
>> >Yes, this prevents you from reclaiming the active list all at once. But if the
>> >memory pressure doesn't go away, you'll start to reclaim the active list
>> >little by little. First you'll empty the inactive list, and then you'll start
>> >scanning the active list and pulling pages from inactive to active. The
>> >problem is that there is no minimum time limit to how long a page will sit in
>> >the inactive list before it is reclaimed. It just depends on the scan rate,
>> >which does not depend on time.
>> >
>> >In my experiments, I saw the active list get smaller and smaller
>> >over time until eventually it was only a few MB at which point the system came
>> >grinding to a halt due to thrashing.
>>
>> I believe that changing the active/inactive ratio has other
>> potential thrashing issues.  Specifically, when the inactive
>> list is too small, pages may not stick around long enough to
>> be accessed multiple times and get promoted to the active
>> list, even when they are in active use.
>>
>> I prefer a more flexible solution, that automatically does
>> the right thing.
>>
>> The problem you see is that the file list gets reclaimed
>> very quickly, even when it is already very small.
>>
>> I wonder if a possible solution would be to limit how fast
>> file pages get reclaimed, when the page cache is very small.
>> Say, inactive_file * active_file < 2 * zone->pages_high ?
>>
>> At that point, maybe we could slow down the reclaiming of
>> page cache pages to be significantly slower than they can
>> be refilled by the disk.  Maybe 100 pages a second - that
>> can be refilled even by an actual spinning metal disk
>> without even the use of readahead.
>>
>> That can be rounded up to one batch of SWAP_CLUSTER_MAX
>> file pages every 1/4 second, when the number of page cache
>> pages is very low.
>>
>> This way HPC and virtual machine hosting nodes can still
>> get rid of totally unused page cache, but on any system
>> that actually uses page cache, some minimal amount of
>> cache will be protected under heavy memory pressure.
>>
>> Does this sound like a reasonable approach?
>>
>> I realize the threshold may have to be tweaked...
>>
>> The big question is, how do we integrate this with the
>> OOM killer?  Do we pretend we are out of memory when
>> we've hit our file cache eviction quota and kill something?
>>
>> Would there be any downsides to this approach?
>>
>> Are there any volunteers for implementing this idea?
>> (Maybe someone who needs the feature?)
>>
>
> I've created a patch which takes a slightly different approach.
> Instead of limiting how fast pages get reclaimed, the patch limits
> how fast the active list gets scanned. This should result in the
> active list being a better measure of the working set. I've seen
> fairly good results with this patch and a scan interval of 1
> centisecond. I see no thrashing when the scan interval is non-zero.
>
> I've made it a tunable because I don't know what to set the scan
> interval. The final patch could set the value based on HZ and some
> other system parameters. Maybe relate it to sched_period?
>
> ---
>
> [PATCH] vmscan: add a configurable scan interval
>
> On ChromiumOS, we see a lot of thrashing under low memory. We do not
> use swap, so the mm system can only free file-backed pages. Eventually,
> we are left with few file-backed pages remaining (a few MB) and the
> system becomes unresponsive due to thrashing.
>
> Our preference is for the system to OOM instead of becoming unresponsive.
>
> This patch creates a tunable, vmscan_interval_centisecs, for controlling
> the minimum interval between active list scans. At 0, I see the same
> thrashing. At 1, I see no thrashing. The mm system does a good job
> of protecting the working set. If a page has been referenced in the
> last vmscan_interval_centisecs it is kept in memory.
>
> Signed-off-by: Mandeep Singh Baines <msb@xxxxxxxxxxxx>

vmscan already uses HZ/10 to wait out writeback congestion, among
other things.
(But I don't know why the VM uses that value or what rationale was behind
it. It might be a value determined by experiment.)
If there isn't any good math, we will have to depend on experiment this
time, too.

Anyway, if the interval is long, the inactive list could become very
small under many reclaim workloads, which would then cause unnecessary
OOM kills.
So I hope that if the inactive list is very small compared to the active
list, we skip the interval check and refill the inactive list.
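
To make the suggestion concrete, here is a rough userspace sketch of
the check I have in mind. It is not kernel code and not taken from the
patch: struct lru_model, may_scan_active(), SCAN_INTERVAL_TICKS and the
"inactive < active/8" threshold are all hypothetical names and values
chosen only for illustration. A real threshold would need experiments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one LRU pair; not the kernel's data structures. */
struct lru_model {
	unsigned long nr_active;	/* pages on the active list */
	unsigned long nr_inactive;	/* pages on the inactive list */
	unsigned long last_scan;	/* tick of the last active-list scan */
};

/* Stands in for vmscan_interval_centisecs, expressed in ticks (assumed). */
#define SCAN_INTERVAL_TICKS	1UL
/* Assumed cutoff: the inactive list is "very small" below active/8. */
#define INACTIVE_LOW_SHIFT	3

static bool inactive_is_tiny(const struct lru_model *lru)
{
	return lru->nr_inactive < (lru->nr_active >> INACTIVE_LOW_SHIFT);
}

/*
 * Return true if the active list may be scanned at tick 'now'.
 * Normally the minimum interval must have elapsed, but the interval
 * check is skipped when the inactive list is tiny, so that it can be
 * refilled instead of running dry and triggering an OOM kill.
 */
static bool may_scan_active(struct lru_model *lru, unsigned long now)
{
	assert(lru != NULL);

	if (inactive_is_tiny(lru) ||
	    now - lru->last_scan >= SCAN_INTERVAL_TICKS) {
		lru->last_scan = now;
		return true;
	}
	return false;
}
```

With 800 active and 200 inactive pages, a second scan in the same tick
is refused; once the inactive list drops below 100 pages, the scan is
allowed regardless of the interval.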

Anyway, the approach makes sense to me.
But it needs other people's opinions.

Nitpick:
I expect you will include a description of the knob in
Documentation/sysctl/vm.txt in your formal patch.

--
Kind regards,
Minchan Kim