Re: Detecting page cache trashing state

From: Zdenek Kabelac
Date: Fri Sep 15 2017 - 07:55:44 EST


On 15.9.2017 at 02:16, Taras Kondratiuk wrote:
Hi

On our devices, under low-memory conditions we often get into a thrashing
state where the system spends most of its time re-reading pages of .text
sections from a file system (squashfs in our case). The working set doesn't
fit into the available page cache, so this is expected. The issue is that
the OOM killer doesn't get triggered because there is still memory to
reclaim. The system may be stuck in this state for quite some time and
usually dies because of watchdogs.

We are trying to detect such a thrashing state early so we can take
preventive action. It should be a pretty common issue, but so far we
haven't found any existing VM/IO statistics that can reliably detect such
a state.

Most metrics provide absolute values: number/rate of page faults,
rate of IO operations, number of stolen pages, etc. For a specific
device configuration we can determine threshold values for those
parameters that will detect the thrashing state, but that is not feasible
for hundreds of device configurations.
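
For illustration only, here is a minimal sketch of what sampling one such
absolute metric could look like from userspace. It assumes the cumulative
major fault counter is exposed as the pgmajfault line in /proc/vmstat. The
function names and any threshold compared against the result are
hypothetical and would have to be tuned per device configuration, which is
exactly the limitation described above.

```python
def read_pgmajfault(path="/proc/vmstat"):
    """Return the cumulative major page fault count from a vmstat-style file."""
    with open(path) as f:
        for line in f:
            # Each line has the form "<counter_name> <value>".
            name, _, value = line.partition(" ")
            if name == "pgmajfault":
                return int(value)
    raise KeyError("pgmajfault not found in " + path)


def fault_rate(before, after, interval_s):
    """Major faults per second between two cumulative samples."""
    return (after - before) / interval_s


# Hypothetical usage: take two samples interval_s seconds apart and compare
# fault_rate(...) against a per-device threshold chosen by experiment.
```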

We are looking for some relative metric like "percent of CPU time spent
handling major page faults". With such a relative metric we could use a
common threshold across all devices. For now we have added such a metric
to /proc/stat in our kernel, but we would like to find some mechanism
available in the upstream kernel.

Has anybody faced a similar issue? How are you solving it?

Hi

Well, I witness this when running Firefox & Thunderbird on my desktop for a while on a machine with just 4G of RAM, until these 2 apps eat all the free RAM...

It gets to the point (when I open a new tab) where the mouse hardly moves - kswapd eats CPU (I have no swap, in fact - so it's likely just page caching).

The only 'quick' solution for me as a desktop user is to manually invoke the OOM killer with the SYSRQ+F key - and I'm also wondering why the system is not reacting better.
In most cases it kills one of those 2 - but sometimes it kills the whole X session...
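
For reference, the same manual OOM kill can be requested without the keyboard by writing the letter f to /proc/sysrq-trigger. A minimal sketch follows, assuming it is run as root. The function name is hypothetical, and which process gets killed is still up to the kernel.

```python
def trigger_oom_kill(path="/proc/sysrq-trigger"):
    """Write 'f' to the sysrq trigger file to request one OOM killer run."""
    # Equivalent to pressing SysRq+F; needs root on a real system.
    with open(path, "w") as f:
        f.write("f")
```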


Regards

Zdenek