Re: [PATCH] mm:workingset use real time to judge activity of the file page

From: Johannes Weiner
Date: Fri Apr 05 2019 - 15:34:41 EST


On Fri, Apr 05, 2019 at 07:23:46AM +0800, Zhaoyang Huang wrote:
> On Fri, Apr 5, 2019 at 12:39 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> > >
> > > In the previous implementation, the number of refaulted pages is used
> > > to judge the refault period of each page, which is not precise, as
> > > eviction of other files strongly affects the current cache.
> > > We introduce a timestamp into the workingset's entry, plus a refault
> > > ratio, to measure the file page's activity. It helps to reduce the
> > > influence of other files (the average refault ratio can reflect the
> > > view of the whole system's memory).
> >
> > I don't understand what exactly you're saying here, can you please
> > elaborate?
> >
> > The reason it's using distances instead of absolute time is because
> > the ordering of the LRU is relative and not based on absolute time.
> >
> > E.g. if a page is accessed every 500ms, it depends on all other pages
> > to determine whether this page is at the head or the tail of the LRU.
> >
> > So when you refault, in order to determine the relative position of
> > the refaulted page in the LRU, you have to compare it to how fast that
> > LRU is moving. The absolute refault time, or the average time between
> > refaults, is not comparable to what's already in memory.
> How do you know how long it took to drop these pages? Actually, a
> quick drop of a large number of pages will be wrongly treated as a
> slow drop rather than the hard situation it really is. That is to
> say, 100 pages per millisecond and 100 pages per second have the same
> impact on the calculated refault distance, which may give this page
> cache less protection in the former scenario and introduce page
> thrashing. Especially during global reclaim, a round of kswapd
> reclaim woken up by a high-order allocation or a large number of
> single-page allocations may cause such things, as all pages within
> the node are counted in the same LRU. This commit can reduce the
> above by comparing the refault time of a single page with
> avg_refault_time = delta_lru_reclaimed_pages / avg_refault_ratio
> (refault_ratio = lru->inactive_ages / time).

When something like a higher-order allocation drops a large number of
file pages, it's *intentional* that the pages that were evicted before
them become less valuable and less likely to be activated on
refault. There is a finite amount of in-memory LRU space and the pages
that have been evicted the most recently have precedence because they
have the highest proven access frequency.
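
To make the "distance, not time" point concrete, here is a rough
userspace sketch. It is not the mm/workingset.c code; struct lru_clock
and the helper names are made up for illustration. The distance is
purely a count of evictions, so a burst of 1000 evictions ages an
earlier shadow entry by 1000 whether the burst took a millisecond or a
second:

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-LRU eviction counter. */
struct lru_clock {
	uint64_t evictions;	/* bumped once per evicted page */
};

/*
 * Evict one page and return the counter snapshot that would be
 * remembered in the page's shadow entry.
 */
static uint64_t evict_page(struct lru_clock *clk)
{
	return ++clk->evictions;
}

/* Evict n unrelated pages in one burst; no notion of time involved. */
static void evict_burst(struct lru_clock *clk, uint64_t n)
{
	while (n--)
		evict_page(clk);
}

/* Refault distance: how many pages were evicted after this one. */
static uint64_t refault_distance(const struct lru_clock *clk,
				 uint64_t snapshot)
{
	return clk->evictions - snapshot;
}
```

A page evicted right before evict_burst(&clk, 1000) comes back with a
distance of 1000, and that is the ordering information the LRU needs.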

Of course, when a large amount of the cache that was pushed out in
between is not re-used again, and doesn't claim its space in memory,
it would be great if we could then activate the older pages that *are*
re-used again in its stead.

But that would require us being able to look into the future. When an
old page refaults, we don't know if a younger page is still going to
refault with a shorter refault distance or not. If it won't, then we
were right to activate it. If it will refault, then we put something
on the active list whose reuse frequency is too low to be able to fit
into memory, and we thrash the hottest pages in the system.
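
The decision at refault time boils down to something like the sketch
below. Again this is a simplified illustration with made-up names, not
the kernel function: the page is only activated if its refault distance
would have fit into the space the active list currently occupies.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified activation test (illustrative only).
 *
 * refault_distance: pages evicted since this page was evicted
 * active_pages:     current size of the active list
 *
 * If the distance is larger than the active list, the page's reuse
 * interval is too long to fit into memory even with that space, and
 * activating it would only displace hotter pages.
 */
static bool should_activate(uint64_t refault_distance,
			    uint64_t active_pages)
{
	return refault_distance <= active_pages;
}
```

Loosening this test for old pages is exactly the bet on the future
described above: it pays off only if the younger evicted pages never
come back.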

As Matthew says, you are fairly randomly making refault activations
more aggressive (especially with that timestamp unpacking bug), and
while that expectedly boosts workload transition / startup, it comes
at the cost of disrupting stable states because you can flood a very
active in-ram workingset with completely cold cache pages simply
because they refault uniformly wrt each other.