Re: 2.6.0-test9 - poor swap performance on low end machines

From: Roger Luethi
Date: Mon Dec 08 2003 - 14:53:29 EST


I've been looking at this during the past few months. I will sketch
out a few of my findings below. I can follow up with details and
actual data if necessary.

On Mon, 08 Dec 2003 05:52:25 -0800, William Lee Irwin III wrote:
> Explicit load control is in order. 2.4 appears to work better in these
> instances because it victimizes one process at a time. It vaguely
> resembles load control with a random demotion policy (mmlist order is

Everybody I talked to seemed to assume that 2.4 does better due to the
way mapped pages are freed (i.e. swap_out in 2.4). While it is true
that the new VM as merged in 2.5.27 didn't exactly help with thrashing
performance, the main factors slowing 2.6 down were merged much later.

Have a look at the graph attached to this message to get an idea of
what I am talking about (x axis is kernel releases after 2.5.0, y axis
is time to complete each benchmark).

It is important to note that different workloads show different
thrashing behavior. Some changes in 2.5 improved one thrashing benchmark
and made another worse. However, 2.4 seems to do better than 2.6 across
the board, which suggests that some elements of the 2.4 VM are in fact
better for all types of thrashing.

> Other important aspects of load control beyond the demotion policy are
> explicit suspension the execution contexts of the process address
> spaces chosen as its victims, complete eviction of the process address

I implemented suspension during memory shortage for 2.6, and I had some
code for complete eviction as well. It definitely helped for some
benchmarks. There's one problem, though: latency. If a machine is
thrashing, a sysadmin won't appreciate having her shell suspended
when she tries to log in to correct the problem. I have some simple
criteria for selecting a process to suspend, but it's hard to get it
right every time (kind of like the OOM killer, although with less
damage from bad decisions).
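
To make the selection problem concrete, here is a rough user-space
approximation of the kind of criteria I mean -- pick the non-root task
with the largest RSS and SIGSTOP it. This is just an illustrative
sketch (the policy and thresholds are made up for the example), not
code from my patch; in the kernel the selection happens in the VM, but
the trade-offs are the same:

    /* suspend_victim.c: crude illustration of victim selection.
     * Walk /proc, skip root-owned tasks (the sysadmin's shell),
     * suspend the task with the largest resident set. */
    #include <ctype.h>
    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        pid_t victim = 0;
        long victim_rss = 0;

        if (!proc)
            return 1;
        while ((de = readdir(proc)) != NULL) {
            char path[64];
            struct stat st;
            FILE *f;
            long size, rss;

            if (!isdigit((unsigned char)de->d_name[0]))
                continue;
            snprintf(path, sizeof(path), "/proc/%s/statm", de->d_name);
            if (stat(path, &st) != 0 || st.st_uid == 0)
                continue;            /* gone, or owned by root: skip */
            f = fopen(path, "r");
            if (!f)
                continue;
            if (fscanf(f, "%ld %ld", &size, &rss) == 2 && rss > victim_rss) {
                victim = (pid_t)atol(de->d_name);
                victim_rss = rss;    /* statm counts pages */
            }
            fclose(f);
        }
        closedir(proc);
        if (victim)
            kill(victim, SIGSTOP);   /* resume later with SIGCONT */
        return 0;
    }

The real policy questions (when to suspend, when to resume, how to
avoid picking the process the user is waiting for) are exactly the
hard part; the RSS heuristic above is the trivial part.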

For workstations and most servers, latency is so important compared to
throughput that I began to wonder whether implementing suspension was
worth it at all. After benchmarking 2.4 vs 2.6, though, I suspected
that there must be plenty of room for improvement _before_ such drastic
measures become necessary. It makes little sense to add suspension to 2.6
if performance can be improved _without_ hurting latency. That's why
I shelved my work on suspension to find out and document exactly when
performance went down during 2.5.

> 2.4 does not do any of this.
>
> The effect of not suspending the execution contexts of the demoted
> process address spaces is that the victimized execution contexts thrash
> while trying to reload the memory they need to execute. The effect of
> incomplete demotion is essentially livelock under sufficient stress.
> Its memory scheduling to what extent it has it is RR and hence fair,
> but the various caveats above justify "does not do any of this",
> particularly incomplete demotion.

One thing you can observe with 2.4 is that one process may force another
process out. Say you have several instances of the same program which
all have the same working set size (i.e. requirements, not RSS) and
a constant rate of memory references in the code. If their current RSSs
differ, then some take more major faults and spend more time blocked than
others. In a thrashing situation, you can see the small RSSs shrink
to virtually zero, while the largest RSS grows even further --
the thrashing processes are stealing each other's pages while the one
which hardly ever faults keeps its complete working set in RAM. Bad for
fairness, but it can help throughput quite a bit. This effect is harder
to trigger in 2.6.
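
For reference, a minimal sketch of the kind of workload I mean (not my
actual benchmark; NPROC and WSS are placeholders you would size so the
combined working sets exceed RAM). Run it under memory pressure and
compare the majflt field in /proc/<pid>/stat across the children:

    /* thrash.c: N identical children, identical working sets,
     * constant-rate references. Under 2.4, watch the per-child
     * major fault counts diverge once the box starts to thrash. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROC 4                  /* placeholder: tune to your RAM */
    #define WSS   (64 << 20)         /* placeholder: 64 MB per child */
    #define PAGE  4096

    int main(void)
    {
        int i;

        for (i = 0; i < NPROC; i++) {
            if (fork() == 0) {
                char *buf = malloc(WSS);
                long pass, off;

                if (!buf)
                    _exit(1);
                memset(buf, 1, WSS);         /* fault the set in once */
                for (pass = 0; pass < 100; pass++)
                    for (off = 0; off < WSS; off += PAGE)
                        buf[off]++;          /* one touch per page */
                _exit(0);
            }
        }
        for (i = 0; i < NPROC; i++)
            wait(NULL);
        return 0;
    }

With identical programs like these, any lasting difference in fault
rates comes from the page replacement policy, which makes the
unfairness described above easy to spot.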

> So I predict that a true load control mechanism and policy would be
> both an improvement over 2.4 and would correct 2.6 regressions vs. 2.4
> on underprovisioned machines. For now, we lack an implementation.

I doubt that you can get performance anywhere near 2.4 just by adding
load control to 2.6 unless you measure throughput and nothing else --
otherwise latency will kill you. I am convinced the key is not in
_adding_ stuff, but _fixing_ what we have.

IMO the question is: How much do we care? Machines with tight memory are
not necessarily very concerned about paging (e.g. PDAs), and serious
servers rarely operate under such conditions: Admins tend to add RAM
when the paging load is significant.

If you don't care _that_ much about thrashing in Linux, just tell
people to buy more RAM. Computers are cheap, RAM even more so, 64-bit
is becoming affordable, and heavy paging sucks no matter how good the
paging mechanism is.

If you care enough to spend resources on the problem, look at the
major regressions in 2.5 and find out which were the consequence of
deliberate trade-offs and which were oversights that can be fixed or
mitigated without sacrificing what the respective 2.5 changes gained.
Obviously, regular testing with thrashing benchmarks would make lasting
major regressions like those in the 2.5 development series much less
likely in the future.

Additional load control mechanisms create new problems (latency,
increased complexity), so I think they should be a last resort, not
some method to paper over deficiencies elsewhere in the kernel.

Roger

Attachment: plot.png
Description: PNG image