What do you do when swap is faster than disk?

From: John Moser
Date: Mon Feb 27 2012 - 22:03:53 EST


I got into a weird situation today.

I limited my system RAM and used zram to make a swap device, then applied memory pressure. The result was about 50MB of disk cache and an -extremely- slow system with lots and lots and LOTS of hard disk activity.

I found that with around 130MB of disk cache the system was fine, and with around 200MB it was extremely fast. I also found that it's extremely difficult to make the OS keep 200MB of disk cache around on 2GB of RAM when you have 600MB swapped out.

That raises some questions about whether the tunables for this are adequate. I can't very well set vm.swappiness to 150, and even at 100 it's not really helpful. When you have more than a quarter of your RAM in swap, the tunable seems to have almost no impact.
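
To put numbers on a situation like this, here is a minimal sketch that reads /proc/meminfo-style text and reports how much disk cache is left versus how much is swapped out. The field names (MemTotal, Cached, SwapTotal, SwapFree) are the standard /proc/meminfo ones; the helper names and the SAMPLE values are made up for illustration and only roughly mirror the case described above (2GB RAM, 50MB cache, 600MB in swap).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_and_swap_summary(fields):
    """Summarize disk cache size and swap usage (hypothetical helper)."""
    swap_used_kb = fields["SwapTotal"] - fields["SwapFree"]
    return {
        "cache_mb": fields["Cached"] / 1024,
        "swap_used_mb": swap_used_kb / 1024,
        # Fraction of physical RAM's worth of data currently swapped out.
        "swap_to_ram_ratio": swap_used_kb / fields["MemTotal"],
    }


# Illustrative values only, not measurements from the original test.
SAMPLE = """MemTotal:        2097152 kB
Cached:            51200 kB
SwapTotal:       1048576 kB
SwapFree:         434176 kB
"""

summary = cache_and_swap_summary(parse_meminfo(SAMPLE))
```

On a live Linux system the same functions could be fed `open("/proc/meminfo").read()` instead of SAMPLE.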

It's also come to my mind that the kernel could, possibly, attempt some speed testing against the devices and determine how fast they are. This could be as simple as only worrying about it under pressure: split swap among all swap devices and work out throughput and latency for each. Then prioritize them such that, under pressure, very old data in a very fast swap device gets swapped out to a slower swap device.
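
As a rough userspace illustration of that ranking idea (not anything the kernel does today), the sketch below orders swap devices by an estimated per-page cost and picks the next-slower device as the place to demote old data. The class, the function names, the cost model (latency plus transfer time for one 4kB page), and the device figures in the usage lines are all assumptions made for the example. Priorities follow the swapon convention that a higher number is used first.

```python
from dataclasses import dataclass


@dataclass
class SwapDeviceStats:
    name: str
    throughput_mb_s: float  # measured sequential throughput (assumed input)
    latency_ms: float       # measured per-request latency (assumed input)


def cost_ms_per_page(dev, page_kb=4):
    """Estimated time to fetch one page: latency plus transfer time."""
    transfer_ms = (page_kb / 1024) / dev.throughput_mb_s * 1000
    return dev.latency_ms + transfer_ms


def assign_priorities(devices):
    """Fastest device gets the highest priority number."""
    ranked = sorted(devices, key=cost_ms_per_page)
    top = len(ranked) - 1
    return {dev.name: top - i for i, dev in enumerate(ranked)}


def demotion_target(devices, current_name):
    """Next-slower device to push very old pages to, or None if already slowest."""
    names = [d.name for d in sorted(devices, key=cost_ms_per_page)]
    i = names.index(current_name)
    return names[i + 1] if i + 1 < len(names) else None


# Hypothetical measurements for a zram device and a hard disk partition.
devices = [
    SwapDeviceStats("disk", throughput_mb_s=80.0, latency_ms=8.0),
    SwapDeviceStats("zram0", throughput_mb_s=1000.0, latency_ms=0.01),
]
priorities = assign_priorities(devices)
target = demotion_target(devices, "zram0")
```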

Disk cache could be ranked against the whole thing too, to decide just how important disk cache is--and how much time is being spent re-loading flushed cache versus swapping. You could set aside some information about what used to be in RAM and push it out as it gets old: when a particular area of cache 40MB wide is flushed, a 16-byte structure somewhere in RAM makes note of that. Once more time has been spent swapping than it would take to read that 40MB back in, drop the note. If the area is read back in before then, make note that too much disk cache flushing is happening and not enough swapping is going on.
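
The bookkeeping described above might look something like the following toy model. It is only a sketch under stated assumptions: the class and method names are invented, regions are identified by an arbitrary key, and the time to re-read a region is estimated as its size divided by an assumed disk read rate. A real implementation would live in the kernel and look nothing like this.

```python
from dataclasses import dataclass


@dataclass
class GhostRecord:
    size_mb: float
    reload_cost_s: float                  # estimated time to read the region back
    swap_time_since_flush_s: float = 0.0  # swap time accumulated since the flush


class CacheFlushTracker:
    def __init__(self, disk_read_mb_s):
        self.disk_read_mb_s = disk_read_mb_s  # assumed disk read rate
        self.ghosts = {}
        self.premature_flushes = 0

    def note_flush(self, region_id, size_mb):
        """Remember that a region of cache was just flushed."""
        cost = size_mb / self.disk_read_mb_s
        self.ghosts[region_id] = GhostRecord(size_mb, cost)

    def note_swap_time(self, seconds):
        """Charge swap time to every record; drop records that have aged out."""
        for region_id in list(self.ghosts):
            ghost = self.ghosts[region_id]
            ghost.swap_time_since_flush_s += seconds
            if ghost.swap_time_since_flush_s > ghost.reload_cost_s:
                del self.ghosts[region_id]

    def note_read(self, region_id):
        """Return True if this read re-loads a region flushed too early."""
        if self.ghosts.pop(region_id, None) is not None:
            self.premature_flushes += 1
            return True
        return False


tracker = CacheFlushTracker(disk_read_mb_s=80.0)
tracker.note_flush("a", 40)           # 40MB at 80MB/s: 0.5s to re-read
tracker.note_swap_time(0.3)
reloaded_early = tracker.note_read("a")
```

A growing `premature_flushes` count would be the signal that cache is being flushed too eagerly and swapping should be favored instead.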

This is more complex than it sounds, though, because you also have to consider reading things into cache. Eventually you have to invalidate disk cache to make room for more cache, after all. That or swap out even more. So yes I understand this is hard.

Just thought I'd mention that the problem seems to be more complex than it's credited for at the moment. (In my test case, simply locking 200MB for disk cache would have been fine... much better than what actually happened!)