Re: Disabling in-memory write cache for x86-64 in Linux II

From: Rob Landley
Date: Tue Nov 19 2013 - 22:16:41 EST


On 10/30/2013 07:01:52 AM, Mel Gorman wrote:
> We talked about this a few months ago but I still suspect that we will
> have to bite the bullet and tune based on "do not dirty more data than it
> takes N seconds to writeback" using per-bdi writeback estimations. It's
> just not that trivial to implement as the writeback speeds can change for
> a variety of reasons (multiple IO sources, random vs sequential etc).

Record "block writes finished this second" into an 8 entry ring buffer, with a flag saying "device was partly idle this period" so you can ignore those entries. Keep a high water mark, which should converge to the device's linear write capacity.
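A rough sketch of that bookkeeping, for illustration only. Every name here (wb_sample, wb_estimator, wb_record_second) is made up and is not an existing kernel interface, and it assumes something else counts completed block writes and calls in once per second:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SIZE 8  /* one entry per second, as suggested above */

/* Hypothetical types; not an existing kernel API. */
struct wb_sample {
    uint32_t blocks;      /* block writes finished during this second */
    bool     partly_idle; /* device ran out of work during this second */
};

struct wb_estimator {
    struct wb_sample ring[WB_RING_SIZE];
    unsigned int     head;       /* next slot to overwrite */
    uint32_t         high_water; /* most blocks completed in any one second */
};

/* Call once per second with the number of writes completed in that second. */
static void wb_record_second(struct wb_estimator *e, uint32_t blocks,
                             bool partly_idle)
{
    e->ring[e->head].blocks = blocks;
    e->ring[e->head].partly_idle = partly_idle;
    e->head = (e->head + 1) % WB_RING_SIZE;

    /*
     * Assumption: any second's completed count is a lower bound on what
     * the device can do, so even a partly idle second may raise the mark.
     */
    if (blocks > e->high_water)
        e->high_water = blocks;
}
```

The idle flag only matters later, when deciding which entries to average; the high water mark just tracks the best second seen.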

This gives you recent thrashing speed and max capacity, and some weighted average of the two lets you avoid queuing up 10 minutes of writes all at once, the way the 3.0 kernel would to a terabyte USB2 disk. (And then vim calls sync() and hangs...)

The first tricky bit is the high water mark, but it's not too bad. If the device reads and writes at the same rate you can populate it from the read rate, but even starting it at just one block should converge really fast because A) the round trip time should be well under a second, and B) if you're submitting more than one period's worth of data (say you can dirty enough to keep the disk busy for 2 seconds), then it'll queue up 2 blocks at a time, then 4, then 8, and increase exponentially until you hit the device's real capacity. (The high water mark is measured, so it won't overshoot.)
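To put a number on "converge really fast", here is a toy simulation, not kernel code. It assumes the pessimistic case of only one measurement per one-second period, a fixed device capacity in blocks per second, and a queue depth of two periods' worth of the current estimate. The function name wb_ramp_seconds is hypothetical:

```c
#include <stdint.h>

/*
 * Toy model: returns how many one-second periods it takes for a high
 * water mark starting at 1 block to reach the device's real capacity,
 * when 2 periods' worth of the current estimate is queued each second.
 */
static unsigned int wb_ramp_seconds(uint32_t capacity)
{
    uint32_t high_water = 1;
    unsigned int seconds = 0;

    while (high_water < capacity) {
        uint64_t queued = 2ull * high_water;

        /* The device completes what was queued, up to its capacity. */
        high_water = queued < capacity ? (uint32_t)queued : capacity;
        seconds++;
    }
    return seconds;
}
```

Under those assumptions a device that can finish 1000 blocks a second is reached in 10 periods (1, 2, 4, ... 512, 1000), and the measured value never exceeds the capacity.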

The second tricky bit is weighting the average, but presumably you count the high water mark as one entry, then add in all the "device did not actually go idle during this period" entries, and divide by the number of entries considered... Reasonable first guess?
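That first guess at the weighting could be sketched like this. The names (wb_sample, wb_estimate, wb_dirty_limit) and the "filled" count of valid ring entries are assumptions added for the example, and the dirty limit is simply the estimate times the N seconds from the quoted text:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SIZE 8

/* Hypothetical sample type; not an existing kernel structure. */
struct wb_sample {
    uint32_t blocks;      /* block writes finished during that second */
    bool     partly_idle; /* device went idle at some point that second */
};

/*
 * Average the high water mark (counted as one entry) with every ring
 * entry during which the device never went idle. "filled" is how many
 * ring slots hold real measurements so far.
 */
static uint32_t wb_estimate(const struct wb_sample *ring, unsigned int filled,
                            uint32_t high_water)
{
    uint64_t sum = high_water;
    unsigned int n = 1;

    for (unsigned int i = 0; i < filled && i < WB_RING_SIZE; i++) {
        if (ring[i].partly_idle)
            continue;  /* idle seconds understate the device's speed */
        sum += ring[i].blocks;
        n++;
    }
    return (uint32_t)(sum / n);
}

/* "Do not dirty more data than it takes N seconds to write back." */
static uint64_t wb_dirty_limit(uint32_t blocks_per_sec, unsigned int seconds)
{
    return (uint64_t)blocks_per_sec * seconds;
}
```

For example, busy seconds of 100 and 200 blocks plus a high water mark of 300 average to 200 blocks per second, so a 2 second budget allows 400 dirty blocks.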

Obvious optimizations: instead of recording the "disk went idle" flag in the ring buffer, just don't advance the ring buffer at the end of that second, but zero out the entry and re-accumulate it. That way the ring buffer should always have 7 seconds of measured activity, even if it's not necessarily recent. And of course you don't have to wake anything up when there was no I/O, so it's nicely quiescent when the system is...
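A sketch of that variant, with hypothetical names (wb_ring, wb_account_write, wb_mark_idle, wb_end_second). It assumes the completion path calls wb_account_write(), whatever notices the queue draining calls wb_mark_idle(), and a once-per-second tick calls wb_end_second():

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SIZE 8

/* Hypothetical structure; not an existing kernel API. */
struct wb_ring {
    uint32_t     blocks[WB_RING_SIZE];
    unsigned int cur;       /* slot currently accumulating */
    bool         went_idle; /* queue drained at some point this second */
};

static void wb_account_write(struct wb_ring *r, uint32_t blocks)
{
    r->blocks[r->cur] += blocks;
}

static void wb_mark_idle(struct wb_ring *r)
{
    r->went_idle = true;
}

static void wb_end_second(struct wb_ring *r)
{
    /* Only a fully busy second is kept; otherwise reuse the same slot. */
    if (!r->went_idle)
        r->cur = (r->cur + 1) % WB_RING_SIZE;

    /* Either discard the idle measurement or clear the oldest entry. */
    r->blocks[r->cur] = 0;
    r->went_idle = false;
}
```

With this layout the slot at cur is always in progress and the other seven hold completed busy seconds, so no per-entry flag is needed.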

Lowering the high water mark after a transient spurious reading (maybe clock skew during suspend, or a virtualization glitch, or some such) is fun, since a glitch could give you a 4 billion block bad reading. But if you always decrement the high water mark by 25% (x-=(x>>2), so the result rounds up) each second the disk didn't go idle, and then queue up more than one period's worth of data (but no more than say 8 seconds' worth), such glitches should fix themselves and it'll work its way back up or down to a reasonably accurate value. (Keep in mind you're averaging the high water mark back down with 7 seconds of measured data from the ring buffer. Maybe you can cap the high water mark at the sum of all the measured values in the ring buffer as an extra check? You're already calculating it to do the average, so...)
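The decay and the optional cap might look roughly like this. The function name and the order of the three steps (decay, then raise to the latest measurement, then cap) are assumptions made for the sketch; pass UINT64_MAX as ring_sum to skip the cap:

```c
#include <stdint.h>

/*
 * Hypothetical helper, called at the end of each second in which the
 * disk never went idle. "measured" is that second's completed block
 * count; "ring_sum" is the sum of the measured ring buffer entries.
 */
static uint32_t wb_decay_high_water(uint32_t high_water, uint32_t measured,
                                    uint64_t ring_sum)
{
    /* Drop 25%; x -= x >> 2 leaves the remainder rounded up. */
    high_water -= high_water >> 2;

    /* A real measurement can always push the mark back up. */
    if (measured > high_water)
        high_water = measured;

    /* Extra sanity check: never exceed everything measured recently. */
    if (high_water > ring_sum)
        high_water = (uint32_t)ring_sum;

    return high_water;
}
```

With the cap in place, a bogus 4 billion block reading collapses to the ring buffer sum on the next busy second instead of taking many seconds of 25% decay to come back down.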

This is assuming your hard drive _itself_ doesn't have bufferbloat, but http://spritesmods.com/?art=hddhack&f=rss implies they don't, and tagged command queueing lets you see through that anyway so your "actually committed" numbers could presumably still be accurate if the manufacturers aren't totally lying.

Given how far behind I am on my email, I assume somebody's already suggested this by now. :)

Rob