Re: Disabling in-memory write cache for x86-64 in Linux II
From: Rob Landley
Date: Tue Nov 19 2013 - 22:16:41 EST
On 10/30/2013 07:01:52 AM, Mel Gorman wrote:
> We talked about this a few months ago but I still suspect that we will
> have to bite the bullet and tune based on "do not dirty more data than
> it takes N seconds to writeback" using per-bdi writeback estimations.
> It's just not that trivial to implement as the writeback speeds can
> change for a variety of reasons (multiple IO sources, random vs
> sequential etc).
Record "block writes finished this second" into an 8-entry ring buffer,
with a flag saying "device was partly idle this period" so you can
ignore those entries. Keep a high water mark, which should converge to
the device's linear write capacity.
This gives you recent thrashing speed and max capacity, and some
weighted average of the two lets you avoid queuing up 10 minutes of
writes all at once like 3.0 would to a terabyte USB2 disk. (And then
vim calls sync() and hangs...)
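
A minimal user-space C sketch of the bookkeeping described above. The names (wb_sample, wb_estimator, wb_init, wb_record_period) are hypothetical and not kernel APIs, and the one-second period is assumed to be driven by some external timer. Letting a partly idle period raise the high water mark is also an assumption: such a reading is still a lower bound on what the device can do.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SIZE 8            /* one slot per one-second period */

struct wb_sample {
    uint32_t blocks;              /* block writes finished this period */
    bool partly_idle;             /* device was partly idle: ignore entry */
};

struct wb_estimator {
    struct wb_sample ring[WB_RING_SIZE];
    unsigned int head;            /* slot the next period will overwrite */
    uint32_t high_water;          /* largest per-period count seen so far */
};

static void wb_init(struct wb_estimator *e)
{
    for (unsigned int i = 0; i < WB_RING_SIZE; i++) {
        e->ring[i].blocks = 0;
        e->ring[i].partly_idle = true;   /* unfilled slots get ignored */
    }
    e->head = 0;
    e->high_water = 1;            /* start from a single block */
}

/* Call once at the end of each period. */
static void wb_record_period(struct wb_estimator *e, uint32_t blocks,
                             bool partly_idle)
{
    e->ring[e->head].blocks = blocks;
    e->ring[e->head].partly_idle = partly_idle;
    e->head = (e->head + 1) % WB_RING_SIZE;

    /* Assumption: any reading above the mark raises it. */
    if (blocks > e->high_water)
        e->high_water = blocks;
}
```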
The first tricky bit is the high water mark, but it's not too bad. If
the device reads and writes at the same rate you can populate it from
the read rate, but even starting it with just one block should converge
really fast because A) the round trip time should be well under a
second, and B) if you're submitting more than one period's worth of
data (say you can dirty enough to keep the disk busy for 2 seconds),
then it'll queue up 2 blocks at a time, then 4, then 8, and increase
exponentially until you hit the high water mark. (Which is measured, so
it won't overshoot.)
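
To make the convergence argument concrete, here is a toy C model. It assumes a hypothetical device that finishes at most `capacity` blocks per period, and a submitter that always queues two periods' worth of the current high water mark. Both functions are illustrative only.

```c
#include <stdint.h>

/* Hypothetical device model: finishes at most `capacity` blocks per period. */
static uint64_t simulate_period(uint64_t queued, uint64_t capacity)
{
    return queued < capacity ? queued : capacity;
}

/*
 * Count the periods needed for a high water mark that starts at one
 * block to reach the device's real capacity, when two periods' worth
 * of data is queued each time.
 */
static unsigned int periods_to_converge(uint32_t capacity)
{
    uint64_t high_water = 1;
    unsigned int periods = 0;

    while (high_water < capacity) {
        uint64_t queued = 2 * high_water;
        uint64_t done = simulate_period(queued, capacity);

        if (done > high_water)
            high_water = done;    /* measured, so it cannot overshoot */
        periods++;
    }
    return periods;
}
```

In this model the mark doubles every period, so the number of periods grows with the base-2 logarithm of the capacity.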
The second tricky bit is weighting the average, but presumably counting
the high water mark as one, then adding in all the "device did not
actually go idle during this period" entries, and dividing by the
number of entries considered... Reasonable first guess?
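
That first guess at the weighting could look like the following C sketch. The wb_sample layout and the wb_estimate name are hypothetical; the arithmetic is just the one described above: the high water mark counts as one entry, partly idle entries are skipped, and the sum is divided by the number of entries considered.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct wb_sample {
    uint32_t blocks;      /* block writes finished in that period */
    bool partly_idle;     /* device was partly idle: skip this entry */
};

/* Average the high water mark with every fully busy ring entry. */
static uint32_t wb_estimate(const struct wb_sample *ring, size_t n,
                            uint32_t high_water)
{
    uint64_t sum = high_water;    /* high water mark counts as one entry */
    size_t considered = 1;

    for (size_t i = 0; i < n; i++) {
        if (ring[i].partly_idle)
            continue;
        sum += ring[i].blocks;
        considered++;
    }
    return (uint32_t)(sum / considered);
}
```

With a high water mark of 100 and busy entries of 60 and 80, this gives (100 + 60 + 80) / 3 = 80 blocks per period.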
Obvious optimizations: instead of recording the "disk went idle" flag
in the ring buffer, just don't advance the ring buffer at the end of
that second, but zero out the entry and re-accumulate it. That way the
ring buffer should always have 7 seconds of measured activity, even if
it's not necessarily recent. And of course you don't have to wake
anything up when there was no I/O, so it's nicely quiescent when the
system is...
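
A sketch of that optimization in C, with hypothetical names (wb_ring, wb_account, wb_end_period). The idle flag is no longer stored. A period in which the device went idle is zeroed and re-accumulated in place, so the other seven slots always hold fully busy periods.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_RING_SIZE 8

struct wb_ring {
    uint32_t blocks[WB_RING_SIZE];
    unsigned int cur;     /* slot accumulating the current period */
};

/* Called as block writes complete. */
static void wb_account(struct wb_ring *r, uint32_t completed)
{
    r->blocks[r->cur] += completed;
}

/* Called at the end of a period in which there was some I/O. */
static void wb_end_period(struct wb_ring *r, bool went_idle)
{
    /* A busy period is kept; an idle one is thrown away and redone. */
    if (!went_idle)
        r->cur = (r->cur + 1) % WB_RING_SIZE;
    r->blocks[r->cur] = 0;
}
```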
Lowering the high water mark after a transient spurious reading is fun.
(Maybe clock skew during suspend or a virtualization glitch or some such
could give you a 4 billion block bad reading.) But if you always
decrement the high water mark by 25% (x-=(x>>2), rounding up) each
second the disk didn't go idle, and then queue up more than one
period's worth of data (but no more than say 8 seconds' worth), such
glitches should fix themselves and it'll work its way back up or down
to a reasonably accurate value. (Keep in mind you're averaging the high
water mark back down with 7 seconds of measured data from the ring
buffer. Maybe you can cap the high water mark at the sum of all the
measured values in the ring buffer as an extra check? You're already
calculating it to do the average, so...)
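
A possible C sketch of the decay-and-cap step, run once per second in which the disk didn't go idle. The function name and parameters are hypothetical. Note that x - (x >> 2) equals 3x/4 rounded up, so the shift form already rounds in the intended direction. The floor of one block is an added assumption, so that ramp-up can restart while the ring is still empty.

```c
#include <stdint.h>

/*
 * high_water: current mark
 * measured:   blocks finished in the period that just ended
 * ring_sum:   sum of all measured values in the ring buffer
 */
static uint32_t wb_decay_high_water(uint32_t high_water, uint32_t measured,
                                    uint64_t ring_sum)
{
    high_water -= high_water >> 2;    /* drop 25%; result rounds up */

    if (measured > high_water)
        high_water = measured;        /* real throughput pulls it back up */

    if (high_water > ring_sum)
        high_water = (uint32_t)ring_sum;   /* extra sanity cap */

    if (high_water == 0)
        high_water = 1;               /* assumption: never below one block */

    return high_water;
}
```

With the cap, a bogus 4 billion block reading is cut to the ring sum on the next busy period. Without it, the 25% decay alone would take several dozen periods to come back down.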
This is assuming your hard drive _itself_ doesn't have bufferbloat, but
http://spritesmods.com/?art=hddhack&f=rss implies they don't, and
tagged command queueing lets you see through that anyway so your
"actually committed" numbers could presumably still be accurate if the
manufacturers aren't totally lying.
Given how far behind I am on my email, I assume somebody's already
suggested this by now. :)
Rob