On Wed 26-03-14 22:55:18, Andres Freund wrote:
> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
> > On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund <andres@xxxxxxxxxxx> wrote:
> > > Hi,
> > >
> > > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > > top#1 concern is frequent slow read()s that happen while another process
> > > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > > ages.
> > >
> > > The "conclusion" of that part was that it'd be very useful to have a
> > > demonstration of the problem without needing a full blown postgres
> > > setup. I've quickly hacked something together, that seems to show the
> > > problem nicely.
> > >
> > > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > > and the "IO Scheduling" bit in
> > > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
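
A minimal sketch of the kind of reproducer described above - a writer that
keeps dirtying a large file and fsync()ing it while several readers time
small random reads - might look roughly like this. The file name, the sizes
and the DATA_SIZE/NUM_RANDOM_READERS values are assumptions for
illustration, not the actual posted program:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define DATA_SIZE          (8ULL * 1024 * 1024 * 1024) /* assumed ~2x RAM */
#define BLOCK_SIZE         8192
#define NUM_RANDOM_READERS 16

static double now_sec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* Readers: small random reads, complain when one takes too long. */
static void reader(int fd)
{
    char buf[BLOCK_SIZE];

    for (;;) {
        off_t block = random() % (DATA_SIZE / BLOCK_SIZE);
        double start = now_sec();

        pread(fd, buf, sizeof(buf), block * BLOCK_SIZE);
        if (now_sec() - start > 0.1)
            printf("slow read: %.0f ms\n", (now_sec() - start) * 1000);
    }
}

/* Writer: dirty lots of random blocks, then fsync() the whole file. */
static void writer(int fd)
{
    char buf[BLOCK_SIZE];

    memset(buf, 'x', sizeof(buf));
    for (;;) {
        for (int i = 0; i < 100000; i++) {
            off_t block = random() % (DATA_SIZE / BLOCK_SIZE);

            pwrite(fd, buf, sizeof(buf), block * BLOCK_SIZE);
        }
        fsync(fd);  /* concurrent reads stall while this is in flight */
    }
}

int main(void)
{
    int fd = open("testfile", O_RDWR); /* pre-created, DATA_SIZE bytes */

    for (int i = 0; i < NUM_RANDOM_READERS; i++)
        if (fork() == 0)
            reader(fd);
    writer(fd);
    return 0;
}

The point of the pattern is that each reader only ever has one small request
outstanding, while every fsync() pushes a large batch of writes at the
device at once.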
> >
> > For your amusement: running this program in KVM on a 2GB disk image
> > failed, but it caused the *host* to go out to lunch for several
> > seconds while failing. In fact, it seems to have caused the host to
> > fall over so badly that the guest decided that the disk controller was
> > timing out. The host is btrfs, and I think that btrfs is *really* bad
> > at this kind of workload.
>
> Also, unless you changed the parameters, it's a) using a 48GB disk file,
> and writes really rather fast ;)
>
> > Even using ext4 is no good. I think that dm-crypt is dying under the
> > load. So I won't test your program for real :/
>
> Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
> smaller. If it still doesn't work consider increasing the two nsleep()s...
> I didn't have a good idea how to scale those to the current machine in a
> halfway automatic fashion.

That's not necessary. If we have a guidance like above, we can figure it
out ourselves (I hope ;).
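
A sketch of how the sizing guidance above could be applied automatically,
assuming data_size is an ordinary variable inside the test program (only
sysconf() is Linux-specific; this is an illustration, not part of the
posted program):

#include <unistd.h>

/* Pick data_size as twice the machine's RAM, per the guidance above. */
static long long detect_data_size(void)
{
    long long ram = (long long)sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGE_SIZE);

    return ram * 2;
}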

> > > Possible solutions:
> > > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
> > >   sync_file_range() does.
> > > * Make IO triggered by writeback regard IO priorities and add it to
> > >   schedulers other than CFQ
> > > * Add a tunable that allows limiting the amount of dirty memory before
> > >   writeback on a per process basis.
> > > * ...?
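
For context on the first bullet: the closest existing interface is
sync_file_range(), roughly as sketched below (illustrative, not postgres
source). SYNC_FILE_RANGE_WRITE only initiates writeback of the given range,
but the call still blocks once the device's request queue is congested,
which is exactly the stall a fadvise(UNDIRTY) would be meant to avoid:

#define _GNU_SOURCE
#include <fcntl.h>

/* Ask the kernel to start writing back a dirty range without waiting for
 * completion. Can still block if the request queue is already full. */
static void hint_writeback(int fd, off_t offset, off_t nbytes)
{
    sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}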
> >
> > I thought the problem wasn't so much that priorities weren't respected
> > but that the fsync call fills up the queue, so everything starts
> > contending for the right to enqueue a new request.
>
> I think it's both actually. If I understand correctly there's not even a
> correct association to the originator anymore during a fsync triggered
> flush?

There is. The association is lost for background writeback (and sync(2)
for that matter) but IO from fsync(2) is submitted in the context of the
process doing fsync.

What I think happens here is a problem of 'dependent sync IO' vs
'independent sync IO'. Reads are an example of dependent sync IO: you
submit a read, need it to complete, and only then can you submit the next
read. OTOH fsync is an example of independent sync IO: you fire off tons
of IO to the drive and then wait for all of it to finish. Since we treat
both these types of IO in the same way, it can easily happen that the
independent sync IO starves out the dependent one (you execute, say, 100
IO requests for the fsync and 1 IO request for the read). We've seen
problems like this in the past.
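
To make the distinction concrete, a rough sketch of the two submission
patterns (purely illustrative; the record layout is made up):

#define _GNU_SOURCE
#include <unistd.h>

struct node { long long next_offset; char payload[8184]; };

/* Dependent sync IO: each read has to complete before the next one can even
 * be issued (e.g. following a chain of records), so at most one request is
 * in flight at any time. */
static void chase_chain(int fd, long long start)
{
    struct node n;
    long long next = start;

    while (next != -1) {
        pread(fd, &n, sizeof(n), next);   /* wait for this read ... */
        next = n.next_offset;             /* ... to learn what to read next */
    }
}

/* Independent sync IO: dirty many blocks first, then fsync() submits a large
 * batch of writes and waits for all of them, easily filling the device queue
 * while a reader like the one above is stuck behind it. */
static void bulk_write_and_sync(int fd, const char *buf, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
        pwrite(fd, buf, 8192, (long long)i * 8192);
    fsync(fd);
}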

I'll have a look into your test program and, if my feeling is indeed
correct, I'll have a look into what we could do in the block layer to fix
this (and poke the block layer guys - they had some preliminary patches
that tried to address this, but it didn't go anywhere).