Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Tejun Heo
Date: Wed Jun 17 2015 - 14:52:54 EST
On Tue, Jun 16, 2015 at 11:15:40PM -0400, Theodore Ts'o wrote:
> Hmm, while we're at it, there's another priority inversion that can be
> painful. If a block directory has been pushed out of memory (possibly
> because it was initially accessed by a cgroup with a very tiny amount
> of memory allocated to its cgroup) and a process with a cgroup tries
At scale, this is self-correcting to a certain extent in that if the
inode is actually something shared across cgroups, it'll most likely
end up in a cgroup which has enough resource to keep it in memory.
This doesn't prevent one-off hiccups but it at least shouldn't develop
into a systematic and chronic issue.
> to do a lookup in that directory, it will issue the read with such a
> tightly constrained disk time that it might take minutes for the read
> to complete. The problem is that the VFS has locked the directory's
> i_mutex *before* calling ext4_lookup().
> If a high priority process then tries to read the same directory, or
> in fact any VFS operation which requires taking the directory's
> i_mutex first, including renaming the directory, the high priority
> process will end up blocking until the read is completed --- which can
> be minutes if the low priority process has a tiny amount of disk time
> allocated to it.
> There is a related problem where if a read for a particular block is
> issued with a very low amount of disk time, and that same
> block is required by a high priority process, we can also get hit with
> a very similar priority inversion problem.
> To date the answer has always been, "Doctor, Doctor it hurts when I do
> that...." The only way I can think of fixing the directory mutex
In a lot of use cases, the directories accessed by different cgroups
are fairly segregated, so this hopefully shouldn't happen too often, but
yeah, it can be painful in sharing cases.
> problem is by returning an error code to the VFS layer which instructs
> it to unlock the directory, and then have it wait on some wait channel
> so it ends up calling the lookup after the directory block has been
> read into memory (and we can hope that due to a tight memory cgroup
> the block doesn't end up getting ejected from memory right away).
> As another solution for another part of the problem, if a high
> priority process attempts a read and the I/O is already queued up, but
> it's at the back of the bus because it was originally posted by a low
> priority cgroup, the rest of the fix would be to elevate the priority
> of said I/O request and then resort the queue.
> As far as the filemap_fdatawait() call is concerned, if it's being
> called by fsync() run by a low priority process, or from the writeback
> thread, then it can certainly take place at a low priority. But if the
> filemap_fdatawait() is being done by a high priority process, such as
> a jbd/jbd2 thread, then there needs to be a way that we can set a flag
> in the wbc structure indicating that the writes should be submitted as
> if they were issued from the kernel thread, and not based on who
> originally dirtied the page.
Hmmm... so, overriding things *before* a bio is issued shouldn't be
too difficult and as long as these sorts of operations aren't prevalent
we might be able to get away with just charging them against root.
Especially if it's to avoid getting blocked on the journal which we
already consider a shared overhead which is charged to root. If this
becomes large enough to require exacting charges, it'll be more
complex but still way better than trying to raise priority on a bio
which is already issued, which is likely to be excruciatingly painful
if possible at all.
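
Purely as an illustration of the "override before issue" idea, here is
a userspace sketch, not kernel code. Every name in it (io_cgroup,
wb_request, charge_root, pick_issuer, submit_write) is hypothetical and
made up for this example; the only point is that the charge target is
decided once, at submission time, and never changed afterwards.

```c
/* Userspace sketch only; none of these names are real kernel symbols. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup that I/O can be charged to. */
struct io_cgroup {
    const char *name;
    unsigned long charged_bytes;
};

/* Hypothetical stand-in for the writeback control structure. */
struct wb_request {
    struct io_cgroup *dirtier; /* cgroup that originally dirtied the page */
    bool charge_root;          /* assumed flag: issue as if from a kernel thread */
};

static struct io_cgroup root_cgroup = { "root", 0 };

/* Decide, before submission, which cgroup the write is issued under. */
static struct io_cgroup *pick_issuer(const struct wb_request *req)
{
    assert(req != NULL);
    if (req->charge_root || req->dirtier == NULL)
        return &root_cgroup;
    return req->dirtier;
}

/* Charge the write to the chosen cgroup; the choice is not revisited later. */
static struct io_cgroup *submit_write(const struct wb_request *req,
                                      unsigned long bytes)
{
    struct io_cgroup *cg = pick_issuer(req);

    cg->charged_bytes += bytes;
    return cg;
}
```

A waiter such as a journal thread would set charge_root before
submitting, so the write never enters the low priority cgroup's queue
in the first place and nothing needs to be boosted after the fact.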
> It's going to be a number of point solutions, which is a bit ugly, but
> I think that is much more likely to be successful than trying to
> implement, say, a generalized priority inheritance scheme for block
> I/O requests and related locks. :-)
I agree that a generalized priority inheritance mechanism would be
massive overkill. I think as long as we can avoid boosting bios
which have already been issued, things should be relatively sane.
Hopefully, we'd be able to figure out solutions for the worst
offenders within these constraints.