Re: [PATCH 3/5] writeback: stop background/kupdate works from livelocking other works

From: Jan Kara
Date: Tue Nov 09 2010 - 18:56:41 EST


On Tue 09-11-10 15:00:06, Andrew Morton wrote:
> On Tue, 9 Nov 2010 23:28:27 +0100
> Jan Kara <jack@xxxxxxx> wrote:
> > New description which should address above questions:
> > Background writeback is easily livelockable in a loop in wb_writeback() by
> > a process continuously re-dirtying pages (or continuously appending to a
> > file). This is in fact intended, since the goal of background writeback is
> > to write any dirty pages it can find as long as we are over
> > dirty_background_threshold.
>
> Well. The objective of the kupdate function is utterly different.
>
> > But the above behavior gets inconvenient at times because no other work
> > queued in the flusher thread's queue gets processed. In particular,
> > since e.g. sync(1) relies on the flusher thread to do all the IO for it,
>
> That's fixable by doing the work synchronously within sync_inodes_sb(),
> rather than twiddling thumbs wasting a thread resource while waiting
> for kernel threads to do it. As an added bonus, this even makes cpu
> time accounting more accurate ;)
>
> Please remind me why we decided to hand the sync_inodes_sb() work off
> to other threads?
Because when sync(1) does IO on its own, it competes for the device with
the flusher thread running in parallel, thus resulting in more seeks.
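
For reference, the hand-off you are asking about: sync_inodes_sb() just
queues a WB_SYNC_ALL work item on the bdi and waits for the flusher thread
to complete it. A simplified sketch (field names as in the current
fs/fs-writeback.c; details may differ between versions):

	void sync_inodes_sb(struct super_block *sb)
	{
		DECLARE_COMPLETION_ONSTACK(done);
		struct wb_writeback_work work = {
			.sb		= sb,
			.sync_mode	= WB_SYNC_ALL,
			.done		= &done,
			/* other fields omitted in this sketch */
		};

		/* hand the whole job to the per-bdi flusher thread ... */
		bdi_queue_work(sb->s_bdi, &work);
		/* ... and just wait until the thread has processed it */
		wait_for_completion(&done);
		wait_sb_inodes(sb);
	}

So doing the IO directly in sync_inodes_sb() would certainly be possible,
it just brings back the seek problem above.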

> > sync(1) can hang forever waiting for the flusher thread to do the work.
> >
> > Generally, when a flusher thread has some work queued, someone submitted
> > the work to achieve a goal more specific than what background writeback
> > does. Moreover, by working on the specific work, we also reduce the amount
> > of dirty pages, which is exactly the target of background writeout. So it
> > makes sense to give the specific work priority over generic page cleaning.
> >
> > Thus we interrupt background writeback if there is some other work to do. We
> > return to the background writeback after completing all the queued work.
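
Concretely, the interruption amounts to a check of roughly this shape at
the top of the retry loop in wb_writeback() - an illustrative sketch, not
necessarily the exact patch hunk:

	/*
	 * Background and kupdate-style writeback may run forever.
	 * Stop them if there is other work to do so that e.g. sync(1)
	 * can make progress; they get restarted once the other works
	 * in wb->bdi->work_list are done.
	 */
	if ((work->for_background || work->for_kupdate) &&
	    !list_empty(&wb->bdi->work_list))
		break;
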
> >
...
> > > So... what prevents higher priority works (e.g., sync(1)) from
> > > livelocking or seriously retarding background or kupdate writeout?
> > If work other than background or kupdate writeout livelocks, it's a bug
> > which should be fixed (either by setting a sensible nr_to_write or by
> > tagging like we do for WB_SYNC_ALL writeback). Of course, higher priority
> > work can be running when background or kupdate writeout would need to run
> > as well. But the idea here is that the purpose of the background/kupdate
> > types of writeout is to get rid of dirty data, and any type of writeout
> > does that, so by working on the queued work we also do the
> > background/kupdate job, only possibly less efficiently.
>
> The kupdate function is a data-integrity/quality-of-service sort of
> thing.
>
> And what I'm asking is whether this change enables scenarios in which
> these threads can be kept so busy that the kupdate function gets
> interrupted so frequently that we can have dirty memory not being
> written back for arbitrarily long periods of time?
So let me compare:
What kupdate writeback does:
  queue inodes older than dirty_expire_centisecs
  while some inode in the queue
    write MAX_WRITEBACK_PAGES from each inode queued
    break if nr_to_write <= 0

What any other WB_SYNC_NONE writeback (let me call it "normal WB_SYNC_NONE
writeback") does:
  queue all dirty inodes
  while some inode in the queue
    write MAX_WRITEBACK_PAGES from each inode queued
    break if nr_to_write <= 0
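
In code the only difference is that first step: kupdate writeback sets
wbc.older_than_this so that queue_io() moves only the expired inodes.
Sketched from the current wb_writeback() (details may vary):

	if (wbc.for_kupdate) {
		/* queue only inodes dirtied before dirty_expire_centisecs */
		oldest_jif = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
		wbc.older_than_this = &oldest_jif;
	}
	/* otherwise older_than_this stays NULL and all dirty inodes get queued */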


There is only one kind of WB_SYNC_ALL writeback - the one which writes
everything.

So after WB_SYNC_ALL writeback (provided all livelocks are fixed ;) obviously
no old data should be left unwritten in memory. Normal WB_SYNC_NONE
writeback differs from a kupdate one *only* in the fact that we queue all
inodes instead of only the old ones. We start writing old inodes first and
go inode by inode writing MAX_WRITEBACK_PAGES from each. Now because the
queue can be longer for normal WB_SYNC_NONE writeback, it can take longer
before we return to the old inodes. So if normal writeback interrupts a
kupdate one, it can take longer before all the data of the old inodes gets
to disk. But we always get the old data to disk - essentially at the same
time at which kupdate writeback would get it to disk if
dirty_expire_centisecs were 0.

Is this enough? Do you want any of this in the changelog?

Thanks for the inquiry, btw. It made me clean up my thoughts on the subject ;)

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--