Re: [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state

From: Steven Whitehouse
Date: Tue Jun 21 2011 - 05:58:37 EST


Hi,

On Mon, 2011-06-20 at 16:09 -0400, Christoph Hellwig wrote:
> On Sun, Jun 19, 2011 at 11:01:15PM +0800, Wu Fengguang wrote:
> > When there are only one (or several) dirtiers, dirty_exceeded is always
> > (or mostly) off. Converting to timestamp avoids this problem. It helps
> > to use smaller write_chunk for smoother throttling.
>
> In current mainline gfs2 has grown a non-trivial reference to
> backing_dev_info.dirty_exceeded, which needs to be dealt with.
>
So let me try and explain what's going on there... the basic issue is
that writeback is done on a per-inode basis, but pages are accounted for
on a per-address-space basis.

In GFS2, glocks referring to inodes and rgrps (resource groups) both
have an address space associated with them. These address spaces contain
the metadata that would normally be in the block device address space,
but have been separated so that we can sync and/or invalidate metadata
easily on a per-inode basis. Note that we have the additional
requirement to be able to track clean metadata, which is why the existing
per-inode list of dirty metadata doesn't work for GFS2. Due to the
lifetime rules for the glocks, and the lack of an inode for rgrps, the
mapping->host for the glock address spaces has to point at the block
device inode.

Now in the normal inode case, that isn't a problem - writeback calls
->write_inode which can then write out the dirty metadata pages (if
any). The issue we've hit has been with rgrps and in particular if the
total dirty data associated with rgrps exceeds the per-bdi dirty limit.

In that case we found that writeback was spinning without making any
progress since it was trying to write back inodes (all by that stage
clean) and it didn't have any way to start writeback on rgrps. So the
simplest solution was to check the dirty_exceeded flag during inode
writeback, and if set, try writing back more data than actually requested
via the ail list. This list contains all the dirty metadata, so it
includes the rgrps too. Due to the way in which rgrps are used, it is
impossible to dirty one without also dirtying at least one inode.
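
To make the stall and the workaround a bit more concrete, here is a
rough userspace toy model of the logic described above. It is only a
sketch and not the GFS2 code: every type and function name in it
(toy_fs, meta_block, flush_ail and so on) is made up for illustration,
and the single array merely stands in for the ail list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model. Every name here is hypothetical, not a kernel API. */

enum owner_kind { OWNER_INODE, OWNER_RGRP };

struct meta_block {
	enum owner_kind kind;	/* who owns this metadata block */
	int owner;		/* inode number or rgrp index */
	bool dirty;
};

struct toy_fs {
	struct meta_block *ail;	/* stands in for the ail list */
	size_t ail_len;
	bool dirty_exceeded;	/* stands in for the per-bdi flag */
};

/* Count the blocks that still need writing. */
size_t count_dirty(const struct toy_fs *fs)
{
	size_t i, n = 0;

	assert(fs != NULL);
	for (i = 0; i < fs->ail_len; i++)
		if (fs->ail[i].dirty)
			n++;
	return n;
}

/* Normal path: clean only the metadata owned by inode 'ino'. */
size_t write_inode_metadata(struct toy_fs *fs, int ino)
{
	size_t i, written = 0;

	assert(fs != NULL);
	for (i = 0; i < fs->ail_len; i++) {
		struct meta_block *b = &fs->ail[i];

		if (b->kind == OWNER_INODE && b->owner == ino && b->dirty) {
			b->dirty = false;
			written++;
		}
	}
	return written;
}

/* Over-limit path: walk the whole list, so rgrp blocks are written too. */
size_t flush_ail(struct toy_fs *fs)
{
	size_t i, written = 0;

	assert(fs != NULL);
	for (i = 0; i < fs->ail_len; i++) {
		if (fs->ail[i].dirty) {
			fs->ail[i].dirty = false;
			written++;
		}
	}
	return written;
}

/* Inode writeback entry point: write more than requested when over limit. */
size_t toy_write_inode(struct toy_fs *fs, int ino)
{
	assert(fs != NULL);
	if (fs->dirty_exceeded)
		return flush_ail(fs);
	return write_inode_metadata(fs, ino);
}
```

With one dirty inode block and one dirty rgrp block, repeated calls to
toy_write_inode() with dirty_exceeded clear clean the inode block once
and then return 0 forever while the rgrp block stays dirty, which is the
spinning case. Once dirty_exceeded is set, the same call cleans the rgrp
block as well.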

In addition to that, the ordering of data blocks on the ail list is
often more optimal (especially for workloads with lots of small files)
and we get a performance improvement by doing writeback that way too.

Having said that, I know it's not ideal, and I'm open to any suggestions
for better solutions,

Steve.

