Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

From: Greg Thelen
Date: Thu Mar 17 2011 - 00:42:35 EST

On Wed, Mar 16, 2011 at 2:52 PM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> On Wed, Mar 16, 2011 at 02:19:26PM -0700, Greg Thelen wrote:
>> On Wed, Mar 16, 2011 at 6:13 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>> > On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
>> >> I think even for background we shall have to implement some kind of logic
>> >> where inodes are selected by traversing memcg->lru list so that for
>> >> background write we don't end up writting too many inodes from other
>> >> root group in an attempt to meet the low background ratio of memcg.
>> >>
>> >> So to me it boils down to coming up a new inode selection logic for
>> >> memcg which can be used both for background as well as foreground
>> >> writes. This will make sure we don't end up writting pages from the
>> >> inodes we don't want to.
>> >
>> > Originally for struct page_cgroup reduction, I had the idea of
>> > introducing something like
>> >
>> >        struct memcg_mapping {
>> >                struct address_space *mapping;
>> >                struct mem_cgroup *memcg;
>> >        };
>> >
>> > hanging off page->mapping to make memcg association no longer per-page
>> > and save the pc->memcg linkage (it's not completely per-inode either,
>> > multiple memcgs can still refer to a single inode).
>> >
>> > We could put these descriptors on a per-memcg list and write inodes
>> > from this list during memcg-writeback.
>> >
>> > We would have the option of extending this structure to contain hints
>> > as to which subrange of the inode is actually owned by the cgroup, to
>> > further narrow writeback to the right pages - iff shared big files
>> > become a problem.
>> >
>> > Does that sound feasible?
>> If I understand your memcg_mapping proposal, then each inode could
>> have a collection of memcg_mapping objects representing the set of
>> memcg that were charged for caching pages of the inode's data.  When a
>> new file page is charged to a memcg, then the inode's set of
>> memcg_mapping would be scanned to determine if current's memcg is
>> already in the memcg_mapping set.  If this is the first page for the
>> memcg within the inode, then a new memcg_mapping would be allocated
>> and attached to the inode.  The memcg_mapping may be reference counted
>> and would be deleted when the last inode page for a particular memcg
>> is uncharged.
> Dead-on.  Well, on which side you put the list - a per-memcg list of
> inodes, or a per-inode list of memcgs - really depends on which way
> you want to do the lookups.  But this is the idea, yes.
>>   page->mapping = &memcg_mapping
>>   inode->i_mapping = collection of memcg_mapping, grows/shrinks with [un]charge
> If the memcg_mapping list (or hash-table for quick find-or-create?)
> was to be on the inode side, I'd put it in struct address_space, since
> this is all about page cache, not so much an fs thing.
> Still, correct in general.

In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan and
Vivek have had some discussion around how memcg and writeback mesh.
In my mind, the
discussions in 8/9 are starting to blend with this thread.

I have been thinking about Johannes' struct memcg_mapping. I think the idea
may address several of the issues being discussed, especially
interaction between
IO-less balance_dirty_pages() and memcg writeback.

Here is my thinking. Feedback is most welcome!

The data structures:
- struct memcg_mapping {
struct address_space *mapping;
struct mem_cgroup *memcg;
int refcnt;
- each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
- each memcg_bdi contains a mapping from inode to memcg_mapping. This may be a
very large set representing many cached inodes.
- each memcg_mapping represents all pages within an bdi,inode,memcg. All
corresponding cached inode pages point to the same memcg_mapping via
pc->mapping. I assume that all pages of inode belong to no more than one bdi.
- manage a global list of memcg that are over their respective background dirty
- i_mapping continues to point to a traditional non-memcg mapping (no change
- none of these memcg_* structures affect root cgroup or kernels with memcg
configured out.

The routines under discussion:
- memcg charging a new inode page to a memcg: will use inode->mapping and inode
to walk memcg -> memcg_bdi -> memcg_mapping and lazily allocating missing
levels in data structure.

- Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
memcg. If refcnt drops to zero, then remove memcg_mapping from the memcg_bdi.
Also delete memcg_bdi if last memcg_mapping is removed.

- account_page_dirtied(): nothing new here, continue to set the per-page flags
and increment the memcg per-cpu dirty page counter. Same goes for routines
that mark pages in writeback and clean states.

- mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage if above
background limit, then add memcg to global memcg_over_bg_limit list and use
memcg's set of memcg_bdi to wakeup each(?) corresponding bdi flusher. If over
fg limit, then use IO-less style foreground throttling with per-memcg per-bdi
(aka memcg_bdi) accounting structure.

- bdi writeback: will revert some of the mmotm memcg dirty limit changes to
fs-writeback.c so that wb_do_writeback() will return to checking
wb_check_background_flush() to check background limits and being
interruptible if
sync flush occurs. wb_check_background_flush() will check the global
memcg_over_bg_limit list for memcg that are over their dirty limit.
wb_writeback() will either (I am not sure):
a) scan memcg's bdi_memcg list of inodes (only some of them are dirty)
b) scan bdi dirty inode list (only some of them in memcg) using
inode_in_memcg() to identify inodes to write. inode_in_memcg(inode,memcg),
would walk memcg- -> memcg_bdi -> memcg_mapping to determine if the memcg
is caching pages from the inode.

- over_bground_thresh() will determine if memcg is still over bg limit.
If over limit, then it per bdi per memcg background flushing will continue.
If not over limit then memcg will be removed from memcg_over_bg_limit list.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at