Re: [PATCH 1/6] cpuset write dirty map

From: Andrew Morton
Date: Fri Sep 14 2007 - 19:16:24 EST


On Tue, 11 Sep 2007 18:36:34 -0700
Ethan Solomita <solo@xxxxxxxxxx> wrote:

> Add a dirty map to struct address_space

I get a tremendous number of rejects trying to wedge this stuff on top of
Peter's mm-dirty-balancing-for-tasks changes. More rejects than I am
prepared to partially-fix so that I can usefully look at these changes in
tkdiff, so this is all based on a quick peek at the diff itself..

> In a NUMA system it is helpful to know where the dirty pages of a mapping
> are located. That way we will be able to implement writeout for applications
> that are constrained to a portion of the memory of the system as required by
> cpusets.
>
> This patch implements the management of dirty node maps for an address
> space through the following functions:
>
> cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes
>
> cpuset_update_nodes(mapping, page) Record a node in the dirty nodes map
>
> cpuset_init_dirty_nodes(mapping) First time init of the map
>
>
> The dirty map may be stored either directly in the mapping (for NUMA
> systems with less then BITS_PER_LONG nodes) or separately allocated
> for systems with a large number of nodes (f.e. IA64 with 1024 nodes).
>
> Updating the dirty map may involve allocating it first for large
> configurations. Therefore we protect the allocation and setting
> of a node in the map through the tree_lock. The tree_lock is
> already taken when a page is dirtied so there is no additional
> locking overhead if we insert the updating of the nodemask there.
>
> The dirty map is only cleared (or freed) when the inode is cleared.
> At that point no pages are attached to the inode anymore and therefore it can
> be done without any locking. The dirty map therefore records all nodes that
> have been used for dirty pages by that inode until the inode is no longer
> used.
>

It'd be nice to see some discussion regarding the memory consumption of
this patch and the associated tradeoffs.


> ...
>
> +#if MAX_NUMNODES <= BITS_PER_LONG

The patch is sprinkled full of this conditional.

I don't understand why this is being done. afaict it isn't described
in a code comment (it should be) nor even in the changelogs?

Given its overall complexity and its likelihood to change in the
future, I'd suggest that this conditional be centralised in a single
place. Something like

/*
* nice comment goes here
*/
#if MAX_NUMNODES <= BITS_PER_LONG
#define CPUSET_DIRTY_LIMITS 1
#else
#define CPUSET_DIRTY_LIMITS 0
#endif

Then use #if CPUSET_DIRTY_LIMITS everywhere else.

(This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll et a
warning if someone typos '#if CPUSET_DITRY_LIMITS')

Even better would be to calculate CPUSET_DIRTY_LIMITS within Kconfig,
but I suspect you'll need to jump through unfeasible hoops to do that
sort of calculation within Kconfig.


> --- 0/include/linux/fs.h 2007-09-11 14:35:58.000000000 -0700
> +++ 1/include/linux/fs.h 2007-09-11 14:36:24.000000000 -0700
> @@ -516,6 +516,13 @@ struct address_space {
> spinlock_t private_lock; /* for use by the address_space */
> struct list_head private_list; /* ditto */
> struct address_space *assoc_mapping; /* ditto */
> +#ifdef CONFIG_CPUSETS
> +#if MAX_NUMNODES <= BITS_PER_LONG
> + nodemask_t dirty_nodes; /* nodes with dirty pages */
> +#else
> + nodemask_t *dirty_nodes; /* pointer to map if dirty */
> +#endif
> +#endif

afacit there is no code comment and no changelog text which explains the
above design decision? There should be, please.

There is talk of making cpusets available with CONFIG_SMP=n. Will this new
feature be available in that case? (it should be).

> } __attribute__((aligned(sizeof(long))));
> /*
> * On most architectures that alignment is already the case; but
> diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
> --- 0/include/linux/writeback.h 2007-09-11 14:35:58.000000000 -0700
> +++ 1/include/linux/writeback.h 2007-09-11 14:37:46.000000000 -0700
> @@ -62,6 +62,7 @@ struct writeback_control {
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> void *fs_private; /* For use by ->writepages() */
> + nodemask_t *nodes; /* Set of nodes of interest */
> };

That comment is a bit terse. It's always good to be lavish when commenting
data structures, for understanding those is key to understanding a design.

> /*
> diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
> --- 0/kernel/cpuset.c 2007-09-11 14:35:58.000000000 -0700
> +++ 1/kernel/cpuset.c 2007-09-11 14:36:24.000000000 -0700
> @@ -4,7 +4,7 @@
> * Processor and Memory placement constraints for sets of tasks.
> *
> * Copyright (C) 2003 BULL SA.
> - * Copyright (C) 2004-2006 Silicon Graphics, Inc.
> + * Copyright (C) 2004-2007 Silicon Graphics, Inc.
> * Copyright (C) 2006 Google, Inc
> *
> * Portions derived from Patrick Mochel's sysfs code.
> @@ -14,6 +14,7 @@
> * 2003-10-22 Updates by Stephen Hemminger.
> * 2004 May-July Rework by Paul Jackson.
> * 2006 Rework by Paul Menage to use generic containers
> + * 2007 Cpuset writeback by Christoph Lameter.
> *
> * This file is subject to the terms and conditions of the GNU General Public
> * License. See the file COPYING in the main directory of the Linux
> @@ -1754,6 +1755,63 @@ int cpuset_mem_spread_node(void)
> }
> EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
>
> +#if MAX_NUMNODES > BITS_PER_LONG

waah. In other places we do "MAX_NUMNODES <= BITS_PER_LONG"

> +
> +/*
> + * Special functions for NUMA systems with a large number of nodes.
> + * The nodemask is pointed to from the address space structures.
> + * The attachment of the dirty_node mask is protected by the
> + * tree_lock. The nodemask is freed only when the inode is cleared
> + * (and therefore unused, thus no locking necessary).
> + */

hmm, OK, there's a hint as to wghat's going on.

It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
we might want to tweak that in the future. Yet another argument for
centralising this comparison.

> +void cpuset_update_dirty_nodes(struct address_space *mapping,
> + struct page *page)
> +{
> + nodemask_t *nodes = mapping->dirty_nodes;
> + int node = page_to_nid(page);
> +
> + if (!nodes) {
> + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);

Does it have to be atomic? atomic is weak and can fail.

If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
should at least pass the gfp_t into this function so it can do the stronger
allocation when possible.


> + if (!nodes)
> + return;
> +
> + *nodes = NODE_MASK_NONE;
> + mapping->dirty_nodes = nodes;
> + }
> +
> + if (!node_isset(node, *nodes))
> + node_set(node, *nodes);
> +}
> +
> +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> +{
> + nodemask_t *nodes = mapping->dirty_nodes;
> +
> + if (nodes) {
> + mapping->dirty_nodes = NULL;
> + kfree(nodes);
> + }
> +}

Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
a comment which describes the locking requirements would be good.

> +/*
> + * Called without the tree_lock. The nodemask is only freed when the inode
> + * is cleared and therefore this is safe.
> + */
> +int cpuset_intersects_dirty_nodes(struct address_space *mapping,
> + nodemask_t *mask)
> +{
> + nodemask_t *dirty_nodes = mapping->dirty_nodes;
> +
> + if (!mask)
> + return 1;
> +
> + if (!dirty_nodes)
> + return 0;
> +
> + return nodes_intersects(*dirty_nodes, *mask);
> +}
> +#endif
> +
> /**
> * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
> * @p: pointer to task_struct of some other task.
> diff -uprN -X 0/Documentation/dontdiff 0/mm/page-writeback.c 1/mm/page-writeback.c
> --- 0/mm/page-writeback.c 2007-09-11 14:35:58.000000000 -0700
> +++ 1/mm/page-writeback.c 2007-09-11 14:36:24.000000000 -0700
> @@ -33,6 +33,7 @@
> #include <linux/syscalls.h>
> #include <linux/buffer_head.h>
> #include <linux/pagevec.h>
> +#include <linux/cpuset.h>
>
> /*
> * The maximum number of pages to writeout in a single bdflush/kupdate
> @@ -832,6 +833,7 @@ int __set_page_dirty_nobuffers(struct pa
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> + cpuset_update_dirty_nodes(mapping, page);
> write_unlock_irq(&mapping->tree_lock);
> if (mapping->host) {
> /* !PageAnon && !swapper_space */
>
>
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/