Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

From: Wu Fengguang
Date: Sat Feb 20 2010 - 21:28:42 EST


Hi Balbir,

> > tracing: pagecache object collections
> >
> > This dumps
> > - all cached files of a mounted fs (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> >
> > Usage and Sample output:
> >
> > # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> > # tail /debug/tracing/trace
> > zsh-2528 [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> > zsh-2528 [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> > zsh-2528 [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> > zsh-2528 [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> > zsh-2528 [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> > zsh-2528 [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> > zsh-2528 [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> > zsh-2528 [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> > zsh-2528 [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> > zsh-2528 [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> >
> > Here "age" is either age from inode create time, or from last dirty time.
> >
>
> It would be nice to see mapped/unmapped information as well.

As you noticed, we have mapcount for individual pages :)

> > +static int pages_similiar(struct page* page0, struct page* page)
> > +{
> > + if (page_count(page0) != page_count(page))
> > + return 0;
> > +
> > + if (page_mapcount(page0) != page_mapcount(page))
> > + return 0;
> > +
> > + if (page_flags(page0) != page_flags(page))
> > + return 0;
> > +
> > + return 1;
> > +}
> > +
>
> OK, so pages_similar() is used to identify a range of pages in the
> cache?

Right. Many files are accessed sequentially or clustered, so
pages_similar() can save lots of output lines :)

> > +#define BATCH_LINES 100
> > +static void dump_pagecache(struct address_space *mapping)
> > +{
> > + int i;
> > + int lines = 0;
> > + pgoff_t len = 0;
> > + struct pagevec pvec;
> > + struct page *page;
> > + struct page *page0 = NULL;
> > + unsigned long start = 0;
> > +
> > + for (;;) {
> > + pagevec_init(&pvec, 0);
> > + pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> > + (void **)pvec.pages, start + len, PAGEVEC_SIZE);
>
> Is radix_tree_gang_lookup synchronized somewhere? Don't we need to
> call it under RCU or a lock (mapping) ?

No. This function is inherently non-atomic, and it seems that most in-kernel
users do not bother to take rcu_read_lock(). So lets leave it as is?

> > +static ssize_t
> > +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> > + loff_t *ppos)
> > +{
> > + struct file *file = NULL;
> > + char *name;
> > + int err = 0;
> > +
>
> Can't we use the trace_parser here?

Seems not necessary? It's merely one file name, which could contain spaces.

> > + if (count <= 1)
> > + return -EINVAL;
> > + if (count > PATH_MAX + 1)
> > + return -ENAMETOOLONG;
> > +
> > + name = kmalloc(count+1, GFP_KERNEL);
> > + if (!name)
> > + return -ENOMEM;
> > +
> > + if (copy_from_user(name, ubuf, count)) {
> > + err = -EFAULT;
> > + goto out;
> > + }
> > +
> > + /* strip the newline added by `echo` */
> > + if (name[count-1] != '\n')
> > + return -EINVAL;
>
> Doesn't sound correct, what happens if we use echo -n?

It's a bit sad. If we accept both "echo" and "echo -n" with some
smart logic to test for trailing '\n', then it will go wrong for a
'\n'-terminated file name.

Or shall we support only "echo -n"? I can do with either one.

> > --- linux-mm.orig/fs/inode.c 2010-02-08 23:19:12.000000000 +0800
> > +++ linux-mm/fs/inode.c 2010-02-08 23:19:22.000000000 +0800
> > @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
> > inode->i_bdev = NULL;
> > inode->i_cdev = NULL;
> > inode->i_rdev = 0;
> > - inode->dirtied_when = 0;
> > + inode->dirtied_when = jiffies;
> >
>
> Hmmm... Is the inode really dirtied when initialized? I know the
> change is for tracing, but the code when read is confusing.

Huh. Not really dirtied (for that you need to check I_DIRTY), but
dirtied_when is only used in writeback code when I_DIRTY is set.

So I overload dirtied_when in the clean case to indicate the inode
load time. This is a useful trick for fastboot to collect cache
footprint shortly after boot, when most inodes are clean.

It does ask for a comment:

/*
* This records inode load time. It will be invalidated once inode is
* dirtied, or jiffies wraps around. Despite the pitfalls it still
* provides useful information for some use cases like fastboot.
*/
inode->dirtied_when = jiffies;


Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/