Re: structure dentry help...

Jamie Lokier (lkd@tantalophile.demon.co.uk)
Tue, 2 Nov 1999 16:50:50 +0100


Alexander Viro wrote:
> > My program reads only about 5% (my guesstimate) of the inodes. The rest
> > are elided by the "leaf optimisation" (also in GNU find) or, in my
> > program, a heuristic-guided variant which finds equivalent solutions
> > with fewer inode reads.
>
> Details, please. In almost all cases when find(1) is applied to
> really large trees it involves mode conditions (either -perm or -t), i.e.
> lstat(2).

The biggest case of all is `updatedb', run on many Linux systems every
night.

There is `find . \( -name PATTERN1 -o -name PATTERN2 \) -print0 | xargs
-0 rm -f`, which you will remember from building Linux kernels...

I generate all my kernel patches using a search based on names. cf. Red
Hat's `gendiff' script.

Many programs bring up file selectors, which show (a) everything
matching a pattern and (b) everything that is a directory.

These can all be optimised quite thoroughly...

You are right that `-type' is quite commonly used. Depending on which
type you're testing, you may be able to answer the question without a
stat(), or at least give a partial answer, i.e. one where only two of
"yes", "no" and "don't know" can ever come back. `-type d' is a fine
example, as is `! -type d'...
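
As a rough illustration of what a partial answer looks like (this is a
sketch, not treescan's actual code; the function and its arguments are
hypothetical):

    /* Hypothetical sketch: a three-valued answer to "-type d" for
     * one directory entry.  known_dir is set when the entry has
     * already been identified as a directory; subdirs_remaining
     * comes from the leaf optimisation described below. */
    enum tristate { T_NO, T_YES, T_DONT_KNOW };

    static enum tristate matches_type_d(int known_dir, int subdirs_remaining)
    {
        if (known_dir)
            return T_YES;
        if (subdirs_remaining == 0)
            return T_NO;        /* every subdirectory accounted for */
        return T_DONT_KNOW;     /* would need an lstat() to be sure */
    }

`-type d' then needs a stat() only for the T_DONT_KNOW entries, and
`! -type d' is the same answers with "yes" and "no" swapped.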

> What cases you are optimizing? And how are you doing it,
> BTW?

Many optimisations, starting from the ones present in the GNU & BSD
finds. I have begun writing a document describing them, which you'll
find in the next version of treescan. However, the file is called
ALGORITHM and I am still designing the strategy algorithm, which is
rather an interesting problem, so the release may take some time :-)

Here is a relevant extract. I see my guesstimate of 95% elided was
wrong: the figure of 2/3 below was determined experimentally. However,
it will get larger when I add the name-based heuristic:

Optimisation #2: the "leaf" optimisation (GNU & BSD)

Unix filesystems follow this convention: the link count of a
directory (that is not itself hard-linked) is two plus the number
of subdirectories. The count works like this: one for the link
from its parent, i.e. the name of the directory; one for the
entry "." in the directory itself; and one for each ".." entry
in its subdirectories, because those refer back to this
directory.

Directories that are hard-linked have a higher link count (this is
rare, and many systems don't allow it at all).

So if you know the link count is two, you immediately know the
directory contains no subdirectories, and there is no need to call
stat() in step 2b just to find this out. If the link count is
higher, you only have to call stat() enough times until you've
found all the subdirectories; then you know the remaining entries
are not subdirectories.

This saves about 2/3 of all stat() calls. stat() calls are
expensive because they typically involve reading "inodes" from
disk for every file statted. Avoiding stat() calls means those
disk accesses are avoided as well.
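
To make the mechanics concrete, here is a minimal sketch in C (not
treescan's actual code; it assumes the 2+N link-count convention and
falls back to statting everything when the count looks wrong):

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <limits.h>

    static void scan(const char *path)
    {
        struct stat st;
        struct dirent *de;
        DIR *dir;
        long subdirs_left;

        if (lstat(path, &st) != 0)
            return;

        /* The convention: link count = 2 + number of subdirectories. */
        subdirs_left = (long)st.st_nlink - 2;
        if (subdirs_left < 0)
            subdirs_left = LONG_MAX;  /* convention violated: stat everything */

        dir = opendir(path);
        if (!dir)
            return;

        while ((de = readdir(dir)) != NULL) {
            char child[4096];
            struct stat cst;

            if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
                continue;

            if (subdirs_left <= 0)
                continue;   /* all subdirectories accounted for: the
                             * remaining entries are known not to be
                             * directories, so no lstat() to prune */

            snprintf(child, sizeof child, "%s/%s", path, de->d_name);
            if (lstat(child, &cst) == 0 && S_ISDIR(cst.st_mode)) {
                subdirs_left--;
                scan(child);          /* recurse into the subdirectory */
            }
        }
        closedir(dir);
    }

Note that a hard-linked directory only inflates the link count, so
this sketch merely makes some unnecessary lstat() calls in that case;
it never skips a real subdirectory.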

I do not have a good description of the name-driven heuristic yet, so
here is a summary: you can guess in advance which entries are
directories and test them first. Combined with the leaf optimisation,
this makes it more likely that stat() calls can be eliminated. On
certain directory structures, especially news spools (even my local
leafnode cache), it is easy to guess well; and because a guess only
affects the order of the stat() calls, the scan still gets the right
answer even when it guesses wrong.
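
The exact heuristic isn't spelled out above, so the guess function
below is purely hypothetical; it only shows the shape of the idea,
which is to order the lstat() calls so that probable directories come
first and the subdirectory budget from optimisation #2 runs out
sooner:

    /* Purely hypothetical guess function -- treescan's real
     * heuristic is not described here.  Return nonzero if the
     * name looks like a directory, so likely directories can be
     * lstat()ed first. */
    static int looks_like_directory(const char *name)
    {
        const char *p;
        int all_digits = 1;

        for (p = name; *p; p++) {
            if (*p == '.')
                return 0;           /* "foo.c": probably a file */
            if (*p < '0' || *p > '9')
                all_digits = 0;
        }
        /* All-digit names are usually articles in a news spool,
         * i.e. files; other dot-free names might well be
         * directories. */
        return !all_digits;
    }

On a news spool, where nearly every name is all digits, even a crude
guesser like this one is right almost every time.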

> > [ btw, did you see my patch for adding d_type behaviour to the readdir()
> > callback for all filesystems? ]
>
> Yes. I'ld still like to understand _what_ are you doing before commenting
> on that.

You can do even better in more cases if you know the type of an entry
before reading its inode. This affects not just the elision of stat()
calls, but also O_DIRECTORY and intermediate pathname lookups. BSD has
had this for a long time, and GNU Libc is waiting for the kernel to
implement it:

Optimisation #3: the "dt_type" optimisation (BSD only)

Reading a directory on Unix operating systems returns a list of
names in the directory, and for each one, an inode number. We'll
ignore the inode number for the moment. To get more information
about an entry, you call stat(), which reads the inode.

One day, someone noticed that the type of the thing pointed to by a
directory entry doesn't change during the lifetime of the entry*,
so it's quite efficient to store the type along with the name, in
the directory itself. BSD uses this: reading a directory on BSD
returns the type along with the name and inode for each entry.

[ * very old Unix systems did allow the type to change ]

When this information is available, step 2b can be skipped
entirely: no need to call stat() just to decide which entries are
subdirectories.

GNU Libc as used on GNU/Linux also has the dt_type field, but as of
Glibc 2.1 and Linux kernel 2.3.11, the type returned is always
"unknown", and GNU Find does not try to use this information.

> Are you sure that inode reads really dominate here? I would
> expect large and nasty directory fiddling (quadratic by size) to take quite
> some time...

Most filesystems do not have any large directories, so directory search
time is negligible. Unless everything is already in memory, the
dominant cost is I/O. Some CPU time is spent on sorting and general
strategy work, but that will improve.

For a full disk scan of a 2/3 full 6.4G disk, which includes about 100
megabytes of local newsspool cache, reading inodes takes about half of
the time and reading directories takes the other half. (This is a very
rough subjective measurement based on watching messages appear between
alternating phases).

I found nearly half the total execution time is removed if I scan
everything except the newsspool directory, so the unusual directory
structure is significant. The subjective 50/50 split still applies.

The 2/3 stat() measurement was taken on another disk, with most of its
3G or so used, that does not have a local newsspool cache, and in
neither case have I used the name heuristic yet.

> BTW, lookup() is protected by lock on the parent, so that might be the
> reason for contention.

Another reason for "dentry without inode"? Can you avoid locking the
parent if you only have to do an iget()?

An aside:

At some point it may be possible to optimise Make using these
strategies. Make normally does a lot of stat() calls to try out
implicit rules. If it used a decent strategy, collecting filenames
with readdir() and limiting the stat() calls, that might be a win.
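
A sketch of that idea (not anything Make actually does; the types and
names here are invented for illustration):

    #include <dirent.h>
    #include <stdlib.h>
    #include <string.h>

    struct namelist {
        char **names;
        size_t n;
    };

    /* Read every name in a directory once, up front. */
    static struct namelist read_names(const char *dirpath)
    {
        struct namelist nl = { NULL, 0 };
        DIR *dir = opendir(dirpath);
        struct dirent *de;

        if (!dir)
            return nl;
        while ((de = readdir(dir)) != NULL) {
            char **grown = realloc(nl.names, (nl.n + 1) * sizeof *grown);
            if (!grown)
                break;          /* out of memory: keep what we have */
            nl.names = grown;
            nl.names[nl.n++] = strdup(de->d_name);
        }
        closedir(dir);
        return nl;
    }

    /* One in-memory lookup replaces one stat() per implicit-rule
     * candidate; a hash table would beat this linear scan. */
    static int name_exists(const struct namelist *nl, const char *name)
    {
        size_t i;

        for (i = 0; i < nl->n; i++)
            if (nl->names[i] && !strcmp(nl->names[i], name))
                return 1;
        return 0;
    }

Make would read each directory once with read_names(), test every
implicit-rule candidate with name_exists(), and fall back to stat()
only for the few names that actually exist (to get their mtimes).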

Alternatively, the kernel could provide the same benefit. If readdir()
can prime the dentry cache, then when a name is not in the dentry cache
and the directory has not changed, there is no need for a lookup()...

-- Jamie
