Re: structure dentry help...

Jamie Lokier (lkd@tantalophile.demon.co.uk)
Tue, 2 Nov 1999 18:00:14 +0100


Alexander Viro wrote:
> Aaaargh. Updatedb is only one of the tree scanners and usual sequence of
> cron jobs involves, for example, scan for suid binaries + several more
> things. Sorry, it's a long-standing pet peeve - never got to sit and
> change updatedb so that it would store mode. That would allow to cut down
> on the other scans. Big way.

My long-standing pet peeve is that there's no efficient way to ask "has
this directory / file changed since I last asked?".

That would make all manner of scans really fast, especially recursive
makes and updatedb.
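
The closest you can get today is something like the sketch below (names
and sizes made up, error handling skipped): remember the mtime you saw
last time and stat() again to compare. The point is that even the "did
anything change?" question costs a stat() per directory, and mtime does
not catch every kind of change anyway.

#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>
#include <time.h>

/* One remembered directory/file from the previous scan. */
struct seen {
	char path[4096];
	time_t mtime;
};

/* Returns 1 if 'path' looks changed since 'old' was recorded.
 * Note the stat() we would love to avoid. */
static int maybe_changed(const struct seen *old, const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return 1;			/* gone or unreadable: treat as changed */
	if (strcmp(old->path, path) != 0)
		return 1;			/* no record kept for this path */
	return st.st_mtime != old->mtime;	/* mtime also misses some changes */
}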

> > Optimisation #2: the "leaf" optimisation (GNU & BSD)
> >
> > Unix filesystems follow this convention: the link count of a
> > directory (that is not itself hard linked) is two plus the number
> > of subdirectories. The count works like this: one for the link
> > from its parent -- i.e. the name of the directory; one for the
> > entry "." in the directory itself, and one each for every ".."
> > entry in subdirectories of the directory because they refer back to
> > this directory.
>
> AFFS fscks that up. The thing has no i_nlink. By design ;-<

That's ok. Lots of filesystems have no i_nlink. As long as they return
1 or even 0, they don't fsck this up at all: a link count below 2 just
tells the scanner not to trust the optimisation and to stat() everything.

BTW, the d_type optimisation is better than the i_nlink one anyway, and
appears to be possible for AFFS, though I didn't implement it. AFFS is
too obscure for me to test...
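
For the record, here is roughly what the leaf optimisation looks like
from userspace (a sketch only, not GNU find's actual code; names and
buffer sizes are made up). Note the nlink < 2 case: that is why
filesystems returning 1 or 0 are harmless, the scanner just falls back
to stat()ing everything.

#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <string.h>
#include <stdio.h>

static void scan(const char *path)
{
	struct stat st;
	DIR *d;
	struct dirent *de;
	char child[4096];
	long links_left;

	if (lstat(path, &st) != 0)
		return;
	d = opendir(path);
	if (!d)
		return;

	/* Convention: 2 + number of subdirectories; < 2 means "don't trust it". */
	links_left = (st.st_nlink >= 2) ? (long)st.st_nlink - 2 : -1;

	while ((de = readdir(d)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;

		if (links_left == 0) {
			/* Leaf case: this entry cannot be a directory, so
			 * its inode never needs to be read at all. */
			printf("file: %s/%s\n", path, de->d_name);
			continue;
		}

		snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
		if (lstat(child, &st) == 0 && S_ISDIR(st.st_mode)) {
			if (links_left > 0)
				links_left--;
			scan(child);
		} else {
			printf("file: %s/%s\n", path, de->d_name);
		}
	}
	closedir(d);
}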

> > This saves about 2/3 of all stat() calls. stat() calls are
> > expensive because they typically involve reading "inodes" from
> > disk for every file statted. Avoiding stat() calls means those
> > disk accesses are avoided as well.
>
> I would beg to differ. Number of stat(2) calls is not a sensible metrics
> here: you have several inodes in a block and stat() on any of them brings
> the whole bunch into core. IOW, if inumbers are scattered you are not
> winning much.

Indeed, but I did not have a way to measure the number of block accesses.
(I still haven't looked at the recent tools for this.)

Ext2 has interesting inumber properties: the non-directory inumbers are
clustered, but the directory inumbers are scattered.

When inumbers are scattered, you win by reducing seek times. There are a
variety of strategies for reducing seek times; sorting by inumber is one
(not necessarily the best, and not brilliant by itself in most cases).
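
The inumber sort is trivial to sketch from userspace (illustration
only; the array handling and sizes are made up): readdir() hands you
d_ino for free, so you collect the names, order the stat() calls by
inumber, and the inode reads become roughly sequential instead of
scattered.

#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

struct ent {
	ino_t ino;
	char name[256];
};

static int by_ino(const void *a, const void *b)
{
	const struct ent *x = a, *y = b;
	return (x->ino > y->ino) - (x->ino < y->ino);
}

static void stat_in_inumber_order(const char *dirpath)
{
	DIR *d = opendir(dirpath);
	struct dirent *de;
	struct ent *v = NULL;
	size_t n = 0, cap = 0, i;
	struct stat st;
	char path[4096];

	if (!d)
		return;
	while ((de = readdir(d)) != NULL) {
		if (n == cap) {
			cap = cap ? cap * 2 : 64;
			v = realloc(v, cap * sizeof(*v));
			if (!v) {
				closedir(d);
				return;
			}
		}
		v[n].ino = de->d_ino;		/* the inumber comes free with readdir() */
		strncpy(v[n].name, de->d_name, sizeof(v[n].name) - 1);
		v[n].name[sizeof(v[n].name) - 1] = '\0';
		n++;
	}
	closedir(d);

	qsort(v, n, sizeof(*v), by_ino);	/* inumber order, not name order */

	for (i = 0; i < n; i++) {
		snprintf(path, sizeof(path), "%s/%s", dirpath, v[i].name);
		lstat(path, &st);		/* inode reads now roughly sequential */
	}
	free(v);
}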

Because of this, you may be right (for ext2). OTOH, the leaf
optimisation's greatest effect is with directories that contain no
subdirectories at all, or nearly none, so whole clusters do tend to be
skipped.

> > You can do even better in more cases if you know the type of an entry
> > before reading its inode. This affects not just eliding stat() calls,
> > but also O_DIRECTORY, and intermediate path name lookups. BSD has had
>
> Big win for 4.4BSD, but we have dcache... Again, it's a nice idea, but
> I'ld like to see some data.

d_type is not to do with caching: it's to do with avoiding stat()
calls. The dcache does not help with this.
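
To make it concrete, this is all a scanner has to do once d_type is
there (a sketch; DT_DIR / DT_UNKNOWN are the BSD-style constants). The
stat(), and the inode read behind it, only happens when the entry says
"unknown":

#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>

/* Is this directory entry a subdirectory?  With d_type we usually know
 * without touching the inode; DT_UNKNOWN means the filesystem did not
 * say, so fall back to the old lstat(). */
static int entry_is_dir(const char *path, const struct dirent *de)
{
	struct stat st;

	if (de->d_type != DT_UNKNOWN)
		return de->d_type == DT_DIR;	/* no inode read at all */
	if (lstat(path, &st) != 0)
		return 0;
	return S_ISDIR(st.st_mode);
}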

Until d_type is implemented and I have a large, realistic ext2
filesystem with the TYPE feature enabled, I can't give you data...

Last time I checked, enabling the TYPE feature requires hacking mke2fs,
and you can't do it later (say, in tune2fs). I do not have any spare
disks to try this on ATM.

> When you are "reading inodes" you are doing lookups too. And that
> _can_ be eliminated (from O(n^2) to O(n)).

Yes it can. One way is dentries without inodes ;-)

> > I found nearly half the total execution time is removed if I scan
> > everything except the newsspool directory, so the unusual directory
> > structure is significant. The subjective 50/50 split still applies.
>
> Sure it is. It's a helluva fat directory, so effects of search will play
> here. Big way. Notice that readdir() is O(n), but calling stat() on
> everything (especially when reordered) will give O(n^2) lookups and O(n)
> inode reads,

2.19user 13.00system 3:56.59elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (79major+1778minor)pagefaults 0swaps

CPU time used: 6%. Lookup is not very significant in that test, but
it's noticeable. I think most of the time is going to the case I don't
properly handle: the kernel can't keep enough inodes / directory
buffers in memory.

I will eventually handle the inode problem by tracking them explicitly,
but that makes the optimisation algorithm quite tricky. I can't do
anything about the directory buffers; I just have to hope they do not
keep flushing unexpectedly during a scan.

BTW, will the recent dynamic table changes help with this, by making it
probable that the inode/dentry tables won't flush during a scan? (There
are about 10,000 directories and 200,000 files in that test, on a 64M
machine).

> and I don't see how to distinguish between reading inode and
> doing lookup - both happen in one syscall.

Reading an inode should not use much CPU time: it blocks for I/O.
Lookup time is 100% CPU.

> You forgot about the memory pressure _and_ about the fact that negative
> dentries are used as templates for create() and friends.

Oops, slipped ;-)

-- Jamie
