shrink_dcache_sb scalability problem.

From: David Chinner
Date: Thu Apr 13 2006 - 04:22:11 EST


Folks,

After recently upgrading a build machine to 2.6.16, we started
seeing 10-50s pauses where the machine would appear to hang.
Profiles showed that we were spending a substantial amount of time
in shrink_dcache_sb, and several CPUs were spinning on the
dcache_lock.

This is happening quite frequently: in one 10 minute period we
recorded 13 incidents in which a touch/rm of a single file took
longer than 10s. The machine was close to unusable when this
happened.

At the time of the problem the machine had several million unused
cached dentries in memory (often more than 10 million), and the
builds use chroot environments with internally mounted filesystems
like /proc and /sys.

The problem is that whenever we mount /proc, /sys, /dev/pts, etc., we
call shrink_dcache_sb(), which does multiple traversals across the
unused dentry list with the dcache_lock held.

It is trivial to reduce this to one traversal for the case of a new
mount. However, that doesn't solve the issue that we are walking a
linked list of many millions of entries with a global lock held,
locking out everyone else.
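
To make the cost concrete, here is a rough user-space sketch of a
single-traversal prune. It is not the kernel code: struct entry,
struct lru, lru_add() and prune_sb_single_pass() are made-up names,
sb_id stands in for the dentry's superblock pointer, and the visited
counter stands in for the time spent holding the global lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an unused dentry; sb_id plays the role of d_sb. */
struct entry {
    int sb_id;
    struct entry *prev, *next;
};

/* Circular doubly linked list with a sentinel head. */
struct lru {
    struct entry head;
    size_t visited;   /* nodes touched while the (imaginary) global lock is held */
};

void lru_init(struct lru *l)
{
    l->head.sb_id = -1;
    l->head.prev = l->head.next = &l->head;
    l->visited = 0;
}

/* Add a new entry for sb_id at the front of the list. */
void lru_add(struct lru *l, int sb_id)
{
    struct entry *e = malloc(sizeof(*e));
    assert(e != NULL);
    e->sb_id = sb_id;
    e->next = l->head.next;
    e->prev = &l->head;
    l->head.next->prev = e;
    l->head.next = e;
}

/* Single traversal: unlink and free every entry of sb_id in one walk. */
size_t prune_sb_single_pass(struct lru *l, int sb_id)
{
    size_t freed = 0;
    struct entry *e = l->head.next;

    /* A real implementation would take the global lock here. */
    while (e != &l->head) {
        struct entry *next = e->next;  /* saved first: e may be freed below */
        l->visited++;
        if (e->sb_id == sb_id) {
            e->prev->next = e->next;
            e->next->prev = e->prev;
            free(e);
            freed++;
        }
        e = next;
    }
    /* ...and drop it here. */
    return freed;
}
```

With N entries on the list, a call for a superblock that owns none of
them (as is presumably the case for a freshly mounted /proc or /sys)
still adds N to visited. One pass instead of several only changes the
constant; the walk stays linear in the size of the whole list.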

We're open to any suggestions on how to go about fixing this, as
it's not obvious what the correct approach is. Any advice, patches,
etc. are more than welcome.

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group