Re: huge filesystems

From: Andreas Dilger
Date: Tue Mar 15 2005 - 01:06:29 EST


On Mar 14, 2005 21:37 -0700, jmerkey wrote:
> 1. Scaling issues with readdir() with huge numbers of files (not even
> huge really. 87000 files in a dir takes a while
> for readdir() to return results). I average 2-3 million files per
> directory on 2.6.9. It can take a up to a minute for
> readdir() to return from initial reading from on of these directories
> with readdir() through the VFS.

Actually, unless I'm mistaken the problem is that "ls" (even when you
ask it not to sort entries) is doing readdir on the whole directory
before returning any results. We see this with Lustre and very large
directories. Run strace on "ls" and it is doing masses of readdirs, but
no output to stdout. Lustre readdir works OK on directories up to 10M
files, but ls sucks.

$ strace ls /usr/lib 2>&1 > /dev/null
:
:
open("/usr/lib", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=57344, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, /* 120 entries */, 4096) = 4096
getdents64(3, /* 65 entries */, 4096) = 2568
getdents64(3, /* 111 entries */, 4096) = 4088
:
:
getdents64(3, /* 59 entries */, 4096) = 2152
getdents64(3, /* 10 entries */, 4096) = 496
getdents64(3, /* 0 entries */, 4096) = 0
close(3) = 0
write(1, "Acrobat5\nalchemist\nanaconda\nanac"..., 4096) = 4096
write(1, "libbonobo-2.a\nlibbonobo-2.so\nlib"..., 4096) = 4096
write(1, "ibgdbm.so\nlibgdbm.so.2\nlibgdbm.s"..., 4096) = 4096
write(1, "nica_qmxxx.la\nlibgphoto_konica_q"..., 4096) = 4096
write(1, ".so\nlibIDL-2.so.0\nlibIDL-2.so.0."..., 4096) = 4096
write(1, "libkpilot.so.0\nlibkpilot.so.0.0."..., 4096) = 4096
write(1, "ove.so.0\nlibospgrove.so.0.0.0\nli"..., 4096) = 4096
write(1, ".6\nlibsoundserver_idl.la\nlibsoun"..., 4096) = 4096
write(1, "lparse.so.0\nlibxmlparse.so.0.1.0"..., 1294) = 1294


> 6. fdisk does not support drives arger than 2TB, so I have to hack the
> partition tables and fake out dsfs with 3TB abd 4TB drives created with
> RAID 0 controllers and hardware. This needs to get fixed.

Use a different partition format (e.g. EFI or devicemapper) or none at all.
That is better than just ignoring the whole thing and some user thinking
"gee, I have all this free space here, maybe I'll make another partition".

Cheers, Andreas
--
Andreas Dilger
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/

Attachment: pgp00000.pgp
Description: PGP signature