Re: JFS default behavior / UTF-8 filenames

From: kernel
Date: Thu Feb 19 2004 - 05:59:49 EST


So then, just about everyone agrees that if you've got a filename with
non-ASCII characters, you should pass it to creat() as UTF-8. You have
to pass it as something, individual encodings like BIG5 and EUC-JP
are unacceptable, and UCS-4's benefits over UTF-8 (simplicity and in
VERY rare cases storage size reductions) aren't worth the stuff it
breaks. Correct?

As I see it, there's no way for the kernel to deal with all the legacy
filenames out there. There's no way the kernel can magically fix them.

So the only thing the kernel could do for those who want to see valid
unicode is have an option to make UTF-8 only filesystems. Best would be
if it was done at mkfs time and always enforced from then on in so that
a non-UTF8 filename can never be created. Because if you want the kernel
to not pass non-UTF8 filenames back to userspace, the ONLY clean way to
do that is to make sure they're not there in the first place. You could
maybe try it with a mount=utf8only flag, but the only thing that could
do then would be to make the files with invalid filenames "disappear".

For filesystems like JFS and NTFS, I think this is the best way in the
long run, have the kernel output as UTF-8 by default, assume UTF-8
inputs, and reject non-UTF8 filenames because they can't really store
the arbitrary string of bytes model anyway.

For others which can, maybe leave it up to the filesystem creator
whether to reject non-UTF8 filenames or to accept invalid ones as well?
Either way, a well-written userspace app shouldn't barf on recieving
invalid UTF-8 from the kernel, we'll have legacy filenames around for a
good long time yet, and it's the only way to be portable to older
linuxes and other UNIXes where you definatly would not be guaranteed
valid UTF-8 no matter what new linux kernels decide.

In any case, the important part is to make sure userspace stops writing
filenames in BIG5 as soon as possible. I don't know if this can be done
nicely in libc, with libc automagically transforming the BIG5 filename
in open() to UTF-8 and the UTF-8 in readdir() to BIG5 based on the
locale, or if we have to rely on every userspace app to store filenames
in UTF-8 by themselves. But that's a decision for the glibc guys. It
doesn't affect that filenames need to start being written to the
filesystem in UTF-8 rather than other encodings, and that the only
decision the kernel has to make is whether or not to reject attempts to
create filenames which are invalid UTF-8.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/