Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)

From: John Bradford
Date: Tue Feb 17 2004 - 16:29:36 EST


Quote from Linus Torvalds <torvalds@xxxxxxxx>:
>
>
> On Tue, 17 Feb 2004, John Bradford wrote:
>
> > > Ok, but... why? What does 32-bit words get you that UTF-8 does not?
> > > I can't think of a single advantage, just lots of disadvantages.
> >
> > The advantage is that you can use them to store UCS-4.
>
> Wrong. UTF-8 can store UCS-4 characters just fine.

Does "just fine" include unambiguously? Sure, standards-conforming
UTF-8 is unambiguous, but you've already said time and again that that
doesn't happen in the real world. I just don't agree with the "UTF-8
can store UCS-4 characters just fine" claim _at all_.
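
To make the ambiguity concrete, here is a minimal Python sketch. The
function lenient_decode is a hypothetical sloppy decoder invented for
this illustration (it is not taken from any real implementation); it
accepts two-byte sequences without rejecting overlong forms. Python's
built-in codec stands in for a standards-conforming decoder.

```python
def lenient_decode(data: bytes) -> str:
    """Hypothetical sloppy decoder: handles 1- and 2-byte sequences
    and does NOT reject overlong encodings (assumption for illustration)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain 7-bit ASCII byte.
            out.append(chr(b))
            i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
              and (data[i + 1] & 0xC0) == 0x80):
            # Two-byte sequence; a conforming decoder would refuse
            # lead bytes 0xC0 and 0xC1 here, this one does not.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


def strict_accepts(data: bytes) -> bool:
    """True if Python's standards-conforming UTF-8 codec accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


canonical = b"/"        # the only conforming encoding of U+002F
overlong = b"\xc0\xaf"  # overlong two-byte form of the same code point
```

With the sloppy decoder both byte strings come out as "/", so two
different byte sequences name the same character; the strict codec
rejects the second one outright.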

> Admittedly you might need up to six octets for the worst case, but hey,
> since you only need one for the most common case (by _far_), who cares?
>
> And with the same UTF-8 encoding, you could some day encode UCS-8 too if
> the idiotic standards bodies some day decide that 4 billion characters
> isn't enough because of all the in-fighting.
>
> > Now, for file _contents_ this would be a compatibility disaster, which
> > is why UTF-8 is so convenient, but for file_names_ UCS-4 lets you
> > unambiguously represent any string of Unicode characters.
>
> Why do you think UTF-8 can't do this? Did you read some middle-aged text
> written by monks in a monastery that said that UTF-8 encodes a 16-bit
> character set?

At the end of the day, I just don't see how your suggestion of leaving
UTF-8 undecoded unless you're presenting it to the user is ever going
to be practical, which brings us back to my first point, that UTF-8
can't, in the real world, represent UCS-4 characters acceptably
(i.e. unambiguously).
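
A small Python sketch of what comparing undecoded names can look like,
assuming the names in question differ only in Unicode normalization
form. The example strings and the helper same_after_nfc are made up for
illustration; only the standard unicodedata module is used.

```python
import unicodedata

# Two spellings that a user would see as the same name.
composed = "caf\u00e9"     # 'e with acute' as one code point, U+00E9
decomposed = "cafe\u0301"  # 'e' followed by combining acute, U+0301

name_a = composed.encode("utf-8")
name_b = decomposed.encode("utf-8")


def same_raw(a: bytes, b: bytes) -> bool:
    """Compare names as opaque byte strings, with no decoding at all."""
    return a == b


def same_after_nfc(a: bytes, b: bytes) -> bool:
    """Decode, normalize both names to NFC, then compare (illustrative helper)."""
    return (unicodedata.normalize("NFC", a.decode("utf-8"))
            == unicodedata.normalize("NFC", b.decode("utf-8")))
```

Here same_raw(name_a, name_b) is false while same_after_nfc(name_a,
name_b) is true: the byte-level view and the user-visible view of the
two names disagree until something decodes and normalizes them.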

> > Basically - no more multiple representations of the same thing. No more
> > funny corner cases where several different strings of bytes eventually
> > resolve to the same name being presented to the user.
>
> Welcome to normalized UTF-8. And realize that the "non-normalized" broken
> stuff is what allows us backwards compatibility.
>
> Of course, since you like UCS-4, you don't care about backwards
> compatibility.

I don't particularly like UCS-4, I do care about backwards
compatibility, and addressed it right from the beginning.

...and I totally don't get the bit about "non-normalised" UTF-8 being
what allows backwards compatibility. Compatibility with what!?
Existing broken implementations? Real, standards-compliant UTF-8 is
fully backwards compatible with 7-bit ASCII, which is really just
about all that any standard hoping to be accepted as universal can be
expected to be compatible with.
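
The ASCII compatibility claim can be checked with a short Python
sketch. The helper names and sample strings are hypothetical, chosen
only for illustration; the checks rely on Python's conforming UTF-8
codec.

```python
def is_ascii_transparent(text: str) -> bool:
    """Pure-ASCII text must encode to the same bytes in ASCII and UTF-8."""
    return text.encode("utf-8") == text.encode("ascii")


def multibyte_bytes_have_high_bit(text: str) -> bool:
    """Every byte of every non-ASCII character must be >= 0x80, so bytes
    such as '/' (0x2F) or NUL (0x00) can never appear inside a multi-byte
    sequence."""
    return all(
        byte >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )


ascii_name = "README.txt"
mixed_path = "\u65e5\u672c\u8a9e/\u30d5\u30a1\u30a4\u30eb"  # Japanese dir and file name, one '/'
```

For mixed_path, the encoded bytes contain exactly one 0x2F, the real
path separator, which is why byte-oriented code that only looks for
'/' and NUL keeps working on conforming UTF-8.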

John.