Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)

From: Robin Rosenberg
Date: Tue Feb 17 2004 - 17:55:53 EST


On Tuesday 17 February 2004 22.16, John Bradford wrote:
> Quote from Linus Torvalds <torvalds@xxxxxxxx>:
> > On Tue, 17 Feb 2004, John Bradford wrote:
> > > > Ok, but... why? What does 32-bit words get you that UTF-8 does not?
> > > > I can't think of a single advantage, just lots of disadvantages.
> > > The advantage is that you can use them to store UCS-4.
> > Wrong. UTF-8 can store UCS-4 characters just fine.
<nitpick>Yes and no. There are no UTF-8 or UCS-4 characters. These are encodings
for Unicode characters. </nitpick>

> At the end of the day, I just don't see how your suggestion of leaving
> UTF-8 undecoded unless you're presenting it to the user is ever going
> to be practical, which brings us back to my first point, that UTF-8
> can't, in the real world, represent UCS-4 characters acceptably,
> (I.E. unambiguously).
The standard say a decode should not accept invalid UTF-8 characters. Not
decoding them and just pass the garbage on is one way of not "accepting" them; i.e.
"i'm not decoding this trash". In UTF-8 seen as a byte stream this is trivial. Those that
recode to a 16-bit encoding like QT (recode them to invalid UTF-16 so it can be encoded
back to the original invalid UTF-8). That's at least what the kopete people told me. (Haven't
read the code yet). With UCS-4 or UCS-2 the decoder must reject the data or make a very
good decision. Normalizing is a very bad one, since, as we know, an invalid UTF-8 sequence
simply does not represent a unicode character.

> > > Basically - no more multiple representations of the same thing. No more
> > > funny corner cases where several different strings of bytes eventually
> > > resolve to the same name being presented to the user.
> >
> > Welcome to normalized UTF-8. And realize that the "non-normalized" broken
> > stuff is what allows us backwards compatibility.
And then (above) there is no normalized UTF-8. There are strings of valid characters
and invalid characters. Any app that tries to make sense of any garbage is a security
risk. This apples not only to UTF-8. but function like atof that decodes anything
to a double at best NaN).

-- robin


-- robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/