Re: A Great Idea (tm) about reimplementing NLS.

From: Måns Rullgård
Date: Fri Jun 17 2005 - 08:25:53 EST


lsorense@xxxxxxxxxxxxxxxxxxx (Lennart Sorensen) writes:
> You have probably slightly misunderstood UTF-8 at least. UTF-8 tries very
> hard to make sure you can't mistake its characters for ASCII, so the
> first byte of a multi-byte character consists of some 1s followed by one
> 0. The number of 1s indicates how many bytes the character occupies;
> after the 0, the remaining bits store bits of the character code. The
> remaining bytes are all 10xxxxxx, each storing another 6 bits of the
> character code. One is required to use the shortest form of UTF-8 that
> can store the character you are encoding.
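
The byte layout described in the quoted text can be sketched roughly as
follows. This is only an illustration, not code from any NLS
implementation; the helper name utf8_encode is made up for the example,
and it assumes the modern limit of four bytes per character (code points
up to U+10FFFF).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point in its shortest UTF-8 form (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # Plain ASCII: a single byte 0xxxxxxx.
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx.
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Picking the branch by the size of the code point is what enforces the
shortest-form rule: a value below 0x80 can never come out as a two-byte
sequence. A conforming decoder should likewise reject overlong input
such as the bytes C0 80.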

Some characters can be represented by several code point sequences whose
UTF-8 encodings are equally short. For instance, a character with
multiple diacritics can have them applied in different orders. One of
these orderings is designated the canonical form, and should be used in
preference to the others. Those things, among others, are what make
Unicode difficult to deal with.
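
To make the diacritic-ordering point concrete, here is a small sketch
using Python's standard unicodedata module. The sample strings and the
helper name canonically_equal are chosen purely for illustration; the
example takes the letter "a" with a combining dot below (U+0323) and a
combining circumflex (U+0302) applied in both orders.

```python
import unicodedata

# Same visible character, combining marks in two different orders.
dot_then_circumflex = "a\u0323\u0302"
circumflex_then_dot = "a\u0302\u0323"

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The raw strings, and therefore their UTF-8 bytes, differ,
# although both encodings are five bytes long.
raw_differs = dot_then_circumflex.encode("utf-8") != circumflex_then_dot.encode("utf-8")

# After normalization both collapse to one canonical sequence.
same_after_nfd = canonically_equal(dot_then_circumflex, circumflex_then_dot)
```

A byte-wise comparison treats the two spellings as different names, so
anything that compares file names would have to normalize first, or else
accept that two names which look identical can coexist.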

--
Måns Rullgård
mru@xxxxxxxxxxxxx