Re: OT: character encodings (was: Linux 2.6.20-rc4)

From: David Woodhouse
Date: Sun Jan 07 2007 - 11:29:11 EST


On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
> On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> > On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > > Russell King schrieb:
> > > > Welcome to the mess which the UTF-8 charset creates.
> >
> > Utter bollocks.
>
> Wrong. The problem is partly caused by not everything understanding
> multi-byte character encodings,

No, that's a different problem; not the one you were referring to above.
And it's a problem which is rapidly diminishing, too.

> and text files containing absolutely
> _no_ information about their character encodings.

That's a real problem, yes -- but it was a problem long before UTF-8 was
added to the collection of character sets in use. Even within the UK, we
had to choose between ISO8859-1 and ISO8859-15.
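
(To make that concrete -- with GNU iconv, and byte 0xA4, which is the one
everyone trips over -- the very same byte is the currency sign in one and
the euro sign in the other:

$ printf '\244' | iconv -f ISO8859-1 -t UTF-8
¤
$ printf '\244' | iconv -f ISO8859-15 -t UTF-8
€

Nothing in a file full of such bytes tells you which reading was intended.)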

> When a text file is stored on disk, there's no way to tell what
> character set the characters in that file belong to. As a result,
> ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> UTF-8 folk assume all text files are UTF-8 encoded. This leads to
> utter confusion.

Only if you are making different assumptions about the _same_ set of
files, on the _same_ system. But that would be silly.

If I suddenly "assume" that my laptop has a Dvorak keyboard layout
despite that blatantly not being true, I'll get the same kind of
confusion. That isn't Dvorak's fault, either.

If, on the other hand, I have one system which is entirely ISO8859-1 and
a separate system which is entirely UTF-8, each of those is _fine_ and
unconfusing. Obviously I have to make sure files are properly labelled
and converted in transport between different systems -- but that's
nothing new.
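
(The conversion step itself is trivial -- here's a hypothetical notes.txt
travelling from the Latin-1 box to the UTF-8 one, assuming GNU iconv:

$ iconv -f ISO8859-1 -t UTF-8 notes.txt > notes.utf8.txt

It's the _labelling_ that people forget, not the converting.)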

> To see what I mean, try the following:
>
> $ git log | head -n 1000 > o
> $ file -i o
> o: text/x-c; charset=iso-8859-1
>
> According to that, the charset of the 'git log' output (which on that
> test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> was right to include it as ISO-8859-1.

Yes. When you stored it on disk, the character set information was lost.
If you're running a mixed-charset system, then attempting to recreate
the lost information with heuristics and assumptions is obviously going
to be problematic.
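
To see how little the heuristics have to work with, here's the same word
encoded both ways, in a UTF-8 locale -- on disk it's just bytes:

$ printf 'café' | iconv -t ISO8859-1 | od -An -tx1
 63 61 66 e9
$ printf 'café' | od -An -tx1
 63 61 66 c3 a9

All file(1) can do is guess which of those patterns it's looking at.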

Actually, because UTF-8 allows me to run a system which is purely based
on a single character set, I get better results when I try the same
trick:
shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
shinybook /shiny/git/mtd-2.6 $ file -i o
o: text/plain; charset=utf-8

Again, the problem of labelling isn't at all new to UTF-8. The only
thing that's new with UTF-8 is that it's now actually _practical_ to
have a system which only uses one character set throughout, and which
thus _can_ get its 'guess' right when you don't bother to label
everything.
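
(A sketch of what "one character set throughout" means in practice: on a
typical all-UTF-8 box, the locale says it all --

$ locale | head -n 1
LANG=en_GB.UTF-8

-- and every tool that honours the locale then reads and writes UTF-8.)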

> In reality, the output from git log contains an ad-hoc collection of
> character sets making its interpretation under any one character set
> incorrect.

No, the contents of the git log ought to be UTF-8, unless people have
been misusing it. Git stores its text in UTF-8 (by default), and is
capable of converting to and from legacy character sets on input
(git-commit) and output (git-log).
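
(For anyone stuck typing in a legacy charset, git already has knobs for
this -- i18n.commitencoding and i18n.logoutputencoding. A sketch for a
Latin-1 user:

$ git config i18n.commitencoding ISO8859-1
$ git config i18n.logoutputencoding ISO8859-1

The first declares what your commit messages are written in; the second
makes git-log convert back to it on output.)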

(Obviously, that's likely to be lossy if you convert it to any given
legacy character set, because ∀ legacy character set, ∃ characters
within UTF-8 that aren't in that legacy character set.)

> > Far from being the cause of the problem, UTF-8 actually offers the
> > chance of a _solution_. Because once the Luddites catch up, it'll
> > largely eliminate the need for using the multitude of legacy character
> > sets and converting between them -- and the problem of mislabelling will
> > fairly much go away.
>
> In other words, the UTF-8 luddites require the entire Internet to
> upgrade to UTF-8 for UTF-8 to work properly.

Not at all. The problems arise when character set information is lost,
which can happen at any point during the flow of information.

Anything we can do to reduce the likelihood of charset information being
lost is an overall improvement. We've already seen one example
(git-log > o; file -i o) where a _consistent_ system gets the guess
right, while a mixed-charset system gets it wrong.

If any individual system processes all text in a single character set,
then that system is no longer a likely source of corruption due to
labelling errors. And because UTF-8 fully covers the set of characters
which can be represented in the legacy character sets, it allows us to
deploy systems which do just that.
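
(A quick sanity check of that claim with iconv: legacy -> UTF-8 -> legacy
round-trips the bytes exactly...

$ printf '\244' | iconv -f ISO8859-15 -t UTF-8 | iconv -f UTF-8 -t ISO8859-15 | od -An -tx1
 a4

...while the reverse direction, per the ∀/∃ point above, simply errors
out when it hits a character the legacy set can't represent.)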

> I _regularly_ struggle with idiotic programs that assume that the world
> is UTF-8 and nothing else.

I don't think I've encountered such a program in my distribution of
choice. If I had, I would have filed a bug. Making assumptions about
character sets, outside of the locally-controlled environment, is
invalid. That's been true since the first 8-bit character sets, if not
longer.

> So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> is UTF-8 enabled. If you're operating in a mixed charset environment
> it's one bloody big pain in the butt.

A mixed charset environment was _already_ a pain in the butt, because
almost nobody got labelling right. It's wrong to blame that on UTF-8.

--
dwmw2
