Re: OT: character encodings

From: Tilman Schmidt
Date: Sun Jan 07 2007 - 20:28:53 EST


Am 08.01.2007 01:38 schrieb Willy Tarreau:
> I'm not blaming UTF-8 per se, but people who still believe in encoding
> *whole documents*. Copy-paste, text insertion, git output, etc... everything
> has a good reason not to be in the same encoding as what your MUA believes.
> If major MUAs still have problems with UTF-8 10 years after it was introduced,
> it's clearly the proof of a flaw in the initial design.

Actually it's working amazingly well. In the past 15 years the frequency of
character encoding problems has gone from "frequent, that's just the way it
is, if you want your text to go through unharmed then stick to 7 bit ASCII"
to "rare, it normally works, if it doesn't then we've got a problem to solve".
Copy/paste? Works for me! Mail forwarding? Works for me!

The only thing that doesn't and cannot work is automatic conversion of data
with unknown encoding or with incorrect encoding information, and that has
to be solved at the source. If you store differently encoded text in git
without labelling it accordingly then there is just no way to retrieve it
consistently. There are two ways out of that - a complicated one: storing
encoding information with every piece of text, and a simple one - declaring
a single internal encoding to which everything must be converted upon entry
into the database. Guess which one has more chances of working well.

> And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

Personally I find the requirement to know the number of characters in a text
rather unusual, so I wouldn't base a decision for an encoding on that. In
fact, I cannot remember ever really wanting to know the actual number of
characters in a text. The number of bytes occupied on storage, ok. The
number of letters, of words, of lines, perhaps even the number of printable
characters, all potentially interesting, depending on the application.
But the raw number of characters? I don't know what that might serve for.

HTH
Tilman

--
Tilman Schmidt E-Mail: tilman@xxxxxxx
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)

Attachment: signature.asc
Description: OpenPGP digital signature