Re: UTF-8 practically vs. theoretically in the VFS API

From: bert hubert
Date: Mon Feb 16 2004 - 15:24:52 EST


On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds wrote:

> The way to handle that is to aim to never _ever_ decode utf-8 unless you
> really have to. Always leave the string in utf-8 "raw bytestring" mode as
> long as possible, and convert to charater sets only when actually
> printing.

Additional good news is that following octets in a utf-8 character sequence
always have the highest order bit set, precluding / or \x0 from appearing,
confusing the kernel.

The remaining zit is that all these represent '..':
2E 2E
C0 AE C0 AE
E0 80 AE E0 80 AE
F0 80 80 AE F0 80 80 AE
F8 80 80 80 AE F8 80 80 80 AE
FC 80 80 80 80 AE FC 80 80 80 80 AE

This in itself is not a problem, the kernel will only recognize 2E 2E as the
real .., but it does show that 'document.doc' might be encoded in a myriad
ways.

So some guidance about using only the simplest possible encoding might be
sensible, if we don't want the kernel to know about utf-8.

> And it largely works today.

Indeed.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/