Re: UTF-8 practically vs. theoretically in the VFS API

From: H. Peter Anvin
Date: Wed Feb 18 2004 - 00:31:53 EST


Linus Torvalds wrote:

On Tue, 17 Feb 2004, H. Peter Anvin wrote:

Well, the reason you'd want an out-of-band mechanism is to be able to
display it as some kind of escapes.


I'd suggest just doing that when you convert the utf-8 format to printable format _anyway_. At that point you just make the "printable" representation be the binary escape sequence (which you have to have for other non-printable utf-8 characters anyway).


What does "printable" mean in this context? Typically you have to convert it to UCS-4 first, so you can index into your font tables, then you have to create the right composition, apply the bidirectional text algorithm, and so forth.

Rendering general Unicode text is complex enough that you really want it layered. What I described what the first step of that -- mostly trying to show that "throwing an error" doesn't necessarily mean "produce no output." What you shouldn't do, though, is alias it with legitimate input.

And if you do things right (ie you allow user input in that same escaped output format), you can allow users to re-create the exact "broken utf-8". Which is actually important just so that the user can fix it up (ie imagine the user noticing that the filename is broken, and now needs to do a "mv broken-name fixed-name" - the user needs some way to re-create the brokenness).

Indeed. The C language has gone with \x77 for bytes and \u7777 or \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I think this is a good UI for shells to follow. The \x representation then doesn't stand for characters but for bytes. It may be desirable to disallow encoding of *valid* UTF-8 characters this way, though.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/