Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

James Mastros (root@jennifer-unix.dyn.ml.org)
Thu, 21 Aug 1997 00:21:22 -0400 (EDT)


On Wed, 20 Aug 1997, Alex Belits wrote:

> On Wed, 20 Aug 1997, Peter Holzer wrote:
> > > It should be possible to _choose_ mapping as the mount option, not
> > >"UTF-8 or all filenames will be truncated to the first letter because
> > >second one is zero".
> > You are mixing up 16-Bit Unicode and UTF-8 here. In UTF-8, Unicode
> > characters 0000 to 007f are mapped to single bytes with the same value.
> > All other codes are mapped to multi-byte sequences where all bytes have
> > the MSB set.
> But if the only alternatives will be UTF-8 or "no translation at all",
> that will leave only UTF-8 usable -- taking a plain ASCII filename in the
> form in which it's stored on NTFS (16-bit Unicode) produces a string
> unsuitable for any string processing. IMHO if one wants to support such a
> thing, replaceable name-translation interfaces should be used, not
> hardcoded UTF-8.
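(For concreteness, the mapping Peter describes can be sketched in a few
lines of C -- this is my own illustration, covering just the 16-bit range
that NTFS names use, not anybody's actual kernel code:)

```c
#include <stddef.h>

/* Encode one Unicode code point (0x0000-0xFFFF) as UTF-8.
 * Code points 0000-007F map to a single byte with the same value;
 * everything else becomes a multi-byte sequence in which every
 * byte has the MSB set.  Writes at most 3 bytes; returns the count. */
static size_t utf8_encode(unsigned int cp, unsigned char out[3])
{
    if (cp < 0x80) {                          /* one byte, MSB clear */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                         /* two bytes, MSBs set */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* three bytes cover the rest of the 16-bit range */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

So plain ASCII survives unchanged, which is exactly why "most things will
mostly work" without a rewrite.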

How about this: all filesystems give textual metadata (filenames, that is)
in UTF-8 (so that most things will mostly work, before they are re-written
to deal with UTF-8 explicitly). We would have an interface for loading
translation tables. I would suggest a module (similar to kernel modules --
you would have to be able to compile these in) that calls a
register_trans(uni_to_charset_fn, charset_to_uni_fn, unload_callback,
charset_name) function. When reading/writing userspace stuff, everything
should go through the translation functions.
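(A rough userspace mock-up of what register_trans() and the lookup side
might look like -- the type and function names here are my own invention,
just to make the shape of the interface concrete:)

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the proposed registration interface.
 * A converter takes an input buffer and fills an output buffer,
 * returning the number of bytes written (or a negative error). */
typedef long (*charset_conv_fn)(const char *in, size_t in_len,
                                char *out, size_t out_len);
typedef void (*unload_fn)(void);

#define MAX_CHARSETS 8

struct charset_trans {
    const char *name;
    charset_conv_fn to_uni;    /* charset -> UTF-8 */
    charset_conv_fn from_uni;  /* UTF-8 -> charset */
    unload_fn unload;          /* called when the module goes away */
};

static struct charset_trans registry[MAX_CHARSETS];
static int n_registered;

/* Returns 0 on success, -1 if the table is full. */
int register_trans(charset_conv_fn to_uni, charset_conv_fn from_uni,
                   unload_fn unload, const char *name)
{
    if (n_registered >= MAX_CHARSETS)
        return -1;
    registry[n_registered].name = name;
    registry[n_registered].to_uni = to_uni;
    registry[n_registered].from_uni = from_uni;
    registry[n_registered].unload = unload;
    n_registered++;
    return 0;
}

/* Look up a registered translation by charset name. */
struct charset_trans *find_trans(const char *name)
{
    int i;
    for (i = 0; i < n_registered; i++)
        if (strcmp(registry[i].name, name) == 0)
            return &registry[i];
    return NULL;
}
```

A filesystem (or the console driver) would then just find_trans() the
charset it was told about at mount time.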

So, say a NTFS filesystem is being mounted.
1) The NTFS mounts as before
2) The char-table on the filesystem gets loaded into a pair of tables.
3) The tables are registered.

Now, an "ls" is done... For each filename, this is done:
4) The filename is retrieved by NTFS.
5) NTFS calls the translation function to convert to UTF-8.
6) The VFS calls the translation function to convert from UTF-8
to ls' charset. (We would have to have a way to inform the kernel what
charset user-space wants. I think /proc/charset would be best, with a
matching sysctl... The default should be UTF-8, IMHO.)

Then, ls sends the filenames out to the console (note that the current
console code does this totally differently):
7) Characters coming into /dev/tty* are passed through the translation
function and converted into UTF-8.
8) Characters are converted out to the display charset before writing them
to the physical output device.
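(Step 8, folding UTF-8 back down to a display charset, might look roughly
like this -- again my own sketch, with latin1 standing in for whatever the
display actually speaks, and '?' for anything unmappable:)

```c
#include <stddef.h>

/* Convert a UTF-8 byte stream to latin1 for a latin1-only display.
 * Decodes the 1-, 2-, and 3-byte UTF-8 forms (the whole 16-bit
 * range); code points above 0xFF, and malformed bytes, become '?'.
 * Returns the number of output bytes written. */
static size_t utf8_to_latin1(const unsigned char *in, size_t in_len,
                             unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < in_len) {
        unsigned int cp;
        if (in[i] < 0x80) {                               /* 1 byte */
            cp = in[i];
            i += 1;
        } else if ((in[i] & 0xE0) == 0xC0 && i + 1 < in_len) {
            cp = ((in[i] & 0x1Fu) << 6) | (in[i+1] & 0x3Fu);
            i += 2;                                        /* 2 bytes */
        } else if ((in[i] & 0xF0) == 0xE0 && i + 2 < in_len) {
            cp = ((in[i] & 0x0Fu) << 12) | ((in[i+1] & 0x3Fu) << 6)
                 | (in[i+2] & 0x3Fu);
            i += 3;                                        /* 3 bytes */
        } else {
            cp = 0xFFFD;                       /* malformed: skip one */
            i += 1;
        }
        out[o++] = (cp <= 0xFF) ? (unsigned char)cp : '?';
    }
    return o;
}
```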

Pros and Cons, most of which I'm probably missing:
1) It uses a consistent internal representation, which (in theory) should be
able to represent anything that any local charset does.
2) It allows each piece of the system to use a charset unrelated from any
of the others.
3) It converts too much...
4) It's overcomplex...

The way I see it, the trans_string function should look like this...
struct tstr {
	int charset;
	char *data;
	long len;	/* In bytes, not characters... to avoid overflow */
};		/* (yes, this does look a lot like a qstr) */

tstr *trans_string(tstr *input, int out_charset)

trans_string would return *{out_charset, different *data, possibly different
length} (on error, it would be *{CHARSET_ERROR, *error_text (in UTF-8),
len}). (Yes, I know that isn't proper C, you know what I meant, no?)

It would simply:
1) Check if input->charset == out_charset, and if so, return *input.
2) Pass input to input->charset's charset-to-unicode function.
3) Pass that through out_charset's unicode-to-charset function.
4) return a pointer to the new tstr.
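(The four steps above, as a compilable userspace sketch. The CHARSET_*
constants and the copy_as() helper are stand-ins of mine; a real version
would call the registered per-charset table functions in steps 2 and 3
instead of the identity copies used here:)

```c
#include <stdlib.h>
#include <string.h>

enum { CHARSET_UTF8, CHARSET_LATIN1, CHARSET_ERROR };

struct tstr {
    int charset;
    char *data;
    long len;       /* in bytes, not characters... to avoid overflow */
};

/* Stand-in conversion: a real table would remap byte by byte.  For
 * 7-bit ASCII data, latin1 <-> UTF-8 happens to be the identity. */
static struct tstr *copy_as(const struct tstr *in, int charset)
{
    struct tstr *out = malloc(sizeof(*out));
    out->charset = charset;
    out->len = in->len;
    out->data = malloc((size_t)in->len);
    memcpy(out->data, in->data, (size_t)in->len);
    return out;
}

struct tstr *trans_string(struct tstr *input, int out_charset)
{
    struct tstr *uni, *out;

    /* 1) Already in the right charset: return the input unchanged. */
    if (input->charset == out_charset)
        return input;
    /* 2) input->charset's charset-to-unicode function... */
    uni = copy_as(input, CHARSET_UTF8);
    /* 3) ...then out_charset's unicode-to-charset function. */
    out = copy_as(uni, out_charset);
    free(uni->data);
    free(uni);
    /* 4) Return a pointer to the new tstr. */
    return out;
}
```

Note that step 1 is what keeps the common case (everybody speaking UTF-8)
from costing anything.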

I think that that's it...

-=- James Mastros