Re: UTF-8 and case-insensitivity

From: Robin Rosenberg
Date: Wed Feb 18 2004 - 07:33:08 EST


On Wednesday 18 February 2004 12.43, tridge@xxxxxxxxx wrote:
> Robin,
> > I've read it also:
> > http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
> > "The fundamental representation of text in Windows NT-based
> > operating systems is UTF-16"

I believe (please correct me if this is wrong) that Windows never actually
supported any of the UCS-2 code that were in conflict with UTF-16. The cost
of this operation was that some of the "private" code blocks of unicode 2.0, i.e.
U+D800..U+DFFF were redefined as "surrogates" in Unicode 3.0 making the
UTF-16 encoding more or less backwards compatible with UCS-2. And it's
UTF-16LE and UCS-2LE, but I suspect you knew that :-)

> yep, in this thread I've been mistakenly using the term UCS-16 when I
> should have said UTF-16 (ie. the variable length, 2 byte encoding).
>
> Samba currently treats the bytes on the wire from windows as UCS-2 (a
> 2 byte fixed width encoding), whereas perhaps it should be treating
> them as UTF-16. I should write a smbtorture test to detect the
> difference and see what different versions of windows actually use.
See above, and most importantly the definition in Amendment 1 of the unicode
3.0 standard.

> luckily the new charset handling stuff in samba3 and samba4 will make
> this easy to fix :-)
Happy man!

-- robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/