RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

From: Kohada.Tetsuhiro@xxxxxxxxxxxxxxxxxxxxxxxxxxx
Date: Tue Apr 14 2020 - 05:32:33 EST


> We do not know how code points above U+FFFF could be converted to upper case.

Code points above U+FFFF do not need to be converted to uppercase.

> Basically from exfat specification can be deduced it only for
> U+0000 .. U+FFFF code points.

The exFAT specification (sec. 7.2.5.1) says ...
-- table shall cover the complete Unicode character range (from character codes 0000h to FFFFh inclusive).

UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
It just says "Unicode".


> Second problem is that all MS filesystems (vfat, ntfs and exfat) do not use UCS-2 nor UTF-16, but rather some mix between
> it. Basically any sequence of 16bit values (except those :/<>... vfat chars) is valid, even unpaired surrogate half. So
> surrogate pair (two 16bit values) represents one unicode code point (as in UTF-16), but one unpaired surrogate half is
> also valid and represent (invalid) unicode code point of its value. In unicode are not defined code points for values
> of single / half surrogate.

Microsoft's file systems use the UTF-16 encoded UCS-4 code set.
The character type is basically 'wchar_t' (16-bit).
The description "0000h to FFFFh" also assumes the use of 'wchar_t'.

This "0000h to FFFFh" range also includes surrogate characters (U+D800 to U+DFFF),
but these should not be converted to upper case.
Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows just returns the same value.
(* RtlUpcaseUnicodeChar() is one of the Windows native APIs)

If the upcase-table contains surrogate characters, exfat_toupper() will convert them incorrectly.
With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may then differ.

A normal exfat upcase-table does not contain surrogate characters, so the problem does not occur in practice.
To be stricter, D800h to DFFFh should be excluded either when loading the upcase-table or in exfat_toupper().
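
To illustrate the second option, here is a minimal userspace sketch of an upcase lookup that skips surrogates.
It is not the driver code: struct upcase_table, is_surrogate() and toupper_no_surrogate() are hypothetical names,
and I assume a table where an entry of 0 means "no mapping".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a loaded upcase-table (not the driver's structure). */
struct upcase_table {
	const uint16_t *map;	/* map[c] is the upper-case form of c, 0 if unmapped (assumption) */
	size_t len;		/* number of entries in map */
};

/* True for any UTF-16 surrogate code unit, high or low. */
static int is_surrogate(uint16_t c)
{
	return c >= 0xD800 && c <= 0xDFFF;
}

/* Upper-case one 16-bit code unit; surrogates are returned unchanged. */
static uint16_t toupper_no_surrogate(const struct upcase_table *t, uint16_t c)
{
	if (is_surrogate(c))
		return c;	/* never consult the table for D800h..DFFFh */
	if (c < t->len && t->map[c])
		return t->map[c];
	return c;
}
```

With this check, a table that (wrongly) carries an entry for D800h still leaves that value as it is,
which is the same result the text above describes for RtlUpcaseUnicodeChar().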

> Therefore if we talk about encoding UTF-16 vs UTF-32 we first need to fix a way how to handle those non-representative
> values in VFS encoding (iocharset=) as UTF-8 is not able to represent it too. One option is to extend UTF-8 to WTF-8
> encoding [1] (yes, this is a real and make sense!) and then ideally change exfat_toupper() to UTF-32 without restriction
> for surrogate pairs values.

WTF-8 is new to me.
That's an interesting idea, but is it needed for exfat?

Characters above U+FFFF appear as:
 - in UTF-32, a value of 0x10000 or more
 - in UTF-16, values from 0xd800 to 0xdfff
I think these should simply not be converted to uppercase.

If the File Name Directory Entry contains illegal surrogate characters (such as an unpaired surrogate half),
they are simply ignored by utf16s_to_utf8s().
The string after UTF-8 conversion does not include any illegal byte sequences.
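
As a rough illustration of that behaviour, here is a small userspace sketch of a UTF-16 to UTF-8 conversion
that drops unpaired surrogates. It is not the kernel's utf16s_to_utf8s(); put_utf8() and
utf16_to_utf8_drop_unpaired() are hypothetical names, and the input is assumed to be in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (<= 0x10FFFF, not a surrogate) as UTF-8; returns bytes written. */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Convert n UTF-16 code units to UTF-8, skipping unpaired surrogates.
 * out must have room for 3 * n bytes. Returns the number of bytes written.
 */
static size_t utf16_to_utf8_drop_unpaired(const uint16_t *in, size_t n,
					  unsigned char *out)
{
	size_t i = 0, o = 0;

	while (i < n) {
		uint32_t u = in[i++];

		if (u >= 0xD800 && u <= 0xDBFF) {
			/* High surrogate: only valid when a low surrogate follows. */
			if (i < n && in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
				u = 0x10000 + ((u - 0xD800) << 10) + (in[i] - 0xDC00);
				i++;
			} else {
				continue;	/* unpaired high surrogate: drop it */
			}
		} else if (u >= 0xDC00 && u <= 0xDFFF) {
			continue;		/* lone low surrogate: drop it */
		}
		o += put_utf8(u, out + o);
	}
	return o;
}
```

For example, the input { 'a', 0xD800, 'b' } comes out as the two bytes "ab", so the output never
contains an encoded surrogate half.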


> Btw, same problem with UTF-16 also in vfat, ntfs and also in iso/joliet kernel drivers.

Ugh...


BR
---
Kohada Tetsuhiro <Kohada.Tetsuhiro@xxxxxxxxxxxxxxxxxxxxxxxxxxx>