Bugs around NLS Codepage 932 (Japanese)

From: Yoshi KAJIKI (kajiki@3dpro.tao.go.jp)
Date: Sat Jun 24 2000 - 14:11:02 EST


Hi all,

I found bugs in nls_cp932 (Japanese) in the current 2.2.x kernels.

------------------------
The first problem.

1. Problem
    All files whose filenames use the 'JIS X 0201 Kana' character set
are saved with incorrect filenames on a vfat filesystem.

2. Relevant releases / people
    From kernel 2.2.16 to 2.2.17pre6 (current)
    Mainly affects Japanese users.

3. Abstract and patch
    The nls_cp932 module has three conversion algorithms, and one of
them has a minor mistake: the byte order is reversed. Therefore, no
character handled by that algorithm is converted correctly. To solve this
problem, please correct 2 or 4 lines as follows.

/******** For 2.2.16 (2 lines): ********/
--- nls_cp932.c Thu Jun 8 06:26:43 2000
+++ nls_cp932.c.new Fri Jun 23 19:49:48 2000
@@ -9772,8 +9772,8 @@
        ch = rawstring[0];
        cl = rawstring[1];
        if (0xA0 < ch && ch < 0xE0){
- *uni1 = 0xFF;
- *uni2 = ch - 0x40;
+ *uni1 = ch - 0x40;
+ *uni2 = 0xFF;
                *offset = 1;
                return;
        }

/******** For 2.2.17pre6 (4 lines): ********/
--- nls_cp932_euc_jp.c Thu Jun 22 11:51:00 2000
+++ nls_cp932_euc_jp.c.new Fri Jun 23 01:33:10 2000
@@ -9841,8 +9841,8 @@
         ch = rawstring[0];
         cl = rawstring[1];
         if (0xA1 <= ch && ch <= 0xDF){
- *uni1 = 0xFF;
- *uni2 = ch - 0x40;
+ *uni1 = ch - 0x40;
+ *uni2 = 0xFF;
                 *offset = 1;
                 return;
         }
@@ -9921,8 +9921,8 @@
 
         if (rawstring[0] == 0x8E) {
                 if (0xA1 <= rawstring[1] && rawstring[1] <= 0xDF){
- *uni1 = 0xFF;
- *uni2 = rawstring[1] - 0x40;
+ *uni1 = rawstring[1] - 0x40;
+ *uni2 = 0xFF;
                         *offset = 2;
                         return;
                 }

4. Detail
    I found these bugs by saving files with 'JIS X 0201 Kana' names on a
vfat filesystem from Japanese MS-Windows, and I confirmed the effect of my
patch the same way. But I suppose many of you don't have Japanese
MS-Windows, so you may not be able to reproduce the bugs or the effect of
the patch. Therefore, I have to write a long description of the encoding
scheme and the algorithm.

    The codepage 932 module, 'fs/nls/nls_cp932.c', converts Japanese
character sets from the Shift_JIS encoding scheme into the Unicode UCS-2
encoding scheme. A conversion table is used for the Unicode conversion,
and the table can be downloaded from the following sites.

    ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT
    http://www.microsoft.com/typography/unicode/932.txt

    Codepage 932 includes two character sets, 'JIS X 0208' and 'JIS X 0201',
and they must be converted by different algorithms. Moreover, one of them,
'JIS X 0201', is divided into two groups, called 'Roman' and 'Kana', which
must also be converted by different algorithms. Therefore, three algorithms
are used in the nls_cp932 module.

    To convert a character from Shift_JIS to UCS-2, the following function
is used.

 1: static void char2uni(unsigned char *rawstring, int *offset,
 2:                      unsigned char *uni1, unsigned char *uni2)
 3: {
 4:         unsigned char ch, cl;
 5:         struct nls_unicode *charset2uni;
 6:
 7:         ch = rawstring[0];
 8:         cl = rawstring[1];
 9:         if (0xA1 <= ch && ch <= 0xDF){
10:                 *uni1 = 0xFF;
11:                 *uni2 = ch - 0x40;
12:                 *offset = 1;
13:                 return;
14:         }
15:         charset2uni = page_charset2uni[ch];
16:         if (charset2uni && cl){
17:                 *uni1 = charset2uni[cl].uni2;
18:                 *uni2 = charset2uni[cl].uni1;
19:                 *offset = 2;
(snip)

A: Correctly converted case: 'JIS X 0208'
    According to the Unicode table, the 'Katakana A' character of the
JIS X 0208 character set is encoded as 0x8341 in Shift_JIS and as 0x30A2
in UCS-2.

    When the function is called, 'ch' and 'cl' are set to ch=0x83, cl=0x41
for 0x8341. In this case the 'if' statement at line #9 is FALSE.
At line #15, page_charset2uni is a table of pointers to tables. Here,

  page_charset2uni[0x83] = c2u_83
  c2u_83[0x41] = {0x30, 0xA2}

Therefore,

  *uni1=c2u_83[0x41].uni2=0xA2
  *uni2=c2u_83[0x41].uni1=0x30

Read in the order '*uni2, *uni1', this gives 0x30A2 in UCS-2, so the byte
order is correct.
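
The lookup in case A can be sketched in user space as follows. This is only
a minimal sketch, not the kernel source: the table holds just the single
entry quoted above, and the names sketch_unicode, sketch_c2u_83, sketch_page
and sketch_lookup are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct nls_unicode. Following the
 * {0x30, 0xA2} entry quoted above, uni1 holds the high byte and uni2 the
 * low byte inside the table. */
struct sketch_unicode {
    unsigned char uni1;
    unsigned char uni2;
};

/* Second-level table for lead byte 0x83; only 'Katakana A' is filled in. */
static struct sketch_unicode sketch_c2u_83[256] = {
    [0x41] = {0x30, 0xA2},
};

/* First-level table of pointers, indexed by the Shift_JIS lead byte. */
static struct sketch_unicode *sketch_page[256] = {
    [0x83] = sketch_c2u_83,
};

/* Returns the UCS-2 value, or 0 when the pair is not in this tiny table. */
static unsigned int sketch_lookup(unsigned char ch, unsigned char cl)
{
    struct sketch_unicode *row = sketch_page[ch];
    unsigned char out1, out2;

    if (row == NULL || cl == 0)
        return 0;
    out1 = row[cl].uni2;    /* low byte, as at line #17 */
    out2 = row[cl].uni1;    /* high byte, as at line #18 */
    return ((unsigned int)out2 << 8) | out1;
}
```

With this table, sketch_lookup(0x83, 0x41) yields 0x30A2, which matches the
value from the Unicode table.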

B: Incorrectly converted case: 'JIS X 0201 Kana'
    According to the Unicode table, 'Half Width Katakana A' of JIS X 0201
is encoded as 0xB1 in Shift_JIS and as 0xFF71 in UCS-2.

    When the function is called, ch=0xB1. In this case the 'if' statement
at line #9 is TRUE. Here,

  *uni1 = 0xFF
  *uni2 = ch - 0x40 = 0x71

Read in the order '*uni2, *uni1', this code gives 0x71FF, but 'Half
Width Katakana A' is 0xFF71 in UCS-2. In other words, the byte order is
reversed.
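
For comparison, the patched branch can be sketched as a standalone function.
This is a minimal sketch under the same convention as char2uni (uni1 is the
low byte, uni2 the high byte); the names sketch_kana2uni and sketch_ucs2 are
hypothetical and do not exist in the kernel.

```c
/* Hypothetical helper mirroring the patched branch: returns 1 and fills
 * the two output bytes if ch is a JIS X 0201 Kana byte, else returns 0. */
static int sketch_kana2uni(unsigned char ch,
                           unsigned char *uni1, unsigned char *uni2)
{
    if (0xA1 <= ch && ch <= 0xDF) {
        *uni1 = ch - 0x40;  /* low byte: 0xB1 becomes 0x71 */
        *uni2 = 0xFF;       /* high byte of the half-width kana block */
        return 1;
    }
    return 0;
}

/* Combine the bytes the way the caller reads them: uni2 high, uni1 low. */
static unsigned int sketch_ucs2(unsigned char uni1, unsigned char uni2)
{
    return ((unsigned int)uni2 << 8) | uni1;
}
```

For ch=0xB1 this gives uni1=0x71 and uni2=0xFF, that is 0xFF71, which agrees
with the Unicode table.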

    Moreover, kernel 2.2.17pre6 can also handle the EUC-JP encoding
scheme, and the UCS-2 byte order is reversed there as well. For example,
0xB1 in Shift_JIS is encoded as 0x8EB1 in EUC-JP and as 0xFF71 in UCS-2,
but 2.2.17pre6 converts EUC-JP 0x8EB1 to UCS-2 0x71FF.
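
The EUC-JP case with the corrected byte order can be sketched the same way.
This is an illustrative sketch only; sketch_euc_kana2uni is a hypothetical
name, and the real code lives inside a larger conversion function.

```c
/* Hypothetical helper: decodes an EUC-JP half-width kana sequence
 * (0x8E followed by a byte in 0xA1..0xDF). Returns the number of input
 * bytes consumed (2), or 0 if raw does not start with such a sequence. */
static int sketch_euc_kana2uni(const unsigned char *raw,
                               unsigned char *uni1, unsigned char *uni2)
{
    if (raw[0] == 0x8E && 0xA1 <= raw[1] && raw[1] <= 0xDF) {
        *uni1 = raw[1] - 0x40;  /* low byte */
        *uni2 = 0xFF;           /* high byte */
        return 2;
    }
    return 0;
}
```

For the input bytes 0x8E 0xB1 this gives uni2=0xFF and uni1=0x71, that is
0xFF71 instead of 0x71FF.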

---------------------------

Yet another problem around NLS cp932 on 2.2.17pre6.

1. Problem
    Module 'nls_cp932_euc_jp' is not installed automatically by the kernel.

2. Relevant releases
    kernel 2.2.17pre6 (current)

3. Description
    Normally, the 'nls_cp???' module is installed (loaded) automatically
when the filesystem is mounted. The 'codepage=???' option of the 'mount'
command determines the name of the module. For cp932 on 2.2.17pre6, the
module has been renamed from 'nls_cp932' to 'nls_cp932_euc_jp'. There is
no longer an 'nls_cp932' module, so the automatic installation is not
carried out.
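
The name mismatch can be illustrated with a small user-space sketch. It
assumes the requested module name is simply 'nls_cp' followed by the
codepage number, as the 'nls_cp???' pattern suggests; the function
sketch_nls_module_name is hypothetical and is not the kernel's own code.

```c
#include <stdio.h>

/* Hypothetical helper: builds the module name that an automatic load
 * would ask for, assuming the form "nls_cp" plus the codepage number. */
static int sketch_nls_module_name(char *buf, size_t len, int codepage)
{
    return snprintf(buf, len, "nls_cp%d", codepage);
}
```

For codepage=932 the sketch produces 'nls_cp932', which is not the name
'nls_cp932_euc_jp' that the module carries in 2.2.17pre6.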

4. Solution
    Please rename the module back to 'nls_cp932'.

That's all.

Thanks for your attention.

-- 
Yoshihiro Kajiki             <kajiki@kajiki.com>
Yokohama Linux Users Group   <kajiki@ylug.org>
Penguin-Club                 <kajiki@penguin-club.org>



