Re: Unicode conversion issue

From: Linus Torvalds
Date: Wed Dec 11 2024 - 15:18:53 EST


On Wed, 11 Dec 2024 at 11:58, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> The problem is that all the filesystems basically do some variation of
>
>     if (IS_CASEFOLDED(dir) ..) {
>
>         len = utf8_casefold(sb->s_encoding, orig_name,
>                                 new_name, MAXLEN);
>
> and then they use that "new_name" for both hashing and for comparisons.

Oh, actually, f2fs does pass in the original name to
generic_ci_match(), so I think this is solvable.

The solution involves just telling f2fs to ignore the hash if it has
seen odd characters.

So I think f2fs could actually do something like this:

--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -67,6 +67,7 @@ int f2fs_init_casefolded_name(const struct inode *dir,
 			/* fall back to treating name as opaque byte sequence */
 			return 0;
 		}
+		fname->ignore_hash = utf8_oddname(fname->usr_fname);
 		fname->cf_name.name = buf;
 		fname->cf_name.len = len;
 	}
@@ -231,7 +232,7 @@ struct f2fs_dir_entry *f2fs_find_target_dentry(const struct f2fs_dentry_ptr *d,
 			continue;
 		}

-		if (de->hash_code == fname->hash) {
+		if (fname->ignore_hash || de->hash_code == fname->hash) {
 			res = f2fs_match_name(d->inode, fname,
 					      d->filename[bit_pos],
 					      le16_to_cpu(de->name_len));
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -521,6 +521,7 @@ struct f2fs_filename {

 	/* The dirhash of this filename */
 	f2fs_hash_t hash;
+	bool ignore_hash;

 #ifdef CONFIG_FS_ENCRYPTION
 	/*

where that "utf8_oddname()" is the one that goes "this filename
contains unhashable characters".
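To make the effect of the second hunk concrete, here is a minimal userspace sketch of a lookup loop that skips the hash pre-filter when ignore_hash is set. Everything in it is hypothetical and for illustration only: struct entry_sketch, struct lookup_sketch, names_match() and find_entry() are not f2fs names, and the plain strcmp() stands in for the real casefolded comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an on-disk directory entry. */
struct entry_sketch {
	uint32_t hash;		/* hash stored when the entry was created */
	const char *name;
};

/* Hypothetical stand-in for the name being looked up. */
struct lookup_sketch {
	uint32_t hash;		/* hash computed from the lookup name */
	bool ignore_hash;	/* set when the name has "odd" characters */
	const char *name;
};

/* ASSUMPTION: the real comparison is casefold-aware; this one is not. */
static bool names_match(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}

/*
 * Linear scan over the entries. The hash is only a fast pre-filter,
 * so when it cannot be trusted we fall through to the full name
 * comparison for every entry instead.
 */
const struct entry_sketch *find_entry(const struct entry_sketch *ents,
				      size_t n,
				      const struct lookup_sketch *key)
{
	for (size_t i = 0; i < n; i++) {
		if (!key->ignore_hash && ents[i].hash != key->hash)
			continue;
		if (names_match(ents[i].name, key->name))
			return &ents[i];
	}
	return NULL;
}
```

With ignore_hash clear, an entry whose stored hash differs from the computed one is never compared by name, so it cannot be found. With ignore_hash set, the same entry is found at the cost of comparing every name in the scanned range.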

I didn't look very closely at what ext4 does, but it seems to already
have a pattern for "don't even look at the hash because it's not
reliable", so I think ext4 can do something similar.

So then all you actually need is that utf8_oddname() that recognizes
those ignored code-points.
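As a rough illustration of what such a helper might look like, here is a userspace sketch that scans a UTF-8 name for a few default-ignorable code points. It is written under loud assumptions: utf8_oddname_sketch(), is_ignorable() and decode_utf8() are hypothetical names, the code-point list is a small hand-picked subset rather than the full Unicode property, and a real kernel helper would presumably use the existing UTF-8 tables and take a struct qstr.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: illustrative subset of the Unicode
 * Default_Ignorable_Code_Point property, not the full set.
 */
static bool is_ignorable(uint32_t cp)
{
	return cp == 0x00AD ||				/* soft hyphen */
	       (cp >= 0x200B && cp <= 0x200F) ||	/* ZWSP .. RLM */
	       (cp >= 0xFE00 && cp <= 0xFE0F) ||	/* variation selectors */
	       cp == 0xFEFF;				/* ZWNBSP / BOM */
}

/*
 * Decode one UTF-8 sequence of at most len bytes starting at s.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Simplified: overlong encodings and surrogates are not rejected.
 */
static size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
	size_t need, i;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		*cp = s[0] & 0x1F;
		need = 1;
	} else if ((s[0] & 0xF0) == 0xE0) {
		*cp = s[0] & 0x0F;
		need = 2;
	} else if ((s[0] & 0xF8) == 0xF0) {
		*cp = s[0] & 0x07;
		need = 3;
	} else {
		return 0;
	}
	if (len < need + 1)
		return 0;
	for (i = 1; i <= need; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		*cp = (*cp << 6) | (s[i] & 0x3F);
	}
	return need + 1;
}

/* Returns true if the name contains any code point from the list above. */
bool utf8_oddname_sketch(const char *name, size_t len)
{
	const unsigned char *s = (const unsigned char *)name;

	while (len) {
		uint32_t cp;
		size_t n = decode_utf8(s, len, &cp);

		/* ASSUMPTION: malformed input is treated as odd, to be safe. */
		if (n == 0 || is_ignorable(cp))
			return true;
		s += n;
		len -= n;
	}
	return false;
}
```

Erring on the side of returning true is harmless for correctness in this scheme, since ignoring the hash only costs extra name comparisons.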

So I take it all back: option (1) actually doesn't look that bad, and
would make reverting commit 5c26d2f1d3f5 ("unicode: Don't special case
ignorable code points") unnecessary.

Linus