Re: UTF-8 and case-insensitivity
From: Neil Brown
Date: Tue Feb 17 2004 - 19:15:21 EST
On Wednesday February 18, tridge@xxxxxxxxx wrote:
>
> Samba (and any other system that wants case-insensitive semantics on
> Linux) can't make do with "oh, it probably doesn't exist". That way
> leads to data loss. You have to know with 100% certainty that the file
> doesn't exist in any case combination.
>
> Unfortunately, that is also the hardest thing to do.
Hi Tridge,
Maybe if it is so hard, we should just define it to be easy.... just
change the universe a bit.....
I'm sure you've thought about this a lot more than I have or will, so
I must be missing something, but there seems to be a solution that is
efficient, predictable, and should be acceptable.
The first observation is that POSIX applications and WIN32 applications
cannot both get exactly the file system semantics they expect in the
same directory. The example:
POSIX:
create "Makefile"
create "makefile"
WIN32:
unlink "MakeFile"
seems to show that.
So decide up front that a WIN32 application will see something
different, and decide what the best thing for it to see would be
(i.e. change the universe).
First cut:
An application that wants case-insensitive filenames only
sees those filenames that are in a case-insensitive-canonical-form.
So the interface maps all file names in requests to a canonical
form, and the readdir equivalent discards all non-canonical names.
Thus in the above example, the WIN32 app would unlink "makefile"
and never notice that "Makefile" exists.
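
A rough sketch of the first cut, as a user-space wrapper in Python. The helper names (canonical, ci_listdir, ci_unlink) are made up for illustration, and treating Unicode case folding as the canonical form is an assumption; the proposal does not depend on which canonical form is chosen.

```python
import os

def canonical(name):
    # Assumption: the canonical form is Unicode case folding.
    return name.casefold()

def ci_listdir(dirpath):
    # readdir equivalent: discard every name that is not already canonical.
    return sorted(n for n in os.listdir(dirpath) if n == canonical(n))

def ci_unlink(dirpath, name):
    # Map the requested name to its canonical form before touching the disk.
    os.unlink(os.path.join(dirpath, canonical(name)))
```

With both "Makefile" and "makefile" present, ci_unlink(d, "MakeFile") would remove "makefile" and leave "Makefile" alone, and ci_listdir(d) would never have listed "Makefile" in the first place.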
This has (to me) two problems.
1/ case gets lost, so if I save "My File", I will find "my file"
has been created (unless the application pretty-cases things, in
which case I can expect case to change anyway).
2/ Files created by posix apps might be invisible.
To answer 2/, I'd say "tough". If you want posix files to be
visible to WIN32 apps, choose appropriate names. However I would
allow there to be a process, either once-off or periodic, which
creates symlinks from canonical names to non-canonical filenames.
This would allow you to access pre-existing files where there was
no ambiguity.
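
The once-off or periodic process could look something like the following sketch. The function name link_unambiguous is hypothetical, and case folding again stands in for whatever canonical form is chosen. It only adds a symlink when exactly one name in the directory maps to a given canonical form, so ambiguous names are left untouched.

```python
import os
from collections import defaultdict

def link_unambiguous(dirpath, canonical=str.casefold):
    # Group every name in the directory by its canonical form.
    groups = defaultdict(list)
    for name in os.listdir(dirpath):
        groups[canonical(name)].append(name)
    created = []
    for canon, names in groups.items():
        # Link only when a single, non-canonical name maps to this form;
        # in that case the canonical name cannot already exist.
        if len(names) == 1 and names[0] != canon:
            os.symlink(names[0], os.path.join(dirpath, canon))
            created.append(canon)
    return sorted(created)
```

A directory holding "README", "Foo" and "FOO" would gain a "readme" symlink and nothing else, because "foo" is ambiguous.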
To answer 1/ I would suggest a second cut at the problem...
Second cut:
As above, but readdir tries to be clever. If it sees two (or
more) names which have the same canonical form, it chooses just
one of them (predictably), preferring a non-canonical name which is
a symlink to the canonical name.
Then when creating an object, you create it with the canonical
name and (if that succeeds) subsequently create a symlink from the
requested name to the canonical name (if that is possible, don't
worry if it isn't).
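
A sketch of the second cut, under the same assumptions as before (hypothetical helper names, case folding as the canonical form). "Predictably" is taken here to mean the lexically smallest qualifying symlink, which is only one possible rule.

```python
import os
from collections import defaultdict

def canonical(name):
    # Assumption: the canonical form is Unicode case folding.
    return name.casefold()

def ci_create(dirpath, name):
    canon = canonical(name)
    # Create the object under its canonical name; fail if it already exists.
    fd = os.open(os.path.join(dirpath, canon),
                 os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    if name != canon:
        try:
            # Remember the requested case as a symlink to the canonical name.
            os.symlink(canon, os.path.join(dirpath, name))
        except OSError:
            pass  # don't worry if it isn't possible

def _links_to(dirpath, name, target):
    path = os.path.join(dirpath, name)
    return os.path.islink(path) and os.readlink(path) == target

def ci_listdir(dirpath):
    # The whole directory is read before anything is returned.
    groups = defaultdict(list)
    for name in os.listdir(dirpath):
        groups[canonical(name)].append(name)
    shown = []
    for canon, names in groups.items():
        if canon not in names:
            continue  # no canonical name: invisible, as in the first cut
        pretty = sorted(n for n in names
                        if n != canon and _links_to(dirpath, n, canon))
        shown.append(pretty[0] if pretty else canon)
    return sorted(shown)
```

After ci_create(d, "My File") the directory holds "my file" plus a "My File" symlink, and ci_listdir(d) reports just "My File", so the case the user asked for is preserved.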
Given this approach:
If only case-insensitive apps use a linux filesystem, they will see
exactly the semantics they expect, with minimal performance impact.
If case-sensitive and case-insensitive apps use a linux filesystem,
they will each see a consistent view and though they may not see the
same view, there will be well-defined mechanisms which can work at a
user-space level to resolve or highlight any issues.
The biggest cost I see with this is with large directories. The
"readdir" equivalent would need to read the whole directory before it
could reliably return any of it.
However dropping the "guarantee to preserve case" semantic on really
large directories probably isn't an enormous cost (and could be
configurable).
NeilBrown