digitalmars.D - The encoding of Windows/Linux filenames

reply Arcane Jill <Arcane_member pathlink.com> writes:
To clarify a point made in another post... Suppose you have a file called
"café". This filename will be stored within the Windows filesystem as a sequence
of 16-bit words (UTF-16 code units), each representing a Unicode character,
which in this case will be the sequence { 0x0063, 0x0061, 0x0066, 0x00E9 }.
(This is not, of course, true of old DOS 8.3 filenames, but we'll ignore them.)

The D char[] string, "café", on the other hand, will be stored as the five-byte
UTF-8 sequence { 0x63, 0x61, 0x66, 0xC3, 0xA9 }.
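
To make the two representations concrete, here is a small sketch (in Python,
since it is easy to run anywhere; the helper names utf16_units and utf8_bytes
are made up for this illustration). It assumes the "é" is the precomposed
character U+00E9.

```python
# "café" with a precomposed é (U+00E9); this is an assumption of the sketch.
name = "caf\u00e9"

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def utf8_bytes(s):
    """Return the UTF-8 bytes of s as a list of integers."""
    return list(s.encode("utf-8"))

print(" ".join(f"0x{u:04X}" for u in utf16_units(name)))  # 0x0063 0x0061 0x0066 0x00E9
print(" ".join(f"0x{b:02X}" for b in utf8_bytes(name)))   # 0x63 0x61 0x66 0xC3 0xA9
```

Four 16-bit units on one side, five bytes on the other: the two forms are not
interchangeable once you leave ASCII.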

However, when you call the char version of CreateFile(), the filename string
will be interpreted as if it were encoded in the default Windows codepage
(normally WINDOWS-1252 in English-speaking countries). Under this
interpretation, the byte sequence { 0x63, 0x61, 0x66, 0xC3, 0xA9 } will be seen
as the string "cafÃ©", and so Windows will attempt to open a file of that name.
So, either it will fail, or it will open the wrong file.
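
The misreading is easy to reproduce. The sketch below (Python again;
misread_as_cp1252 is a hypothetical helper, and Windows-1252 is assumed to be
the active codepage) takes the UTF-8 bytes of the name and decodes them the way
the char version of the API would.

```python
def misread_as_cp1252(name):
    """Encode name as UTF-8, then decode those bytes as Windows-1252."""
    return name.encode("utf-8").decode("cp1252")

seen = misread_as_cp1252("caf\u00e9")

# The single é has become two characters: U+00C3 (Ã) followed by U+00A9 (©).
print(len(seen))    # 5
print(ascii(seen))  # 'caf\xc3\xa9'
```

The name Windows ends up looking for is one character longer than the name the
program meant, so it cannot match the real file.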

The fix, of course, is to pass to the wide-character version of CreateFile()
the value (std.utf.toUTF16(filename)), instead of (filename). This should not
have to be done by users - it needs to be done at the Phobos level.
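
For anyone who wants to see what that conversion amounts to, here is a rough
sketch of the same step outside D (Python; to_wide_z is a made-up name, not a
Phobos or Win32 function). It assumes the input really is valid UTF-8 and
builds the NUL-terminated little-endian UTF-16 buffer that the wide Win32 entry
point, CreateFileW, expects. The actual API call is left out so the sketch runs
on any platform.

```python
def to_wide_z(utf8_name):
    """Convert UTF-8 filename bytes to a NUL-terminated UTF-16-LE buffer.

    Raises UnicodeDecodeError if utf8_name is not valid UTF-8.
    """
    text = utf8_name.decode("utf-8")
    # Two zero bytes form the 16-bit NUL terminator.
    return text.encode("utf-16-le") + b"\x00\x00"

wide = to_wide_z(b"caf\xc3\xa9")
print(wide.hex(" "))  # 63 00 61 00 66 00 e9 00 00 00
```

Doing this once inside the library means callers keep passing ordinary char[]
strings and never see the UTF-16 form.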

The situation is more complicated on Linux, unfortunately. On Linux filenames
are stored as a sequence of bytes, not 16-bit-words. On one level that sequence
of bytes is kind of "raw" - fopen() can be passed any sequence of bytes not
containing "/" or "\0", and it will consider a filename to match only if it is
byte-for-byte identical. However, this does not really mitigate the problem,
because bytes only turn into characters - even 8-bit-wide ones - when you
interpret them according to an encoding. Thus, if you have C source code which
says fopen("café", "r"), your C compiler will still need to know what sequence
of bytes should represent these characters. By and large, it will assume the
system default encoding, called the "locale" in Linux-speak (although it has
very little to do with the ISO language-country-variant understanding of
"locale"). Some Linux users will have set their default "locale" to UTF-8.
Others won't. Getting this right will be tricky.
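
A sketch of why this matters in practice (Python; filename_bytes and
lookup_matches are hypothetical helpers, and a typical Linux filesystem that
accepts arbitrary bytes in names is assumed). The same four characters give
different byte sequences under different encodings, and a file created under
one encoding is not found when looked up under another.

```python
import os
import tempfile

def filename_bytes(name, encoding):
    """The bytes a program would hand to fopen() for name under encoding."""
    return name.encode(encoding)

def lookup_matches(stored_encoding, lookup_encoding, name="caf\u00e9"):
    """Create a file named with one encoding, then look it up with another."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        stored = os.path.join(tmp_bytes, filename_bytes(name, stored_encoding))
        with open(stored, "wb"):
            pass  # an empty file is enough for the lookup test
        wanted = os.path.join(tmp_bytes, filename_bytes(name, lookup_encoding))
        return os.path.exists(wanted)

print(filename_bytes("caf\u00e9", "utf-8"))    # b'caf\xc3\xa9'
print(filename_bytes("caf\u00e9", "latin-1"))  # b'caf\xe9'
print(lookup_matches("utf-8", "utf-8"))
print(lookup_matches("latin-1", "utf-8"))
```

On such a filesystem the first lookup succeeds and the second does not, because
the match is byte-for-byte and the two encodings disagree about the last
character.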

Unfortunately, you can't ignore this problem, unless you want to tell people
that D's File (FileStream?) class will only work for filenames containing ASCII
characters, and that's it - which is hardly a realistic option if you want D to
compete seriously with C++ and Java.

It will be easier to fix this for Windows, for the reasons given above. I think,
at least, that should happen as part of the ongoing std.stream improvements.
Someone who knows more about Linux encodings will have to help out on the Linux
fix.

Arcane Jill
Jun 26 2004
parent reply "Walter" <newshound digitalmars.com> writes:
This issue is already fixed for std.file operations under Win32, this fix
just needs to be propagated to std.stream. For linux, the file name
operations assume the linux APIs take UTF-8. I don't know how to do code
pages in linux, so this will have to wait until I figure it out <g>.
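
One possible approach, sketched in Python rather than D and not taken from
Phobos: ask the C library which character set the user's locale uses, then
re-encode the UTF-8 name into it before calling the file APIs. The helper names
locale_codeset and to_locale_bytes are made up here; nl_langinfo(CODESET) is
the POSIX call being relied on, so this is Unix-only.

```python
import locale

def locale_codeset():
    """Return the character set name of the user's locale, e.g. 'UTF-8'."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment's locale
    except locale.Error:
        pass  # unsupported locale settings; keep whatever locale is active
    return locale.nl_langinfo(locale.CODESET)

def to_locale_bytes(name, codeset):
    """Encode name for the given codeset.

    Raises UnicodeEncodeError if the codeset cannot represent the name.
    """
    return name.encode(codeset)

print(locale_codeset())
print(to_locale_bytes("caf\u00e9", "ISO-8859-1"))  # b'caf\xe9'
print(to_locale_bytes("caf\u00e9", "UTF-8"))       # b'caf\xc3\xa9'
```

Names that the locale's character set cannot represent would still need a
policy (fail, or fall back to the raw UTF-8 bytes); the sketch simply fails.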
Jun 26 2004
parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:cbkbv8$abd$2 digitaldaemon.com
| This issue is already fixed for std.file operations under Win32, this fix
| just needs to be propagated to std.stream.

... which I already did, and posted in the bugs ng.

| For linux, the file name
| operations assume the linux APIs take UTF-8. I don't know how to do code
| pages in linux, so this will have to wait until I figure it out <g>.

-----------------------
Carlos Santander Bernal
Jun 26 2004
parent "Walter" <newshound digitalmars.com> writes:
"Carlos Santander B." <carlos8294 msn.com> wrote in message
news:cbkdvf$d4a$1 digitaldaemon.com...
 "Walter" <newshound digitalmars.com> wrote in message
 news:cbkbv8$abd$2 digitaldaemon.com
 | This issue is already fixed for std.file operations under Win32, this fix
 | just needs to be propagated to std.stream.

 ... which I already did, and posted in the bugs ng.

Yes, you did.
Jun 26 2004