
digitalmars.D.bugs - non-ascii names and recls

"Carlos Santander B." <carlos8294 msn.com> writes:
(using dmd 0.96, Windows 95 and Windows XP Pro)
std.recls.File et al don't return UTF-8 strings as they should. Apparently
they return UTF-16, as shown by this:

///////////////////////////
import std.recls;
import std.stdio;
import std.utf;

void main ()
{
    Search s = new Search ( ".", "*.*", RECLS_FLAG.RECLS_F_FILES );
    foreach ( Entry e; s )
        //writefln(e.File);                      //line 9
        writefln( fix(e.File)  );                      //line 10
}

char [] fix ( char [] x )
{
    wchar []  r;
    r.length = x.length;
    for ( uint i;i<x.length;++i)
        r[i] = x[i];
    return toUTF8 ( r );
}
///////////////////////////

If I use line 9 instead of line 10, and I happen to have a file whose name
contains non-ASCII characters (like "año"), I get "Error: invalid UTF-8
sequence".

-----------------------
Carlos Santander Bernal
Jul 27 2004
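
A note on why the fix() workaround above produces readable output: widening each byte to a wchar is exactly a Latin-1 to UTF-16 conversion, because the first 256 Unicode code points coincide with ISO-8859-1. That suggests the strings are single-byte code-page text rather than UTF-16. The following is a minimal sketch in modern D syntax (not dmd 0.96); the helper name widenLatin1 is hypothetical, and the byte values assume a Latin-1 or Windows-1252 code page, where ñ is stored as 0xF1.

```d
import std.stdio : writeln;
import std.utf : toUTF8;

/// Hypothetical helper: treats every input byte as one Latin-1 character.
string widenLatin1(const(ubyte)[] bytes)
{
    auto wide = new wchar[bytes.length];
    foreach (i, b; bytes)
        wide[i] = cast(wchar) b; // byte value equals the code point for U+0000..U+00FF
    return toUTF8(wide);
}

void main()
{
    // "año" as a Latin-1 / Windows-1252 code page stores it (assumed bytes).
    immutable ubyte[] ansi = [0x61, 0xF1, 0x6F];
    string name = widenLatin1(ansi);

    writeln(name.length);          // 4: ñ takes two bytes in UTF-8
    writeln(name == "a\u00F1o");   // true
}
```

The same trick would give wrong results for code pages that are not Latin-1 compatible, and for Windows-1252 bytes in the range 0x80 to 0x9F, so it should be read as a diagnostic rather than a general conversion.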
"Matthew" <admin.hat stlsoft.dot.org> writes:
That's odd.

"año.txt" works fine with the C API of recls.

I'll now have a play with the D mapping. The problem I have is that the version
of recls in Phobos is two versions older than the one I have - don't ask me why,
I'm not quite sure why that is.


...

Hmm, I've played around a little, and it works fine with a printf().

//  writefln(e.File);                      //line 9
  printf("%.*s\n", e.File);

I'm now at the extent of my understanding wrt D's implementation of UTF. If it
prints fine with printf(), then what's wrong?

btw, (and I suspect this is important), the following code causes the compiler
to emit "invalid UTF-8 sequence":

  writefln("año.txt");

Given that, I reckon this has got nothing to do with recls.


Jul 27 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce7c89$2vcj$1 digitaldaemon.com>, Matthew says...
> That's odd.

I know nothing about recls (couldn't find the docs), and so don't know what it's
supposed to do.

If it helps, though, I can tell you that Windows filenames are stored in UCS-2
(which is forwardly compatible with UTF-16). I know that Windows doesn't do
Unicode terrifically well, but if you use their "wide" character functions
you'll get something which you can pretend is UTF-16 and everything should work
fine. (UCS-2 is basically UTF-16 restricted to the codepoint range U+0000 to
U+FFFF).

I have absolutely no idea if this is helpful or not, so I'll shut up now.

Jill
Jul 28 2004
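
The UCS-2 versus UTF-16 distinction mentioned above can be illustrated by counting 16-bit code units. This is a small sketch in modern D syntax; the helper name utf16Units is hypothetical. A code point inside U+0000 to U+FFFF occupies one unit in either encoding, while one above U+FFFF needs a UTF-16 surrogate pair and has no UCS-2 representation.

```d
import std.stdio : writeln;
import std.utf : toUTF16;

/// Hypothetical helper: number of 16-bit code units needed to store `s` as UTF-16.
size_t utf16Units(string s)
{
    return toUTF16(s).length;
}

void main()
{
    // U+00F1 (ñ) lies in the range UCS-2 covers: one code unit.
    writeln(utf16Units("\u00F1"));      // 1
    // U+1D11E lies above U+FFFF: UTF-16 encodes it as a surrogate pair.
    writeln(utf16Units("\U0001D11E"));  // 2
}
```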
"Walter" <newshound digitalmars.com> writes:
"Matthew" <admin.hat stlsoft.dot.org> wrote in message
news:ce7c89$2vcj$1 digitaldaemon.com...
> That's odd.
>
> "año.txt" works fine with the C API of recls.
>
> I'll now have a play with the D mapping. The problem I have is that the
> version of recls in Phobos is two versions older than the one I have - don't
> ask me why, I'm not quite sure why that is.
> ...
>
> Hmm, I've played around a little, and it works fine with a printf().
>
> //  writefln(e.File);                      //line 9
>   printf("%.*s\n", e.File);
>
> I'm now at the extent of my understanding wrt D's implementation of UTF.
> If it prints fine with printf(), then what's wrong?
>
> btw, (and I suspect this is important), the following code causes the
> compiler to emit "invalid UTF-8 sequence"
>   writefln("año.txt");
>
> Given that, I reckon this has got nothing to do with recls.

The problem here is that the Windows "A" functions do not deal in UTF-8; they
deal with characters based on whatever the current code page is. printf() knows
nothing about Unicode, it just spits characters back out using the C runtime
library, which uses the "A" functions. So, reading using "A" functions then
writing using "A" functions appears to work fine. The trouble comes when
interacting with Phobos, which expects strings to be in UTF format.

The solution is to use the "W" API functions whenever possible, which will get
you UTF-16. Of course, Win9x doesn't support many "W" functions. The solution
there is to use the "A" functions, then use MultiByteToWideChar() to convert
the result to UTF-16 using the current code page. UTF-16 strings can then be
converted to char[] using std.utf.toUTF8().

It sounds more complicated than it is. For an example of how to do it, see
std.file.listdir().

(It works in C because C knows nothing about Unicode or UTF-8; it just reads
byte strings from Win32 using the "A" functions and spits them back out using
the "A" functions.)
Jul 28 2004
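
The "A" function, then MultiByteToWideChar(), then std.utf.toUTF8() route described above could look roughly like this. It is a sketch in modern D syntax, not the code from std.file.listdir(); the helper name ansiToUTF8 is hypothetical, and the module path core.sys.windows.windows is the modern druntime location of the Win32 bindings (an assumption relative to dmd 0.96). It only does anything on Windows.

```d
import std.stdio : writeln;

version (Windows)
{
    import core.sys.windows.windows : MultiByteToWideChar, CP_ACP;
    import std.utf : toUTF8;

    /// Hypothetical helper: converts bytes in the current ANSI code page to UTF-8.
    string ansiToUTF8(const(ubyte)[] ansi)
    {
        if (ansi.length == 0)
            return "";
        auto src = cast(const(char)*) ansi.ptr;
        immutable len = cast(int) ansi.length;

        // First call: ask how many UTF-16 code units the result needs.
        immutable needed = MultiByteToWideChar(CP_ACP, 0, src, len, null, 0);
        if (needed <= 0)
            throw new Exception("MultiByteToWideChar failed");

        // Second call: convert into a buffer of that size.
        auto wide = new wchar[needed];
        immutable written = MultiByteToWideChar(CP_ACP, 0, src, len, wide.ptr, needed);
        if (written <= 0)
            throw new Exception("MultiByteToWideChar failed");

        // UTF-16 to UTF-8, which is what Phobos expects in char[] strings.
        return toUTF8(wide[0 .. written]);
    }
}

void main()
{
    version (Windows)
    {
        // Plain ASCII bytes, so the result does not depend on the active code page.
        writeln(ansiToUTF8([0x61, 0x62, 0x63]));
    }
}
```

Where the "W" functions are available, calling them directly and passing the result to toUTF8() avoids the code-page step entirely.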
"Carlos Santander B." <carlos8294 msn.com> writes:
"Matthew" <admin.hat stlsoft.dot.org> wrote in message
news:ce7c89$2vcj$1 digitaldaemon.com
| That's odd.
|
| "año.txt" works fine with the C API of recls.
|
| I'll now have a play with the D mapping. The problem I have is that the
| version of recls in Phobos is two versions older than the one I have - don't
| ask me why, I'm not quite sure why that is.
|
|
| ...
|
| Hmm, I've played around a little, and it works fine with a printf().
|
| //  writefln(e.File);                      //line 9
|   printf("%.*s\n", e.File);
|
| I'm now at the extent of my understanding wrt D's implementation of UTF. If it
| prints fine with printf(), then what's wrong?
|
| btw, (and I suspect this is important), the following code causes the compiler
| to emit "invalid UTF-8 sequence"
|
|   writefln("año.txt");
|
| Given that, I reckon this has got nothing to do with recls.
|
|

Then use neither writef nor printf: use std.utf.validate to check it.

-----------------------
Carlos Santander Bernal
Jul 28 2004
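
A check along the lines suggested above, using std.utf.validate instead of any output function, might look like this. It is a sketch in modern D syntax; the wrapper name isValidUTF8 is hypothetical, and the "ANSI" byte values assume a Latin-1 or Windows-1252 code page, where ñ is the single byte 0xF1.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical wrapper: true when `bytes` is a well-formed UTF-8 sequence.
bool isValidUTF8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "año" as a single-byte code page stores it (assumed bytes).
    immutable ubyte[] ansi = [0x61, 0xF1, 0x6F];
    // The same name in UTF-8: ñ (U+00F1) becomes 0xC3 0xB1.
    immutable ubyte[] utf8 = [0x61, 0xC3, 0xB1, 0x6F];

    writeln(isValidUTF8(ansi)); // false
    writeln(isValidUTF8(utf8)); // true
}
```

Running such a check on the strings returned by std.recls would separate an encoding problem in the data from any behaviour specific to writefln or printf.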
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce7c89$2vcj$1 digitaldaemon.com>, Matthew says...
> btw, (and I suspect this is important), the following code causes the compiler
> to emit "invalid UTF-8 sequence"
>
>   writefln("año.txt");

If you were anyone else, I'd be tempted to ask if you'd remembered to save your
source file in UTF-8 before trying to compile it, but you're much too
intelligent to have made that mistake, surely?

In the changelog for D 0.96, it says: "Invalid UTF characters in string
literals now diagnosed.", so I would imagine that would be spotted by DMD now.

This is a curious one.

Arcane Jill
Jul 28 2004