digitalmars.D - utf-32 text

"Carlos Santander B." <carlos8294 msn.com> writes:
Somebody enlighten (sp?) me, please.
AFAIU, this code:

//////////////////////////////////
import std.file;
import std.utf;

void main ()
{
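    // UTF-8 source text of the program to generate (despite the variable name).
    char [] u32 = `import std.stdio; void main() { writefln("adiós"); }`;
    // Byte-order mark for UTF-32LE: FF FE 00 00.
    void [] txt = cast (void[]) "\xFF\xFE\x00\x00";
    // Write the BOM followed by the text transcoded to UTF-32 (native byte order).
    write("test32.d", txt ~ cast(void[]) toUTF32(u32));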
}

//////////////////////////////////

Should produce a valid D program:
import std.stdio; void main() { writefln("adiós"); }

In fact, DMD accepts it. Now my questions:
1. How do I edit the created file (test32.d)? I tried a number of different
editors and not even one of them could display the text correctly. Notepad shows
something like " i m p o r t ..." and that's the general case (NULL before every
letter). Why is that? SciTE thought it was UTF-16LE, and probably the rest of
them too.
2. UTF32 is always 4 bytes per character, right? Then why did the resulting
program output this? "adi??s" (2 bytes for "ó", 1 for the rest). Further testing
showed it was the output as if it was UTF8. Did I miss something in the process?
(FWIW, the original file was saved as UTF8 and UTF16-BE).
3. I tried to use the other BOM (00 00 FE FF) for testing and the results were
exactly the same. Do BE or LE matter at all? However I could do this:
"\u0000\uFEFF", but not this: "\uFFFE\u0000" ("invalid UTF character
\U0000fffe"). Why is that? Is that the correct way to use \u?
4. If I save a file as, say, UTF8 and then assign a string literal to a dchar
[], does DMD convert it automatically or does it produce an invalid string?

Take it for what it is: just ignorance.

-----------------------
Carlos Santander Bernal
Sep 06 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <chja8o$20d3$1 digitaldaemon.com>, Carlos Santander B. says...
> Somebody enlighten (sp?) me, please.
> AFAIU, this code:
>
> //////////////////////////////////
> import std.file;
> import std.utf;
>
> void main ()
> {
>     char [] u32 = `import std.stdio; void main() { writefln("adiós"); }`;
>     void [] txt = cast (void[]) "\xFF\xFE\x00\x00";
>     write("test32.d", txt ~ cast(void[]) toUTF32(u32));
> }
>
> //////////////////////////////////
>
> Should produce a valid D program:

It does.

> In fact, DMD accepts it.

As it should.

> Now my questions:
> 1. How do I edit the created file (test32.d)? I tried a number of different
> editors and not even one of them could display the text correctly.

That's because most Windows text editors don't grok UTF-32. You can blame this on Microsoft. Microsoft incorrectly lists the following encodings:

* ANSI                  SHOULD BE: WINDOWS-1252 (NOT an ANSI standard)
* Unicode               SHOULD BE: UTF-16LE
* Unicode (big endian)  SHOULD BE: UTF-16BE

and most Windows text editors follow suit.

> Notepad shows
> something like " i m p o r t ..." and that's the general case (NULL before every
> letter). Why is that? SciTE thought it was UTF-16LE, and probably the rest of
> them too.

You will have to ask individual text editor vendors that. One editor which gets it /right/ is SC Unipad (www.unipad.org). Unfortunately this is hideously expensive.
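
One likely reason for the " i m p o r t" display can be sketched in a few lines of Python (used here only for illustration, not D). The UTF-32LE byte-order mark FF FE 00 00 begins with FF FE, which is also the UTF-16LE byte-order mark, so an editor that only knows UTF-16 can accept the file as UTF-16LE and then read every 4-byte unit as a letter plus a NUL. The helper name is hypothetical, and whether a particular editor sniffs the BOM this way is an assumption.

```python
import codecs

def misread_utf32le_as_utf16(text: str) -> str:
    """Encode text as UTF-32LE with a BOM, then decode it the way a
    UTF-16-only editor plausibly would (assumption: BOM sniffing)."""
    data = codecs.BOM_UTF32_LE + text.encode("utf-32-le")
    # The first two bytes, FF FE, are also a valid UTF-16LE BOM.
    assert data[:2] == codecs.BOM_UTF16_LE
    # The "utf-16" codec consumes that BOM and decodes the rest as UTF-16LE.
    return data.decode("utf-16")

shown = misread_utf32le_as_utf16("import")
print(repr(shown))
```

The trailing 00 00 of the UTF-32 BOM becomes the first NUL, which is why the NUL appears in front of every letter rather than behind it.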

> 2. UTF32 is always 4 bytes per character, right? Then why did the resulting
> program output this? "adi??s" (2 bytes for "ó", 1 for the rest). Further testing
> showed it was the output as if it was UTF8. Did I miss something in the process?
> (FWIW, the original file was saved as UTF8 and UTF16-BE).

You didn't miss anything. Blame it on the text editor.
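
The "adi??s" output is consistent with Carlos's own observation that the program printed UTF-8: the encoding of the source file does not have to match the encoding of what the compiled program writes. A rough byte count, in Python for illustration only, and assuming the console interpreted the bytes with a single-byte code page (Latin-1 is merely a stand-in for whatever code page was active):

```python
word = "adiós"

utf8_bytes = word.encode("utf-8")       # what the program apparently printed
utf32_bytes = word.encode("utf-32-le")  # what 4 bytes per character would give

# In UTF-8, "ó" takes two bytes (C3 B3); the ASCII letters take one each.
print(len(utf8_bytes), utf8_bytes.hex())
print(len(utf32_bytes))

# A console using a single-byte code page shows one glyph per byte,
# so the two bytes of "ó" turn into two unrelated characters.
as_seen = utf8_bytes.decode("latin-1")
print(len(as_seen))
```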

> 3. I tried to use the other BOM (00 00 FE FF) for testing and the results were
> exactly the same. Do BE or LE matter at all?

To Unicode, yes. To an application which doesn't understand it, no.

> However I could do this:
> "\u0000\uFEFF", but not this: "\uFFFE\u0000" ("invalid UTF character
> \U0000fffe"). Why is that? Is that the correct way to use \u?

\u is used to denote a Unicode codepoint, and nothing else. It should /not/ be used to inject bytes into a byte array. The actual bytes inserted will depend on the encoding of the character literal -- normally UTF-8 in D, although there are arguments that D should be more flexible in this regard.

The phrase "invalid UTF character" is meaningless, since there is no such thing as a "UTF character". However, U+FFFE is a noncharacter codepoint, and it is indeed invalid to find such a codepoint in a conformant Unicode string (which of course is precisely why U+FEFF was chosen as the byte-order-mark).
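
To make the codepoint-versus-bytes distinction concrete, here is a small Python sketch (Python is used only for illustration; it says nothing about how D handles literals). It shows that the single codepoint U+FEFF becomes different byte sequences under different encodings, and that reading the UTF-16 byte-order mark with the wrong byte order yields U+FFFE, which is how a reader can tell that the order is swapped.

```python
bom = "\ufeff"  # one codepoint, U+FEFF

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
bom_bytes = {name: bom.encode(name) for name in encodings}
for name, raw in bom_bytes.items():
    print(f"{name:10} {raw.hex()}")

# Decode the little-endian BOM bytes as if they were big-endian.
swapped = bom_bytes["utf-16-le"].decode("utf-16-be")
print(f"U+{ord(swapped):04X}")
```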

> 4. If I save a file as, say, UTF8 and then assign a string literal to a dchar
> [], does DMD convert it automatically or does it produce an invalid string?

Current DMD behavior is:

*) COMPILE-TIME constants are converted.
*) Values known only at RUN-TIME are not.

Again, plenty of us believe that this is not the best way for DMD to behave, and that implicit conversion should happen always, just as it does from short to int, because such conversions generate zero loss of information.

Arcane Jill
Sep 07 2004
"Carlos Santander B." <carlos8294 msn.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:chjpcj$284s$1 digitaldaemon.com
|
| ...
|
| Arcane Jill

Thanks

-----------------------
Carlos Santander Bernal
Sep 07 2004
James McComb <alan jamesmccomb.id.au> writes:
Arcane Jill wrote:

> Again, plenty of us believe that this is not the best way for DMD to behave,
> and that implicit conversion should happen always, just as it does from short
> to int, because such conversions generate zero loss of information.

That sounds like a beautiful knockdown argument:

short-->int does not lose information, so it happens implicitly.
dchar-->char does not lose information, so it should happen implicitly.

+1 for implicit conversions between char, wchar and dchar.

James McComb
Sep 08 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <chognu$1hnj$1 digitaldaemon.com>, James McComb says...

> That sounds like a beautiful knockdown argument:
>
> short-->int does not lose information, so it happens implicitly.

Yes. That's what happens now, and it's perfectly sensible.

> dchar-->char does not lose information, so it should happen implicitly.

Huh? I think you may be a little confused there. dchar-->char is not lossless. (And for that matter, char-->dchar, IMO, should either require an explicit cast or throw a UTF exception if char value >= 0x80).

> +1 for implicit conversions between char, wchar and dchar.

Lossless conversion is possible between char[], wchar[] and dchar[] - but /not/ between char, wchar and dchar. Please be aware of the difference.

Arcane Jill
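
The array-versus-element distinction can be illustrated outside D as well. In the Python sketch below, UTF-8, UTF-16 and UTF-32 byte strings stand in for char[], wchar[] and dchar[] (an analogy, not D semantics), and the helper name is hypothetical. Whole strings round-trip between the encodings without loss, but a single codepoint may need several code units, so it cannot always fit in one char or one wchar.

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Number of code units text occupies in the given BOM-less encoding."""
    return len(text.encode(encoding)) // unit_size

text = "adiós \U0001D11E"  # includes a codepoint outside the BMP

# Whole strings convert losslessly in every direction.
via_utf8 = text.encode("utf-8").decode("utf-8")
via_utf16 = text.encode("utf-16-le").decode("utf-16-le")
via_utf32 = text.encode("utf-32-le").decode("utf-32-le")
print(via_utf8 == via_utf16 == via_utf32 == text)

# Single codepoints do not always fit in a single code unit.
print(code_units("ó", "utf-8", 1))               # UTF-8 units (like char)
print(code_units("\U0001D11E", "utf-16-le", 2))  # UTF-16 units (like wchar)
print(code_units("\U0001D11E", "utf-32-le", 4))  # UTF-32 units (like dchar)
```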
Sep 08 2004