
digitalmars.D.bugs - A couple of issues with UTF

reply Georg Wrede <georg nospam.org> writes:
Haa, look ma, no hands!

So we have implicit UTF conversion.
The following compiles ok!



$ cat outtest2fixed.d
import std.stream;

void main()
{
    char[]  c = "Saatana perkele"c;
    wchar[] w = "Saatana perkele"w;

    File of1 = new File;
    File of2 = new File;

    of1.create("/tmp/1f.txt");
    of2.create("/tmp/2f.txt");

    of1.write(c);
    of1.write(c);
    of1.write(w);
    of1.write(w);

    of2.write(w);
    of2.write(w);
    of2.write(c);
    of2.write(c);

    of1.close();
    of2.close();
}
$ hexdump -C /tmp/1f.txt
00000000  0f 00 00 00 53 61 61 74  61 6e 61 20 70 65 72 6b  |....Saatana perk|
00000010  65 6c 65 0f 00 00 00 53  61 61 74 61 6e 61 20 70  |ele....Saatana p|
00000020  65 72 6b 65 6c 65 0f 00  00 00 53 00 61 00 61 00  |erkele....S.a.a.|
00000030  74 00 61 00 6e 00 61 00  20 00 70 00 65 00 72 00  |t.a.n.a. .p.e.r.|
00000040  6b 00 65 00 6c 00 65 00  0f 00 00 00 53 00 61 00  |k.e.l.e.....S.a.|
00000050  61 00 74 00 61 00 6e 00  61 00 20 00 70 00 65 00  |a.t.a.n.a. .p.e.|
00000060  72 00 6b 00 65 00 6c 00  65 00                    |r.k.e.l.e.|
$ hexdump -C /tmp/2f.txt
00000000  0f 00 00 00 53 00 61 00  61 00 74 00 61 00 6e 00  |....S.a.a.t.a.n.|
00000010  61 00 20 00 70 00 65 00  72 00 6b 00 65 00 6c 00  |a. .p.e.r.k.e.l.|
00000020  65 00 0f 00 00 00 53 00  61 00 61 00 74 00 61 00  |e.....S.a.a.t.a.|
00000030  6e 00 61 00 20 00 70 00  65 00 72 00 6b 00 65 00  |n.a. .p.e.r.k.e.|
00000040  6c 00 65 00 0f 00 00 00  53 61 61 74 61 6e 61 20  |l.e.....Saatana |
00000050  70 65 72 6b 65 6c 65 0f  00 00 00 53 61 61 74 61  |perkele....Saata|
00000060  6e 61 20 70 65 72 6b 65  6c 65                    |na perkele|
$ cat /tmp/1f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
$ cat /tmp/2f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
$



HOWEVER, I have a couple of issues here.

First of all, it looks like we don't have implicit conversion, but 
rather that the strings get copied to the output stream byte by byte!

Now, the standard says that you are not allowed to have illegal octets 
or characters in a UTF file, of any width.

Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!

Further, it seems like write puts the BOM before every string. That is 
definitely illegal.

(The operating system let me "cat" the files to screen, and tried its 
best to show them in a reasonable way (as you see above). But it really 
would not have had to.)

---

What we could have happen instead is that the first string output to 
the stream causes the stream to choose the stream's UTF width (and 
theoretically the endianness, too). (This is what the OS does when 
choosing whether to open in byte width or wider, according to Linux 
documentation.) And whenever somebody then tries to stuff "the wrong" 
kind of string there, do either of the following:

  - implicitly convert the string to the right UTF
  - throw error
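A minimal sketch of that policy, written against D1's std.stream and 
std.utf as used above. The class name UtfOutStream and the width field 
are my own invention, not an existing API; the first write fixes the 
stream's UTF width, and later writes of the other width get converted 
(the throw-an-error policy would replace the conversion branches):

```d
import std.stream;
import std.utf;

// Hypothetical wrapper: first write decides the stream's UTF width.
class UtfOutStream
{
    private File file;
    private int width = 0;      // 0 = undecided, 1 = UTF-8, 2 = UTF-16

    this(char[] name) { file = new File; file.create(name); }

    void put(char[] s)
    {
        if (width == 0) width = 1;
        if (width == 1) file.writeString(s);          // raw, no length prefix
        else file.writeStringW(toUTF16(s));           // implicit conversion
    }

    void put(wchar[] s)
    {
        if (width == 0) width = 2;
        if (width == 2) file.writeStringW(s);
        else file.writeString(toUTF8(s));             // implicit conversion
    }

    void close() { file.close(); }
}
```

With this, the 1f.txt/2f.txt experiment above would produce files that 
are each a single, consistent UTF encoding, whichever string arrived 
first.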

---

While D is in pre-1.0, I think we should at first decide that streams 
have to be opened with the UTF encoding specified. Since the compiler 
should know the type of all the strings (see my other post today), it 
can then insert code for the appropriate runtime conversion.

Since the compiler knows the type of string, it might be suggested that 
the first output string defines the stream type.

I think it would be unwise. But _only_ for the same reason that D 
demands a default in a switch, denies a semicolon right after an if 
clause, etc.

That is, to help the programmer not shoot himself in the foot. There is 
_no_ technical reason why the type couldn't be set automatically by the 
first string.

HOWEVER, good table manners ask for reasonable defaults where at all 
possible. Such a default would be the UTF width and endianness that is 
"natural" on the particular platform.

(If D is ever ported to a platform that doesn't handle UTF, then the 
Natural Default of course is None. That is, one has to choose manually 
when opening the stream.)

---

Similarly, if we want to implement our INPUT streams correctly, then 
they should _definitively_ choose their UTF type before the first time 
the application gets to read from the stream.

FOR THE SITUATIONS where one has to process the first octets before 
enough of the stream has been seen to know which UTF type it is, THEN 
in THAT CASE an input stream of e.g. ubyte should be mandatory to use 
instead. Or, more to the point, UTF streams should not be used then.
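For the common case, the choice can be made by sniffing a BOM before 
the application reads anything. A sketch (detectUtf and the Utf enum 
are invented names; BOM-less streams would need heuristics or the 
platform default):

```d
enum Utf { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE }

// Inspect the first bytes of a stream for a byte order mark.
// Note: the UTF-32 checks must come before the UTF-16 ones, since
// the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE one (FF FE).
Utf detectUtf(ubyte[] head)
{
    if (head.length >= 4 && head[0] == 0xFF && head[1] == 0xFE
                         && head[2] == 0x00 && head[3] == 0x00)
        return Utf.Utf32LE;
    if (head.length >= 4 && head[0] == 0x00 && head[1] == 0x00
                         && head[2] == 0xFE && head[3] == 0xFF)
        return Utf.Utf32BE;
    if (head.length >= 3 && head[0] == 0xEF && head[1] == 0xBB
                         && head[2] == 0xBF)
        return Utf.Utf8;
    if (head.length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        return Utf.Utf16LE;
    if (head.length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        return Utf.Utf16BE;
    return Utf.Unknown;   // no BOM: fall back to a declared default
}
```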

---

I have to remark on "since the compiler knows the type of string" above. 
Since this is such rocket science, DO REMEMBER that it "knows" because 
it looks at the TYPE (as in char[], wchar[], dchar[]) and not the 
CONTENTS of the string at that time.

:-) Just to keep apples and oranges in order...

---

Incidentally, what I called the BOM above does not even look like a BOM 
should, in the above file dumps anyway.

---

Before we continue, I think everybody should read the following:

www.unicode.org/faq/

                             -- ** --
Nov 18 2005
next sibling parent reply Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:
Georg Wrede wrote:
<snip>
  $ cat /tmp/1f.txt
 Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
  $ cat /tmp/2f.txt
 Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
  $
 

 
 HOWEVER, I have a couple of issues here.
I think you have some serious issues with the political correctness of the message here ;)
Nov 18 2005
parent Georg Wrede <georg.wrede nospam.org> writes:
Jari-Matti Mäkelä wrote:
 Georg Wrede wrote:
 <snip>
 
  $ cat /tmp/1f.txt
 Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
  $ cat /tmp/2f.txt
 Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
  $



 HOWEVER, I have a couple of issues here.
I think you have some serious issues with the political correctness of the message here ;)
ROFL !

I trust the non-Finns use inborn Duck Typing. If it doesn't look like 
obscenities, then it isn't. :-)

Or maybe I have an encryptor that turns "Hail Mary" into that string.

Or maybe repeatedly drawing a picture out of iron wire has my hands 
bleeding, and I'm getting pissed off here. Maybe I should switch to 
clay models...

But hey, it was USASCII all over!
Nov 18 2005
prev sibling next sibling parent reply Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:
Georg Wrede wrote:
<snip>
 HOWEVER, I have a couple of issues here.
 
 First of all, it looks like we don't have implicit conversion, but 
 rather that the strings get copied to the output stream byte by byte!
write(...) writes the source value to the stream byte by byte.
 
 Now, the standard says that you are not allowed to have illegal octets, 
 or characters in a UTF file. Of any width.
 
 Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same 
 file!!!!!
 
 Further, seems like write puts the BOM before every string. That is 
 definitely illegal.
 
That is illegal if you're trying to create a valid _text_ file. AFAIK 
the normal File is just a regular OutputStream; it doesn't care about UTF.
 What we could have happen is, that the first string output to the 
 stream, causes the stream to choose the stream UTF width (and 
 theoretically the endianness, too). (This is what the OS does when 
 choosing whether to open in byte width or wider, according to linux 
 documentation.) And whenever somebody tries to stuff "the wrong" crap 
 there, do either of the following:
 
  - implicitly convert the string to the right UTF
  - throw error
I think this should not be the default for all streams. Maybe it would be better to have a new TextStream class that supports full Unicode?
Nov 18 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Jari-Matti Mäkelä wrote:
 Georg Wrede wrote:
 <snip>
 
 HOWEVER, I have a couple of issues here.

 First of all, it looks like we don't have implicit conversion, but 
 rather that the strings get copied to the output stream byte by byte!
write(...) writes the source value to the stream byte by byte.
Oops. Well, in that case, we should give it uchar[] when we don't want 
fanciness. Or void[], right!

Which should make it EITHER illegal to write [c/w/d]char[] to it -- OR 
we should have different kinds of streams. Some of which would be UTF 
savvy, some text, some void streams.
 Now, the standard says that you are not allowed to have illegal 
 octets, or characters in a UTF file. Of any width.

 Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same 
 file!!!!!

 Further, seems like write puts the BOM before every string. That is 
 definitely illegal.
That is illegal if you're trying to create a valid _text_ file. AFAIK the normal File is just a regular OutputStream, it doesn't care about UTF.
We should have a set of different streams. Hey, Java has like millions to choose from! You can even join them to get, say, a "buffered, character-code-translating, rot-13, foo-izing" stream!!!
 What we could have happen is, that the first string output to the 
 stream, causes the stream to choose the stream UTF width (and 
 theoretically the endianness, too). (This is what the OS does when 
 choosing whether to open in byte width or wider, according to linux 
 documentation.) And whenever somebody tries to stuff "the wrong" crap 
 there, do either of the following:

  - implicitly convert the string to the right UTF
  - throw error
I think this should not be the default for all streams. Maybe it would be better to have a new TextStream class that supports full Unicode?
Of course!
Nov 18 2005
parent Georg Wrede <georg.wrede nospam.org> writes:
Georg Wrede wrote:
 Jari-Matti Mäkelä wrote:
 Georg Wrede wrote: <snip>
 
 HOWEVER, I have a couple of issues here.
 
 First of all, it looks like we don't have implicit conversion,
 but rather that the strings get copied to the output stream byte
 by byte!
write(...) writes the source value to the stream byte by byte.
Oops. Well, in that case, we should give it uchar[] when we don't want fanciness. Or void[], right! Which should make it EITHER illegal to write [c/w/d]char[] to it -- OR we should have different kinds of streams. Some of which would be UTF savvy, some text, some void streams.
 Now, the standard says that you are not allowed to have illegal 
 octets, or characters in a UTF file. Of any width.
 
 Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the
 same file!!!!!
 
 Further, seems like write puts the BOM before every string. That
 is definitely illegal.
That is illegal if you're trying to create a valid _text_ file. AFAIK the normal File is just a regular OutputStream, it doesn't care about UTF.
We should have a set of different streams. Hey, Java has like millions to choose from! You can even join them to get, say, a "buffered, character-code-translating, rot-13, foo-izing" stream!!!
At first the Java style where one chains streams seemed terribly 
inefficient. But later I understood that it wasn't, it just looked 
inefficient.

We could have raw input and output streams, and then a set of 
conversion streams (or actually filters), like this:

    OutStream os = new OutStream("foo");   // opens a raw outstream
    StreamBuffer sb = new StreamBuffer(os);
    ConvStream out = new ConvStream(UTF8, ISO8859-15, sb);
    ...
    char[] mytext = "kjsldkfjlskdfjslkd";
    fwritefln(out, mytext);

Since StreamBuffer eventually outputs everything, one doesn't even have 
to worry about the buffer getting filled up in "mid-char" if doing 
output in UTF (not the example above), since the rest of the char gets 
output later anyhow.

I think this looks clean and easy to maintain (for the library 
maintainer), and its use is straightforward, flexible, and conceptually 
clear.

This would also bring tighter locality to the whole input/output 
system, since every stream only does its own thing.

With this setup it also becomes much easier for the programmer to write 
his own stream filters, without having to become a Stream Guru first.
Nov 23 2005
prev sibling parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Georg Wrede wrote:
 Further, seems like write puts the BOM before every string. That is 
 definitely illegal.
 
...
 
 What I called BOM above, does incidentally not look like it should, in 
 the above file dumps anyway.
 
Because that's not the BOM, it's (an int with) the string length...

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to 
be... unnatural."
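Indeed, this checks out against the dumps: the recurring four bytes 
0f 00 00 00 are a little-endian uint with value 0x0000000f = 15, which 
is exactly the length of "Saatana perkele". A quick confirmation, 
assuming the same D1 std.stream API as the original program:

```d
import std.stream;

void main()
{
    char[] s = "Saatana perkele";
    assert(s.length == 15);        // == 0x0f, the "BOM" in the dumps

    File f = new File;
    f.create("/tmp/len.txt");
    f.write(s);                    // writes the length, then the bytes
    f.close();
    // hexdump -C /tmp/len.txt starts with: 0f 00 00 00 53 61 61 74 ...
}
```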
Nov 18 2005