
digitalmars.D.bugs - A couple of issues with UTF

reply Georg Wrede <georg nospam.org> writes:
Haa, look ma, no hands!

So we have implicit UTF conversion.
The following compiles ok!



$ cat outtest2fixed.d
import std.stream;

void main()
{
    char[]  c = "Saatana perkele"c;
    wchar[] w = "Saatana perkele"w;

    File of1 = new File;
    File of2 = new File;

    of1.create("/tmp/1f.txt");
    of2.create("/tmp/2f.txt");

    of1.write(c);
    of1.write(c);
    of1.write(w);
    of1.write(w);

    of2.write(w);
    of2.write(w);
    of2.write(c);
    of2.write(c);

    of1.close();
    of2.close();
}
$ hexdump -C /tmp/1f.txt
00000000  0f 00 00 00 53 61 61 74  61 6e 61 20 70 65 72 6b  |....Saatana perk|
00000010  65 6c 65 0f 00 00 00 53  61 61 74 61 6e 61 20 70  |ele....Saatana p|
00000020  65 72 6b 65 6c 65 0f 00  00 00 53 00 61 00 61 00  |erkele....S.a.a.|
00000030  74 00 61 00 6e 00 61 00  20 00 70 00 65 00 72 00  |t.a.n.a. .p.e.r.|
00000040  6b 00 65 00 6c 00 65 00  0f 00 00 00 53 00 61 00  |k.e.l.e.....S.a.|
00000050  61 00 74 00 61 00 6e 00  61 00 20 00 70 00 65 00  |a.t.a.n.a. .p.e.|
00000060  72 00 6b 00 65 00 6c 00  65 00                    |r.k.e.l.e.|
$ hexdump -C /tmp/2f.txt
00000000  0f 00 00 00 53 00 61 00  61 00 74 00 61 00 6e 00  |....S.a.a.t.a.n.|
00000010  61 00 20 00 70 00 65 00  72 00 6b 00 65 00 6c 00  |a. .p.e.r.k.e.l.|
00000020  65 00 0f 00 00 00 53 00  61 00 61 00 74 00 61 00  |e.....S.a.a.t.a.|
00000030  6e 00 61 00 20 00 70 00  65 00 72 00 6b 00 65 00  |n.a. .p.e.r.k.e.|
00000040  6c 00 65 00 0f 00 00 00  53 61 61 74 61 6e 61 20  |l.e.....Saatana |
00000050  70 65 72 6b 65 6c 65 0f  00 00 00 53 61 61 74 61  |perkele....Saata|
00000060  6e 61 20 70 65 72 6b 65  6c 65                    |na perkele|
$ cat /tmp/1f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
$ cat /tmp/2f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
$



HOWEVER, I have a couple of issues here.

First of all, it looks like we don't have implicit conversion, but 
rather that the strings get copied to the output stream byte by byte!

Now, the standard says that you are not allowed to have illegal octets 
or characters in a UTF file, of any width.

Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!

Further, it seems like write puts the BOM before every string. That is 
definitely illegal.

(The operating system let me "cat" the files to screen, and tried its 
best to show them in a reasonable way (as you see above). But it really 
would not have had to.)

---

What we could have happen instead is that the first string output to 
the stream causes the stream to choose the stream's UTF width (and 
theoretically the endianness, too). (This is what the OS does when 
choosing whether to open in byte width or wider, according to Linux 
documentation.) And whenever somebody then tries to stuff "the wrong" 
kind of string there, do either of the following:

  - implicitly convert the string to the right UTF
  - throw error
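A minimal sketch of that policy, written against D1's std.stream and 
std.utf as used above. The class name UtfOutStream and the width field 
are my own invention, not an existing API; the first write fixes the 
stream's UTF width, and later writes of the other width get converted 
(the throw-an-error policy would replace the conversion branches):

```d
import std.stream;
import std.utf;

// Hypothetical wrapper: first write decides the stream's UTF width.
class UtfOutStream
{
    private File file;
    private int width = 0;      // 0 = undecided, 1 = UTF-8, 2 = UTF-16

    this(char[] name) { file = new File; file.create(name); }

    void put(char[] s)
    {
        if (width == 0) width = 1;
        if (width == 1) file.writeString(s);          // raw, no length prefix
        else file.writeStringW(toUTF16(s));           // implicit conversion
    }

    void put(wchar[] s)
    {
        if (width == 0) width = 2;
        if (width == 2) file.writeStringW(s);
        else file.writeString(toUTF8(s));             // implicit conversion
    }

    void close() { file.close(); }
}
```

With this, the 1f.txt/2f.txt experiment above would produce files that 
are each a single, consistent UTF encoding, whichever string arrived 
first.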

---

While D is in pre-1.0, I think we should at first decide that streams 
have to be opened with the UTF encoding specified. Since the compiler 
should know the type of all the strings (see my other post today), it 
can then insert code for the appropriate runtime conversion.

Since the compiler knows the type of string, it might be suggested that 
the first output string defines the stream type.

I think it would be unwise. But _only_ for the same reason that D 
demands a default in a switch, denies a semicolon right after an if 
clause, etc.

That is, to help the programmer not shoot himself in the foot. There is 
_no_ technical reason why the type couldn't be set automatically by the 
first string.

HOWEVER, good table manners ask for reasonable defaults where at all 
possible. Such a default would be the UTF width and endianness that is 
"natural" on the particular platform.

(If D is ever ported to a platform that doesn't handle UTF, then the 
Natural Default of course is None. That is, one has to choose manually 
when opening the stream.)

---

Similarly, if we want to implement our INPUT streams correctly, then 
they should _definitively_ choose their UTF type before the first time 
the application gets to read from the stream.

FOR THE SITUATIONS where one has to process the first octets before 
enough of the stream has been seen to know which UTF type it is, THEN 
in THAT CASE an input stream of e.g. ubyte should be mandatory to use 
instead. Or, more to the point, UTF streams should not be used then.
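For the common case, the choice can be made by sniffing a BOM before 
the application reads anything. A sketch (detectUtf and the Utf enum 
are invented names; BOM-less streams would need heuristics or the 
platform default):

```d
enum Utf { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE }

// Inspect the first bytes of a stream for a byte order mark.
// Note: the UTF-32 checks must come before the UTF-16 ones, since
// the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE one (FF FE).
Utf detectUtf(ubyte[] head)
{
    if (head.length >= 4 && head[0] == 0xFF && head[1] == 0xFE
                         && head[2] == 0x00 && head[3] == 0x00)
        return Utf.Utf32LE;
    if (head.length >= 4 && head[0] == 0x00 && head[1] == 0x00
                         && head[2] == 0xFE && head[3] == 0xFF)
        return Utf.Utf32BE;
    if (head.length >= 3 && head[0] == 0xEF && head[1] == 0xBB
                         && head[2] == 0xBF)
        return Utf.Utf8;
    if (head.length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        return Utf.Utf16LE;
    if (head.length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        return Utf.Utf16BE;
    return Utf.Unknown;   // no BOM: fall back to a declared default
}
```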

---

I have to remark on "since the compiler knows the type of string" above. 
Since this is such rocket science, DO REMEMBER that it "knows" because 
it looks at the TYPE (as in char[], wchar[], dchar[]) and not the 
CONTENTS of the string at that time.

:-) Just to keep apples and oranges in order...

---

Incidentally, what I called the BOM above does not even look like a BOM 
should, in the above file dumps anyway.

---

Before we continue, I think everybody should read the following:

www.unicode.org/faq/

                             -- ** --
Nov 18 2005
next sibling parent reply Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:
Georg Wrede wrote:
<snip>
  $ cat /tmp/1f.txt
 Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
  $ cat /tmp/2f.txt
 Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
  $
 

 
 HOWEVER, I have a couple of issues here.
I think you have some serious issues with the political correctness of the message here ;)
Nov 18 2005
parent Georg Wrede <georg.wrede nospam.org> writes:
Jari-Matti Mäkelä wrote:
 Georg Wrede wrote:
 <snip>
 
  $ cat /tmp/1f.txt
 Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
  $ cat /tmp/2f.txt
 Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
  $



 HOWEVER, I have a couple of issues here.
I think you have some serious issues with the political correctness of the message here ;)
ROFL !

I trust the non-Finns use inborn Duck Typing. If it doesn't look like 
obscenities, then it isn't. :-)

Or maybe I have an encryptor that turns "Hail Mary" into that string.

Or maybe repeatedly drawing a picture out of iron wire has my hands 
bleeding, and I'm getting pissed off here. Maybe I should switch to 
clay models...

But hey, it was USASCII all over!
Nov 18 2005
prev sibling next sibling parent reply Jari-Matti Mäkelä <jmjmak invalid_utu.fi> writes:
Georg Wrede wrote:
<snip>
 HOWEVER, I have a couple of issues here.
 
 First of all, it looks like we don't have implicit conversion, but 
 rather that the strings get copied to the output stream byte by byte!
write(...) writes the source value to the stream byte by byte.
 
 Now, the standard says that you are not allowed to have illegal octets, 
 or characters in a UTF file. Of any width.
 
 Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same 
 file!!!!!
 
 Further, seems like write puts the BOM before every string. That is 
 definitely illegal.
 
That is illegal if you're trying to create a valid _text_ file. AFAIK 
the normal File is just a regular OutputStream; it doesn't care about UTF.
 What we could have happen is, that the first string output to the 
 stream, causes the stream to choose the stream UTF width (and 
 theoretically the endianness, too). (This is what the OS does when 
 choosing whether to open in byte width or wider, according to linux 
 documentation.) And whenever somebody tries to stuff "the wrong" crap 
 there, do either of the following:
 
  - implicitly convert the string to the right UTF
  - throw error
I think this should not be the default for all streams. Maybe it would be better to have a new TextStream class that supports full Unicode?
Nov 18 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Jari-Matti Mäkelä wrote:
 Georg Wrede wrote:
 <snip>
 
 HOWEVER, I have a couple of issues here.

 First of all, it looks like we don't have implicit conversion, but 
 rather that the strings get copied to the output stream byte by byte!
write(...) writes the source value to the stream byte by byte.
Oops. Well, in that case, we should give it uchar[] when we don't want 
fanciness. Or void[], right!

Which should make it EITHER illegal to write [c/w/d]char[] to it -- OR 
we should have different kinds of streams. Some of which would be UTF 
savvy, some text, some void streams.
 Now, the standard says that you are not allowed to have illegal 
 octets, or characters in a UTF file. Of any width.

 Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same 
 file!!!!!

 Further, seems like write puts the BOM before every string. That is 
 definitely illegal.
That is illegal if you're trying to create a valid _text_ file. AFAIK the normal File is just a regular OutputStream, it doesn't care about UTF.
We should have a set of different streams. Hey, Java has like millions to choose from! You can even join them to get, say, a "buffered, character-code-translating, rot-13, foo-izing" stream!!!
 What we could have happen is, that the first string output to the 
 stream, causes the stream to choose the stream UTF width (and 
 theoretically the endianness, too). (This is what the OS does when 
 choosing whether to open in byte width or wider, according to linux 
 documentation.) And whenever somebody tries to stuff "the wrong" crap 
 there, do either of the following:

  - implicitly convert the string to the right UTF
  - throw error
I think this should not be the default for all streams. Maybe it would be better to have a new TextStream class that supports full Unicode?
Of course!
Nov 18 2005
parent Georg Wrede <georg.wrede nospam.org> writes:
Georg Wrede wrote:
 Jari-Matti Mäkelä wrote:
 Georg Wrede wrote: <snip>
 
 HOWEVER, I have a couple of issues here.
 
 First of all, it looks like we don't have implicit conversion,
 but rather that the strings get copied to the output stream byte
 by byte!
write(...) writes the source value to the stream byte by byte.
Oops. Well, in that case, we should give it uchar[] when we don't want fanciness. Or void[], right! Which should make it EITHER illegal to write [c/w/d]char[] to it -- OR we should have different kinds of streams. Some of which would be UTF savvy, some text, some void streams.
 Now, the standard says that you are not allowed to have illegal 
 octets, or characters in a UTF file. Of any width.
 
 Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the
 same file!!!!!
 
 Further, seems like write puts the BOM before every string. That
 is definitely illegal.
That is illegal if you're trying to create a valid _text_ file. AFAIK the normal File is just a regular OutputStream, it doesn't care about UTF.
We should have a set of different streams. Hey, Java has like millions to choose from! You can even join them to get, say, a "buffered, character-code-translating, rot-13, foo-izing" stream!!!
At first the Java style where one chains streams seemed terribly 
inefficient. But later I understood that it wasn't, it just looked 
inefficient.

We could have raw input and output streams, and then a set of 
conversion streams (or actually filters), like this:

    OutStream os = new OutStream("foo");   // opens a raw outstream
    StreamBuffer sb = new StreamBuffer(os);
    ConvStream out = new ConvStream(UTF8, ISO8859-15, sb);
    ...
    char[] mytext = "kjsldkfjlskdfjslkd";
    fwritefln(out, mytext);

Since StreamBuffer eventually outputs everything, one doesn't even have 
to worry about the buffer getting filled up in "mid-char" if doing 
output in UTF (not the example above), since the rest of the char gets 
output later anyhow.

I think this looks clean and easy to maintain (for the library 
maintainer), and its use is straightforward, flexible, and conceptually 
clear.

This would also bring tighter locality to the whole input/output 
system, since every stream only does its own thing.

With this setup it also becomes much easier for the programmer to write 
his own stream filters, without having to become a Stream Guru first.
Nov 23 2005
prev sibling parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Georg Wrede wrote:
 Further, seems like write puts the BOM before every string. That is 
 definitely illegal.
 
...
 
 What I called BOM above, does incidentally not look like it should, in 
 the above file dumps anyway.
 
Because that's not the BOM, it's (an int with) the string length...

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to 
be... unnatural."
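Indeed, this checks out against the dumps: the recurring four bytes 
0f 00 00 00 are a little-endian uint with value 0x0000000f = 15, which 
is exactly the length of "Saatana perkele". A quick confirmation, 
assuming the same D1 std.stream API as the original program:

```d
import std.stream;

void main()
{
    char[] s = "Saatana perkele";
    assert(s.length == 15);        // == 0x0f, the "BOM" in the dumps

    File f = new File;
    f.create("/tmp/len.txt");
    f.write(s);                    // writes the length, then the bytes
    f.close();
    // hexdump -C /tmp/len.txt starts with: 0f 00 00 00 53 61 61 74 ...
}
```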
Nov 18 2005