digitalmars.D - Non UTF characters in comments

Vathix <vathix dprogramming.com> writes:

I'm not sure if allowing non UTF characters in comments is such a good  
idea. It seems to be complicating my parser, and it will probably  
complicate other things like text/code editors. What is supposed to happen  
when a non UTF character is encountered? Should it display a question  
mark, display nothing, use the current code page? What if the editor  
doesn't know about D's comments?
I might not have mentioned this, but since D is supposed to be easily
parsed, this might be an issue; a special case.
- Chris

Jan 29 2005

Thomas Kühne writes:

Vathix wrote:
| I'm not sure if allowing non UTF characters in comments is such a good
| idea. It seems to be complicating my parser, and it will probably
| complicate other things like text/code editors. What is supposed to
| happen  when a non UTF character is encountered? Should it display a
| question  mark, display nothing, use the current code page? What if the
| editor  doesn't know about D's comments?

Maybe I am misreading your post.
Are you trying to use 2 different encodings in one file?

Concerning Unicode: you are supposed to display the glyph of U+FFFD for
all characters that can't be displayed by other means - e.g. a generic
glyph displaying the codepoint or the code range. (Depending on your
situation you might also use U+FFFC).
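
A minimal sketch of that substitution rule, written in C under stated assumptions: `decode_or_replace` and `REPLACEMENT_CHAR` are hypothetical names, not part of DMD or any library. It decodes one code point at a time and hands back U+FFFD whenever the bytes are not well-formed UTF-8, so a viewer can draw the replacement glyph and keep going.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical helper: decode one code point starting at s[*idx].
 * On malformed input it returns U+FFFD and skips a single byte,
 * so the caller always makes progress. */
uint32_t decode_or_replace(const unsigned char *s, size_t len, size_t *idx)
{
    assert(s != NULL && idx != NULL && *idx < len);

    size_t i = *idx;
    unsigned char b = s[i];
    uint32_t c;      /* code point being assembled */
    size_t need;     /* continuation bytes expected after the lead byte */
    uint32_t min;    /* smallest code point allowed for this length */

    if (b < 0x80) {
        *idx = i + 1;
        return b;                       /* plain ASCII */
    } else if ((b & 0xE0) == 0xC0) {
        c = b & 0x1F; need = 1; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
        c = b & 0x0F; need = 2; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
        c = b & 0x07; need = 3; min = 0x10000;
    } else {
        *idx = i + 1;
        return REPLACEMENT_CHAR;        /* stray continuation or bad lead byte */
    }

    if (len - i - 1 < need) {
        *idx = i + 1;
        return REPLACEMENT_CHAR;        /* sequence truncated by end of input */
    }
    for (size_t k = 1; k <= need; k++) {
        if ((s[i + k] & 0xC0) != 0x80) {
            *idx = i + 1;
            return REPLACEMENT_CHAR;    /* missing continuation byte */
        }
        c = (c << 6) | (s[i + k] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        *idx = i + 1;
        return REPLACEMENT_CHAR;        /* overlong, out of range, or surrogate */
    }
    *idx = i + 1 + need;
    return c;
}
```

Fed the Latin-1 bytes of a name like "Björklund", this would give U+FFFD for the lone 0xF6 byte and plain ASCII for everything else.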

http://www.unicode.org

Thomas

Jan 29 2005

Vathix <vathix dprogramming.com> writes:

> Are you trying to use 2 different encodings in one file?

Looks like DMD allows that in comments and I don't think it's a good idea.

Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Vathix wrote:

>>> I'm not sure if allowing non UTF characters in comments is such a good idea.
>>
>> Are you trying to use 2 different encodings in one file?
>
> Looks like DMD allows that in comments and I don't think it's a good idea.

The current lexer just skips all bytes in comments,
until it finds the end of the current comment run.

And that's probably not a good idea, but simpler...
(otherwise you would have to check all non-ASCIIs)


You still cannot use such invalid UTF sequences for
anything such as identifiers or strings, though...

Just consider it a bug in the current DMD front-end ?
(i.e. don't abuse this, since it'll be fixed one day)


Says http://www.digitalmars.com/d/lex.html:

  D source text can be in one of the following formats:
 
     * ASCII
     * UTF-8
     * UTF-16BE
     * UTF-16LE
     * UTF-32BE
     * UTF-32LE 

This implies that *all* source input should be
valid UTF (since ASCII is also valid as UTF-8)

It *should* just stop dead when it finds one, e.g.:
error("invalid UTF-8 sequence");
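
The "stop dead" behaviour could look roughly like the following C sketch. It is not the DMD lexer: `scan_block_comment` and `utf8_trail_count` are made-up names, and the check is purely structural (a lead byte followed by the right number of continuation bytes). Overlong forms, surrogates and range limits are not checked here.

```c
#include <assert.h>
#include <stddef.h>

/* Number of continuation bytes implied by a UTF-8 lead byte,
 * or -1 if the byte cannot start a sequence. */
static int utf8_trail_count(unsigned char b)
{
    if (b < 0x80) return 0;
    if ((b & 0xE0) == 0xC0) return 1;
    if ((b & 0xF0) == 0xE0) return 2;
    if ((b & 0xF8) == 0xF0) return 3;
    return -1;
}

/* Hypothetical lexer routine. src[0..len) starts right after the opening
 * slash-star of a block comment. On success it returns NULL and stores in
 * *end the offset just past the closing star-slash. On failure it returns
 * an error message instead of silently skipping the bad bytes. */
const char *scan_block_comment(const unsigned char *src, size_t len, size_t *end)
{
    assert(src != NULL && end != NULL);

    size_t i = 0;
    while (i < len) {
        if (src[i] == '*' && i + 1 < len && src[i + 1] == '/') {
            *end = i + 2;
            return NULL;
        }
        int trail = utf8_trail_count(src[i]);
        if (trail < 0 || (size_t)trail > len - i - 1)
            return "invalid UTF-8 sequence";
        for (int k = 1; k <= trail; k++) {
            if ((src[i + k] & 0xC0) != 0x80)
                return "invalid UTF-8 sequence";
        }
        i += 1 + (size_t)trail;   /* step over the whole sequence */
    }
    return "unterminated comment";
}
```

Because whole sequences are stepped over, the closing star-slash is only ever looked for at a character boundary, which is the extra work the "just skip all bytes" approach avoids.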
		
--anders

PS. A nice feature would be to have the frontend
     convert from other encodings as well, but it
     would just add unneeded complexity since there
     are a *lot* of possible encodings out there (200)

Jan 29 2005

Thomas Kühne writes:

Anders F Björklund wrote:

| Vathix wrote:
|
|>>> I'm not sure if allowing non UTF characters in comments is such a
|>>> good idea.
|
|
|>> Are you trying to use 2 different encodings in one file?
|>
|>
|> Looks like DMD allows that in comments and I don't think it's a good
|> idea.
|
|
| The current lexer just skips all bytes in comments,
| until it finds the end of the current comment run.

[snip]

| It *should* just stop dead when it finds one, e.g.:
| error("invalid UTF-8 sequence");

I don't think the compiler should try to check the comment's content.

What is an "invalid" UTF-8 sequence?

How would you e.g. handle Java's pre 1.5 "customised" UTF-8?
(encoding code points >U-FFFF as UTF-16 surrogates, each surrogate
encoded as its own 3-byte UTF-8 sequence)
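
To make the difference concrete, here is a small C sketch of the two byte forms for a code point above U+FFFF: regular four-byte UTF-8, and the surrogate-pair form described above (often called CESU-8). The function names are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a value in the range 0x800..0xFFFF as three UTF-8-style bytes. */
static void put3(uint32_t u, unsigned char *out)
{
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}

/* Standard UTF-8: a supplementary code point takes four bytes. */
size_t encode_supplementary_utf8(uint32_t c, unsigned char out[4])
{
    assert(c >= 0x10000 && c <= 0x10FFFF);
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}

/* Surrogate-pair form: split the code point into a UTF-16 high and low
 * surrogate, then write each surrogate as its own three-byte sequence. */
size_t encode_supplementary_as_surrogates(uint32_t c, unsigned char out[6])
{
    assert(c >= 0x10000 && c <= 0x10FFFF);
    uint32_t v = c - 0x10000;
    put3(0xD800 | (v >> 10), out);        /* high surrogate */
    put3(0xDC00 | (v & 0x3FF), out + 3);  /* low surrogate */
    return 6;
}
```

A decoder that rejects surrogates would treat both halves of the six-byte form as errors, even though the producer meant a single character.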

- Granted, we might agree on overlong sequences, but how about
  unassigned codepoints?
- Has the input to be normalized? What normalization?
- Are you going to enforce the full Unicode spec? What spec version?
- How about the PUA?
- How about >U-11FFFD?
- Is U-FFFD/U-FFFC allowed?

Thomas



Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Thomas Kühne wrote:

> | It *should* just stop dead when it finds one, e.g.:
> | error("invalid UTF-8 sequence");
>
> I don't think the compiler should try to check the comment's content.

Why not ? It checks the rest of the file...

> What is an "invalid" UTF-8 sequence?

I just think it should treat comments the
same way it treats identifiers and literals ?

That is, call: utf_decodeChar and follow
whatever error that it returns... (utf.c)

--anders

Jan 29 2005

Thomas Kühne writes:

Anders F Björklund wrote:

| Thomas Kühne wrote:
|
|> | It *should* just stop dead when it finds one, e.g.:
|> | error("invalid UTF-8 sequence");
|>
|> I don't think the compiler should try to check the comment's content.
|
|
| Why not ? It checks the rest of the file...
|
|> What is an "invalid" UTF-8 sequence?
|
| I just think it should treat comments the
| same way it treats identifiers and literals ?
|
| That is, call: utf_decodeChar and follow
| whatever error that it returns... (utf.c)

The current checks for identifiers are:
1) shortest possible byte sequence for UTF-8
OK

2) no lone surrogate part
That might clash with pre 1.5 Java output.
This is a Java bug, thus can be ignored.

3) c <= 0x10FFFF
OK

4) c != 0xFFFE && c != 0xFFFF
That's the only check I reject. Those codepoints can occur if a
non-Unicode document is converted to UTF encoded Unicode. Inside of
comments they shouldn't stop the parsing.

Those checks above are - except for the 4th - reasonable for comments.
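
Written out as code, the four checks might look like this C sketch. It illustrates the list above and is not the code in dmd/utf.c: `check_code_point`, its parameters and the error strings are assumptions, and the `in_comment` flag models the proposal to skip check 4 inside comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fewest UTF-8 bytes that can encode c (assumes c <= 0x10FFFF). */
static int utf8_min_length(uint32_t c)
{
    if (c < 0x80) return 1;
    if (c < 0x800) return 2;
    if (c < 0x10000) return 3;
    return 4;
}

/* c is an already decoded code point and nbytes the number of bytes it
 * occupied in the input. Returns NULL if accepted, else an error string. */
const char *check_code_point(uint32_t c, int nbytes, int in_comment)
{
    assert(nbytes >= 1 && nbytes <= 4);

    /* check 3: upper bound of the Unicode code space (done first so the
     * length helper below only sees in-range values) */
    if (c > 0x10FFFF)
        return "code point out of range";
    /* check 1: shortest possible byte sequence */
    if (nbytes > utf8_min_length(c))
        return "overlong sequence";
    /* check 2: no surrogate halves in UTF-8 */
    if (c >= 0xD800 && c <= 0xDFFF)
        return "surrogate code point";
    /* check 4: U+FFFE and U+FFFF, tolerated here inside comments */
    if (!in_comment && (c == 0xFFFE || c == 0xFFFF))
        return "noncharacter";
    return NULL;
}
```

With `in_comment` set, U+FFFE and U+FFFF pass while the other three checks still apply.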

Thomas



Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Thomas Kühne wrote:

> 4) c != 0xFFFE && c != 0xFFFF
> That's the only check I reject. Those codepoints can occur if a
> non-Unicode document is converted to UTF encoded Unicode. Inside of
> comments they shouldn't stop the parsing.
>
> Those checks above are - except for the 4th - reasonable for comments.

If needed, that can be hacked around for comments, for those two.

 	s = utf_decodeChar(octet, ndigits, &idx, &c);
 	if (s || idx != ndigits)

can be changed into:

	s = utf_decodeChar(octet, ndigits, &idx, &c);
	if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)

Would that make it more reasonable ? (have the DMD patch ready...)

--anders

Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

I wrote:

> Looks like DMD allows that in comments and I don't think it's a good
> idea.

[...]
> Would that make it more reasonable ? (have the DMD patch ready...)

Hilarious, the new patch made phobos fail:

 ../gcc-3.4.3/gcc/d/phobos/std/loader.d:62: invalid UTF-8 sequence

Due to this little comment line, from GDC:

    Modified by David Friedman, October 2004 (applied patches from Anders F Björklund.)

(as the ö here was in Latin-1, you see...)

--anders

Jan 29 2005

Thomas Kühne writes:

Anders F Björklund wrote:
| Thomas Kühne wrote:
|
|> 4) c != 0xFFFE && c != 0xFFFF That's the only check I reject. Those
|> codepoints can occur if a non-Unicode document is converted to UTF
|> encoded Unicode. Inside of comments they shouldn't stop the
|> parsing.
|>
|> Those checks above are - except for the 4th - reasonable for
|> comments.
|
|
| If needed, that can be hacked around for comments, for those two.
|
|> s = utf_decodeChar(octet, ndigits, &idx, &c);
|> if (s || idx != ndigits)
|
|
| can be changed into:
|
| s = utf_decodeChar(octet, ndigits, &idx, &c);
| if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
|
| Would that make it more reasonable ? (have the DMD patch ready...)

Have a look at utf_decodeChar:
dmd/utf.c:92 and dmd/utf.c:183 ;)

While looking through utf.c I noticed that UTF-32 decoding doesn't
undergo any checks. I'll write a bunch of test cases for all those
encoding issues tomorrow.

Thomas


Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Thomas Kühne wrote:

> | can be changed into:
> |
> | s = utf_decodeChar(octet, ndigits, &idx, &c);
> | if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
> |
> | Would that make it more reasonable ? (have the DMD patch ready...)
>
> Have a look at utf_decodeChar:
> dmd/utf.c:92 and dmd/utf.c:183 ;)

Yes, the idea is that it will not be valid and
return string "invalid UTF-8 sequence", which
is then ignored because the char is FFFE/F...
(all input is converted to UTF-8 before lexer)

The patch is in the digitalmars.D.bugs group.

--anders

Jan 29 2005

"Andrew Fedoniouk" <news terrainformatica.com> writes:

What does "non UTF character" mean?
UTFs are the forms of representing/encoding the full UNICODE table - 21-bit
characters (code points).

So "non UTF character" sounds to me like "non UNICODE character". And what is
that?
Some new alphabet?

Andrew Fedoniouk.
http://terrainformatica.com



"Vathix" <vathix dprogramming.com> wrote in message 
news:opsldc5xwrkcck4r esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris

Jan 29 2005

Sebastian Beschke <s.beschke gmx.de> writes:

Andrew Fedoniouk wrote:
> What does "non UTF character" mean?
> UTFs are the forms of representing/encoding the full UNICODE table - 21-bit
> characters (code points).
>
> So "non UTF character" sounds to me like "non UNICODE character". And what is
> that?
> Some new alphabet?

Invalid sequences *are* possible, either by using code points in the table that
aren't defined or by malforming UTF-8 or UTF-16 sequences.

-Sebastian

Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Sebastian Beschke wrote:

> Invalid sequences *are* possible, either by using code points in the table that
> aren't defined or by malforming UTF-8 or UTF-16 sequences.

A simple way to do it is to try to interpret a file in Latin-1 as UTF-8.
That'll give you "invalid UTF-8 sequence" for nearly everything outside ASCII.

--anders

Jan 29 2005

"Andrew Fedoniouk" <news terrainformatica.com> writes:

I see. Thanks, Sebastian.

But text with erroneous UTF sequences (not "non UTF character", sic! ) will
not be compiled anyway.

So for "... other things like text/code editors. What is supposed to happen
when a non UTF character is encountered?..." a (good) editor should mark
them as "bad string literal" or the like.

Andrew Fedoniouk.
http://terrainformatica.com


"Sebastian Beschke" <s.beschke gmx.de> wrote in message 
news:ctgkta$48a$1 digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> What does "non UTF character" mean?
>> UTFs are the forms of representing/encoding the full UNICODE table - 21-bit
>> characters (code points).
>>
>> So "non UTF character" sounds to me like "non UNICODE character". And what
>> is that?
>> Some new alphabet?
>
> Invalid sequences *are* possible, either by using code points in the table that
> aren't defined or by malforming UTF-8 or UTF-16 sequences.
>
> -Sebastian

Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Andrew Fedoniouk wrote:

> But text with erroneous UTF sequences (not "non UTF character", sic! ) will
> not be compiled anyway.

A bug in the current DMD makes it allow almost everything, in comments.

--anders

Jan 29 2005

"Andrew Fedoniouk" <news terrainformatica.com> writes:

> A bug in the current DMD makes it allow almost everything, in comments.

It's a feature rather than a bug.

Preparation for attributed programming I guess. With option to include
binary data inline :)
I can imagine properties/methods having their own descriptive GIFs given in
source text as bytes. Le Cauchemar!

BTW: Are there any ports of png/jpeg/gif libs in D?

Andrew Fedoniouk.
http://terrainformatica.com

Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Andrew Fedoniouk wrote:

> Preparation for attributed programming I guess. With option to include
> binary data inline :)

:-)

No, it's a bug. D source code is supposed to be valid UTF-8/16/32.

Ideally, the HTML used should be made to be valid XHTML as well...

--anders

Jan 29 2005

"Andrew Fedoniouk" <news terrainformatica.com> writes:

> Ideally, the HTML used should be made to be valid XHTML as well...

Oh, dreams, dreams... this famous XHTML will never be a "lingua franca" as
is HTML.
The first browser that enforces showing *only valid* or even only
well-formed docs will die in the market for many reasons.

Even the XHTML standard says that a UA (user agent - browser) should try to show
partial content.
Strictly speaking, partial content of an XML doc is not well-formed XML, thus
invalid.

So browsers will show invalid XHTML anyway. So there will not be a strong
motivation to use valid XHTML,
so XHTML will lose its 'X' and become just HTML v.5,6, etc....

And I haven't even mentioned CSS grammar, where whitespace is an operator with
different meaning in different places.... Parsing nightmare.
I love this game...

What kind of grammar does D use? Is it context free, BTW?

Andrew Fedoniouk.
http://terrainformatica.com

Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Andrew Fedoniouk wrote:

> Oh, dreams, dreams... this famous XHTML will never be a "lingua franca" as
> is HTML.

XHTML *is* both HTML and XML at once, which is why it is so useful...
Just as UTF-8 is both ASCII and Unicode at once, best of both worlds.

> The first browser that enforces showing *only valid* or even only
> well-formed docs will die in the market for many reasons.

That certain browsers display it anyway is no reason for invalid
markup. And it's easy to verify too, using http://validator.w3.org/ ?

--anders

Jan 30 2005

Derek <derek psyc.ward> writes:

On Sun, 30 Jan 2005 01:11:47 +0100, Anders F Björklund wrote:

> Andrew Fedoniouk wrote:
>
>> Preparation for attributed programming I guess. With option to include
>> binary data inline :)
>
> :-)
>
> No, it's a bug. D source code is supposed to be valid UTF-8/16/32.
>
> Ideally, the HTML used should be made to be valid XHTML as well...
>
> --anders

One could argue that comments are not actual D source code ;-)

Consider: does the compiler need the comments to create the application?
If the comments are not for the compiler, then why should it care what's in
the comments?

-- 
Derek
Melbourne, Australia

Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Derek wrote:

> One could argue that comments are not actual D source code ;-)
> Consider: does the compiler need the comments to create the application?

The source code is not just for the compiler. (I meant "compiler input")

> If the comments are not for the compiler, then why should it care what's in
> the comments?

It makes other stuff like parsers easier, if all .d files are valid UTF.

--anders

Jan 30 2005

Vathix <vathix dprogramming.com> writes:

> So "non UTF character" sounds to me like "non UNICODE character". And
> what is that?
> Some new alphabet?

A value in the file that causes std.utf functions to throw an exception  
because it's invalid. I'm not good at this stuff and I don't know all the  
proper terminology.

Jan 29 2005

"Walter" <newshound digitalmars.com> writes:

"Vathix" <vathix dprogramming.com> wrote in message
news:opsldc5xwrkcck4r esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris

Technically it's an error to have non-UTF characters anywhere in the source.

Jan 29 2005

Brian Chapman <nospam-for-brian see-post-for-address.net> writes:

On 2005-01-29 08:57:23 -0600, Vathix <vathix dprogramming.com> said:

> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to
> happen when a non UTF character is encountered? Should it display a
> question mark, display nothing, use the current code page? What if the
> editor doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris

I don't know what your situation/environment is, but you could just try
piping the source files through the GNU iconv utility to convert the
files to whatever encoding you want.
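
For the common case of Latin-1 sources the conversion is simple enough to sketch by hand in C. iconv itself handles far more encodings; an invocation along the lines of `iconv -f ISO-8859-1 -t UTF-8 old.d > new.d` should do the same job, with `old.d` and `new.d` as placeholder file names. `latin1_to_utf8` is a hypothetical helper, and it assumes the input really is Latin-1.

```c
#include <assert.h>
#include <stddef.h>

/* Convert len Latin-1 bytes to UTF-8.
 * out must have room for 2 * len bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    assert(in != NULL && out != NULL);

    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;                                   /* ASCII stays as is */
        } else {
            /* Latin-1 0x80..0xFF maps to U+0080..U+00FF: two UTF-8 bytes */
            out[n++] = (unsigned char)(0xC0 | (b >> 6));
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return n;
}
```

The Latin-1 byte 0xF6 (ö) comes out as the two bytes 0xC3 0xB6, which is the form a strict UTF-8 check accepts.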

Jan 29 2005
