digitalmars.D - Non UTF characters in comments

reply Vathix <vathix dprogramming.com> writes:
I'm not sure if allowing non UTF characters in comments is such a good  
idea. It seems to be complicating my parser, and it will probably  
complicate other things like text/code editors. What is supposed to happen  
when a non UTF character is encountered? Should it display a question  
mark, display nothing, use the current code page? What if the editor  
doesn't know about D's comments?
I might not have mentioned this, but since D is supposed to be easily  
parsed, this might be an issue; a special case.
- Chris
Jan 29 2005
next sibling parent reply Thomas Kühne writes:
Vathix wrote:
| I'm not sure if allowing non UTF characters in comments is such a good
| idea. It seems to be complicating my parser, and it will probably
| complicate other things like text/code editors. What is supposed to
| happen  when a non UTF character is encountered? Should it display a
| question  mark, display nothing, use the current code page? What if the
| editor  doesn't know about D's comments?

Maybe I am misreading your post.
Are you trying to use 2 different encodings in one file?

Concerning Unicode: you are supposed to display the glyph of U+FFFD for
all characters that can't be displayed by other means - e.g. a generic
glyph displaying the codepoint or the code range. (Depending on your
situation you might also use U+FFFC).

http://www.unicode.org

Thomas
Jan 29 2005
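
The U+FFFD convention Thomas mentions can be illustrated with a short sketch. It uses Python rather than D, and the input bytes are a made-up example: the name "Kühne" stored as ISO-8859-1, so the "ü" is the single byte 0xFC, which is not valid UTF-8.

```python
# Hypothetical input: "Kühne" stored as ISO-8859-1, so "ü" is the byte 0xFC.
raw = b"K\xfchne"

# errors="replace" substitutes U+FFFD (REPLACEMENT CHARACTER) for each
# byte sequence that cannot be decoded as UTF-8.
shown = raw.decode("utf-8", errors="replace")

print(shown)
print([hex(ord(ch)) for ch in shown])
```

An editor that follows this convention would draw the replacement glyph in place of the bad byte instead of guessing a code page.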
parent reply Vathix <vathix dprogramming.com> writes:
 Are you trying to use 2 different encodings in one file?
Looks like DMD allows that in comments and I don't think it's a good idea.
Jan 29 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Vathix wrote:

 I'm not sure if allowing non UTF characters in comments is such a good idea.
 Are you trying to use 2 different encodings in one file?
Looks like DMD allows that in comments and I don't think it's a good idea.
The current lexer just skips all bytes in comments, until it finds
the end of the current comment run. And that's probably not a good
idea, but simpler... (otherwise you would have to check all non-ASCIIs)

You still cannot use such invalid UTF sequences for anything
such as identifiers or strings, though...

Just consider it a bug in the current DMD front-end ?
(i.e. don't abuse this, since it'll be fixed one day)

Says http://www.digitalmars.com/d/lex.html:
  D source text can be in one of the following formats:
 
     * ASCII
     * UTF-8
     * UTF-16BE
     * UTF-16LE
     * UTF-32BE
     * UTF-32LE 
This implies that *all* source input should be valid UTF
(since ASCII is also valid as UTF-8)

It *should* just stop dead when it finds one, e.g.:
error("invalid UTF-8 sequence");

--anders

PS. A nice feature would be to have the frontend convert from other
encodings as well, but it would just add unneeded complexity since
there are a *lot* of possible encodings out there (200)
Jan 29 2005
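
As a rough illustration of the "stop dead" behaviour Anders asks for, here is a minimal Python sketch that validates a whole source buffer, comments included, and reports the line of the first bad byte. The function name `check_source`, the default file name and the message layout are assumptions made for this example; it covers only the UTF-8 case and is not how the DMD front-end is written.

```python
from typing import Optional


def check_source(data: bytes, filename: str = "example.d") -> Optional[str]:
    """Return a compiler-style error message, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding: raises on the first bad byte
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the offending sequence;
        # count the newlines before it to get a 1-based line number.
        line = data.count(b"\n", 0, exc.start) + 1
        return f"{filename}:{line}: invalid UTF-8 sequence"
    return None


good = b"// plain ASCII comment\nint x;\n"
bad = b"int x;\n// caf\xe9 in Latin-1\n"

print(check_source(good))
print(check_source(bad))
```

Because the check runs over raw bytes before any tokenizing, it does not need to know where comments start or end.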
parent reply Thomas Kühne writes:
Anders F Björklund wrote:

| Vathix wrote:
|
|>>> I'm not sure if allowing non UTF characters in comments is such a
|>>> good idea.
|
|
|>> Are you trying to use 2 different encodings in one file?
|>
|>
|> Looks like DMD allows that in comments and I don't think it's a good
|> idea.
|
|
| The current lexer just skips all bytes in comments,
| until it finds the end of the current comment run.

[snip]

| It *should* just stop dead when it finds one, e.g.:
| error("invalid UTF-8 sequence");

I don't think the compiler should try to check the comment's content.

What is an "invalid" UTF-8 sequence?

How would you e.g. handle Java's pre 1.5 "customised" UTF-8?
(encoding >U-FFFF as UTF-16 surrogates encoded in 2 UTF-8 codepoints)

- Granted, we might agree on overlong sequences, but how about
unassigned codepoints?
- Has the input to be normalized? What normalization?
- Are you going to enforce the full Unicode spec? What spec version?
- How about the PUA?
- How about >U-11FFFD?
- Is U-FFFD/U-FFFC allowed?

Thomas


Jan 29 2005
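
Some of the cases Thomas raises can be probed with one strict decoder. The sketch below uses Python's built-in UTF-8 codec as a stand-in; other decoders, DMD's included, may draw the line differently, which is the point of his questions. The sample byte strings are constructed for this example.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 codec accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = {
    # Overlong two-byte form of "/" (U+002F).
    "overlong": b"\xc0\xaf",
    # U+10000 written as two 3-byte encoded surrogates (D800, DC00),
    # the style of output Thomas attributes to pre-1.5 Java.
    "surrogate pair": b"\xed\xa0\x80\xed\xb0\x80",
    # The same code point in its regular 4-byte form.
    "four-byte form": "\U00010000".encode("utf-8"),
    # A noncharacter and a private-use code point.
    "U+FFFF": "\uffff".encode("utf-8"),
    "PUA U+E000": "\ue000".encode("utf-8"),
}

results = {name: is_strict_utf8(data) for name, data in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

This codec rejects the overlong and surrogate forms but accepts U+FFFF and private-use code points, so "invalid" here is a narrower notion than "unassigned" or "not normalized".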
parent reply Anders F Björklund <afb algonet.se> writes:
Thomas Kühne wrote:

 | It *should* just stop dead when it finds one, e.g.:
 | error("invalid UTF-8 sequence");
 
 I don't think the compiler should try to check the comment's content.
Why not ? It checks the rest of the file...
 What is an "invalid" UTF-8 sequence?
I just think it should treat comments the
same way it treats identifiers and literals ?

That is, call: utf_decodeChar and follow
whatever error that it returns... (utf.c)

--anders
Jan 29 2005
parent reply Thomas Kühne writes:
Anders F Björklund wrote:

| Thomas Kühne wrote:
|
|> | It *should* just stop dead when it finds one, e.g.:
|> | error("invalid UTF-8 sequence");
|>
|> I don't think the compiler should try to check the comment's content.
|
|
| Why not ? It checks the rest of the file...
|
|> What is an "invalid" UTF-8 sequence?
|
| I just think it should treat comments the
| same way it treats identifiers and literals ?
|
| That is, call: utf_decodeChar and follow
| whatever error that it returns... (utf.c)

The current checks for identifiers are:
1) shortest possible byte sequence for UTF-8
OK

2) no lone surrogate part
That might clash with pre 1.5 Java output.
This is a Java bug, thus can be ignored.

3) c <= 0x10FFFF
OK

4) c != 0xFFFE && c != 0xFFFF
That's the only check I reject. Those codepoints can occur if a
non-Unicode document is converted to UTF encoded Unicode. Inside of
comments they shouldn't stop the parsing.

Those checks above are - except for the 4th - reasonable for comments.

Thomas


Jan 29 2005
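
The four checks Thomas lists can be sketched as one decode routine. This is a minimal Python illustration, not the `utf_decodeChar` code from DMD's utf.c; `decode_char` and its `in_comment` flag are hypothetical names, and relaxing only the fourth check inside comments follows his suggestion.

```python
def decode_char(buf: bytes, idx: int, in_comment: bool = False):
    """Decode one code point from buf at idx; return (code_point, next_idx).

    Raises ValueError("invalid UTF-8 sequence") when a check fails.
    """
    err = ValueError("invalid UTF-8 sequence")
    b0 = buf[idx]
    if b0 < 0x80:
        return b0, idx + 1                 # plain ASCII
    if 0xC0 <= b0 < 0xE0:
        n, c, lowest = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 < 0xF0:
        n, c, lowest = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 < 0xF8:
        n, c, lowest = 4, b0 & 0x07, 0x10000
    else:
        raise err                          # stray continuation or bad lead byte
    if idx + n > len(buf):
        raise err                          # truncated sequence
    for b in buf[idx + 1 : idx + n]:
        if (b & 0xC0) != 0x80:
            raise err                      # not a continuation byte
        c = (c << 6) | (b & 0x3F)
    if c < lowest:
        raise err                          # check 1: not the shortest form
    if 0xD800 <= c <= 0xDFFF:
        raise err                          # check 2: surrogate code point
    if c > 0x10FFFF:
        raise err                          # check 3: beyond the Unicode range
    if c in (0xFFFE, 0xFFFF) and not in_comment:
        raise err                          # check 4: relaxed inside comments
    return c, idx + n


print(decode_char("ö".encode("utf-8"), 0))
print(decode_char(b"\xef\xbf\xbf", 0, in_comment=True))
```

Keeping the relaxation behind a flag means identifiers and literals still get all four checks, while a comment containing U+FFFE or U+FFFF no longer stops the parse.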
parent reply Anders F Björklund <afb algonet.se> writes:
Thomas Kühne wrote:

 4) c != 0xFFFE && c != 0xFFFF
 That's the only check I reject. Those codepoints can occur if a
 non-Unicode document is converted to UTF encoded Unicode. Inside of
 comments they shouldn't stop the parsing.
 
 Those checks above are - except for the 4th - reasonable for comments.
If needed, that can be hacked around for comments, for those two.
 	s = utf_decodeChar(octet, ndigits, &idx, &c);
 	if (s || idx != ndigits)
can be changed into:

	s = utf_decodeChar(octet, ndigits, &idx, &c);
	if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)

Would that make it more reasonable ? (have the DMD patch ready...)

--anders
Jan 29 2005
next sibling parent Anders F Björklund <afb algonet.se> writes:
I wrote:

 Looks like DMD allows that in comments and I don't think it's a good
 idea.
[...]
 Would that make it more reasonable ? (have the DMD patch ready...)
Hilarious, the new patch made phobos fail:
 ../gcc-3.4.3/gcc/d/phobos/std/loader.d:62: invalid UTF-8 sequence
Due to this little comment line, from GDC:
    Modified by David Friedman, October 2004 (applied patches from Anders F
Björklund.)
(as the ö here was in Latin-1, you see...)

--anders
Jan 29 2005
prev sibling parent reply Thomas Kühne writes:
Anders F Björklund wrote:
| Thomas Kühne wrote:
|
|> 4) c != 0xFFFE && c != 0xFFFF That's the only check I reject. Those
|> codepoints can occur if a non-Unicode document is converted to UTF
|> encoded Unicode. Inside of comments they shouldn't stop the
|> parsing.
|>
|> Those checks above are - except for the 4th - reasonable for
|> comments.
|
|
| If needed, that can be hacked around for comments, for those two.
|
|> s = utf_decodeChar(octet, ndigits, &idx, &c);
|> if (s || idx != ndigits)
|
|
| can be changed into:
|
| s = utf_decodeChar(octet, ndigits, &idx, &c);
| if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
|
| Would that make it more reasonable ? (have the DMD patch ready...)

Have a look at utf_decodeChar:
dmd/utf.c:92 and dmd/utf.c:183 ;)

While looking through utf.c I noticed that UTF-32 decoding doesn't
undergo any checks. I'll write a bunch of test cases for all those
encoding issues tomorrow.

Thomas

Jan 29 2005
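
Thomas does not say which checks he has in mind for UTF-32. Assuming they would mirror the range and surrogate checks discussed for UTF-8, a validation pass could look like the Python sketch below; `check_utf32le` is a hypothetical helper for little-endian input only and is not taken from utf.c.

```python
import struct


def check_utf32le(data: bytes) -> bool:
    """True if data is a sequence of valid little-endian UTF-32 code units."""
    if len(data) % 4 != 0:
        return False                       # truncated code unit
    for (c,) in struct.iter_unpack("<I", data):
        if c > 0x10FFFF:
            return False                   # beyond the Unicode range
        if 0xD800 <= c <= 0xDFFF:
            return False                   # surrogates are not valid in UTF-32
    return True


print(check_utf32le("Kühne".encode("utf-32-le")))
print(check_utf32le(struct.pack("<I", 0x110000)))
```

Since each UTF-32 code unit is a full code point, there is no shortest-form check to make; only the value itself needs to be tested.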
parent Anders F Björklund <afb algonet.se> writes:
Thomas Kühne wrote:

 | can be changed into:
 |
 | s = utf_decodeChar(octet, ndigits, &idx, &c);
 | if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
 |
 | Would that make it more reasonable ? (have the DMD patch ready...)
 
 Have a look at utf_decodeChar:
 dmd/utf.c:92 and dmd/utf.c:183 ;)
Yes, the idea is that it will not be valid and return string
"invalid UTF-8 sequence", which is then ignored because the
char is FFFE/F... (all input is converted to UTF-8 before lexer)

The patch is in the digitalmars.D.bugs group.

--anders
Jan 29 2005
prev sibling next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
What does it mean "non UTF character" ?
UTFs are the forms of representing/encoding full UNICODE table - 21-bit 
characters (code points).

So "non UTF character" sounds for me as "non UNICODE character". And what is 
that?
Some new alphabet?

Andrew Fedoniouk.
http://terrainformatica.com



"Vathix" <vathix dprogramming.com> wrote in message 
news:opsldc5xwrkcck4r esi...
 I'm not sure if allowing non UTF characters in comments is such a good 
 idea. It seems to be complicating my parser, and it will probably 
 complicate other things like text/code editors. What is supposed to happen 
 when a non UTF character is encountered? Should it display a question 
 mark, display nothing, use the current code page? What if the editor 
 doesn't know about D's comments?
 I might not have mentioned this, but since D is supposed to be easily 
 parsed, this might be an issue; a special case.
 - Chris 
Jan 29 2005
next sibling parent reply Sebastian Beschke <s.beschke gmx.de> writes:
Andrew Fedoniouk wrote:
 What does it mean "non UTF character" ?
 UTFs are the forms of representing/encoding full UNICODE table - 21-bit 
 characters (code points).
 
 So "non UTF character" sounds for me as "non UNICODE character". And what is 
 that?
 Some new alphabet?
Invalid sequences *are* possible by using codepoints in the table that
aren't defined, or by misforming UTF-8 or UTF-16 sequences.

-Sebastian
Jan 29 2005
next sibling parent Anders F Björklund <afb algonet.se> writes:
Sebastian Beschke wrote:

 Invalid sequences *are* possible by using codepoints in the table that 
 aren't defined, or by misforming UTF-8 or UTF-16 sequences.
A simple way to do it is to try to interpret a file in Latin-1 as UTF-8.
That'll give you "invalid UTF-8 sequence", for everything outside ASCII.

--anders
Jan 29 2005
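
Anders' recipe is easy to reproduce. The following Python sketch encodes the same name both ways and then tries to read the Latin-1 bytes as UTF-8; the sample string is chosen for this example.

```python
name = "Björklund"

latin1_bytes = name.encode("latin-1")   # "ö" becomes the single byte 0xF6
utf8_bytes = name.encode("utf-8")       # "ö" becomes the two bytes 0xC3 0xB6

try:
    latin1_bytes.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "invalid UTF-8 sequence"

print(latin1_bytes, utf8_bytes, outcome)
```

Text that stays within ASCII is byte-for-byte identical in both encodings, so only files with accented or other non-ASCII characters trip the error.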
prev sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
I see. Thanks, Sebastian.

But text with erroneous utf sequences (not "non UTF character", sic! ) will 
not be compiled anyway.

So for "... other things like text/code editors. What is supposed to happen
when a non UTF character is encountered?..." editor (good one) should mark
them as "bad string literal" or the like.

Andrew Fedoniouk.
http://terrainformatica.com


"Sebastian Beschke" <s.beschke gmx.de> wrote in message 
news:ctgkta$48a$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 What does it mean "non UTF character" ?
 UTFs are the forms of representing/encoding full UNICODE table - 21-bit 
 characters (code points).

 So "non UTF character" sounds for me as "non UNICODE character". And what 
 is that?
 Some new alphabet?
Invalid sequences *are* possible by using codepoints in the table that
aren't defined, or by misforming UTF-8 or UTF-16 sequences.
-Sebastian
Jan 29 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Andrew Fedoniouk wrote:

 But text with erroneous utf sequences (not "non UTF character", sic! ) will 
 not be compiled anyway.
A bug in the current DMD makes it allow almost everything, in comments.

--anders
Jan 29 2005
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 A bug in the current DMD makes it allow almost everything, in comments.
It's a feature rather than a bug.
Preparation for attributed programming I guess. With option to include
binary data inline :)
I can imagine properties/methods having their own descriptive GIFs
given in source text as bytes. Le Cauchemar!

BTW: Are there any ports of png/jpeg/gif libs in D?

Andrew Fedoniouk.
http://terrainformatica.com
Jan 29 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Andrew Fedoniouk wrote:

 Preparation for attributed programming I guess. With option to include 
 binary data inline :)
:-) No, it's a bug. D source code is supposed to be valid UTF-8/16/32.

Ideally, the HTML used should be made to be valid XHTML as well...

--anders
Jan 29 2005
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 Ideally, the HTML used should be made to be valid XHTML as well...
Oh, dreams, dreams... this famous XHTML will never be a "lingua franca" as
HTML is.
The first browser to enforce showing *only valid* or even only
well-formed docs will die in the market for many reasons.
Even the XHTML standard says that a UA (user agent - browser) should try to
show partial content.
Strictly speaking, partial content of an XML doc is not well-formed XML and
thus invalid.
So browsers will show invalid XHTML anyway.
So there will not be a strong motivation to use valid XHTML, so XHTML
will lose its 'X' and become just HTML v.5,6, etc....

And I haven't even mentioned CSS grammar, where whitespace is an operator
with different meaning in different places.... Parsing nightmare. I love
this game...

What kind of grammar does D use, is it context free, BTW?

Andrew Fedoniouk.
http://terrainformatica.com
Jan 29 2005
parent Anders F Björklund <afb algonet.se> writes:
Andrew Fedoniouk wrote:

 Oh, dreams, dreams... this famous XHTML will never be a "lingua franca" as 
 is HTML.
XHTML *is* both HTML and XML at once, which is why it is so useful...
Just as UTF-8 is both ASCII and Unicode at once, best of both worlds.
 First browser which will enforce showing *only valid* or even only 
 well-formed docs will die for the market for many reasons.
Just because certain browsers display it, is no reason for nonvalid markup.
And it's easy to verify too, using http://validator.w3.org/ ?

--anders
Jan 30 2005
prev sibling parent reply Derek <derek psyc.ward> writes:
On Sun, 30 Jan 2005 01:11:47 +0100, Anders F Björklund wrote:

 Andrew Fedoniouk wrote:
 
 Preparation for attributed programming I guess. With option to include 
 binary data inline :)
:-) No, it's a bug. D source code is supposed to be valid UTF-8/16/32.
Ideally, the HTML used should be made to be valid XHTML as well...
--anders
One could argue that comments are not actual D source code ;-)
Consider: does the compiler need the comments to create the application?
If the comments are not for the compiler, then why should it care what's in
the comments?

--
Derek
Melbourne, Australia
Jan 29 2005
parent Anders F Björklund <afb algonet.se> writes:
Derek wrote:

 One could argue that comments are not actual D source code ;-) 
 Consider: does the compiler need the comments to create the application?
The source code is not just for the compiler. (I meant "compiler input")
 If the comments are not for the compiler, then why should it care what's in
 the comments?
It makes other stuff like parsers easier, if all .d files are valid UTF.

--anders
Jan 30 2005
prev sibling parent Vathix <vathix dprogramming.com> writes:
 So "non UTF character" sounds for me as "non UNICODE character". And  
 what is
 that?
 Some new alphabet?
A value in the file that causes std.utf functions to throw an exception because it's invalid. I'm not good at this stuff and I don't know all the proper terminology.
Jan 29 2005
prev sibling next sibling parent "Walter" <newshound digitalmars.com> writes:
"Vathix" <vathix dprogramming.com> wrote in message
news:opsldc5xwrkcck4r esi...
 I'm not sure if allowing non UTF characters in comments is such a good
 idea. It seems to be complicating my parser, and it will probably
 complicate other things like text/code editors. What is supposed to happen
 when a non UTF character is encountered? Should it display a question
 mark, display nothing, use the current code page? What if the editor
 doesn't know about D's comments?
 I might not have mentioned this, but since D is supposed to be easily
 parsed, this might be an issue; a special case.
 - Chris
Technically it's an error to have non-UTF characters anywhere in the source.
Jan 29 2005
prev sibling parent Brian Chapman <nospam-for-brian see-post-for-address.net> writes:
On 2005-01-29 08:57:23 -0600, Vathix <vathix dprogramming.com> said:

 I'm not sure if allowing non UTF characters in comments is such a good  
 idea. It seems to be complicating my parser, and it will probably  
 complicate other things like text/code editors. What is supposed to 
 happen  when a non UTF character is encountered? Should it display a 
 question  mark, display nothing, use the current code page? What if the 
 editor  doesn't know about D's comments?
 I might not have mentioned this, but since D is supposed to be easily  
 parsed, this might be an issue; a special case.
 - Chris
I don't know what your situation/environment is, but you could just try
piping the source files through the GNU iconv utility to convert the
files to whatever encoding you want.
Jan 29 2005
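
Where iconv is not at hand, the same conversion can be sketched in a few lines of Python. The helper name `transcode` is made up for this example, and it assumes the source encoding is already known to be ISO-8859-1; it corresponds roughly to `iconv -f ISO-8859-1 -t UTF-8`.

```python
def transcode(data: bytes, src: str = "iso-8859-1", dst: str = "utf-8") -> bytes:
    """Re-encode data from the src encoding to the dst encoding."""
    return data.decode(src).encode(dst)


latin1_source = b"// patches from Anders F Bj\xf6rklund\n"
utf8_source = transcode(latin1_source)

print(utf8_source)
```

Decoding as ISO-8859-1 never fails, because every byte value maps to a character, so the sketch cannot detect a wrong guess about the source encoding; it would convert a file in some other encoding into valid but wrong UTF-8.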