digitalmars.D - string\utf question

Lars Ivar Igesund <larsivar igesund.net> writes:
I don't really know much about UTF strings, but since I am doing some processing of 
D files (e.g. ddepcheck), I suspect that I should know at least 
something before I proceed. lex.html states that identifiers can contain 
Universal Alphas. What is a universal alpha, and is it possible to 
check whether a character is a universal alpha using some function currently 
in Phobos (or will it come with std.utype)?

Also, I use File : Stream's toString to get the content of the file (I 
don't care whether this is the most efficient way to do it or not, since 
it makes the processing itself much simpler compared to reading line by 
line). File's toString returns a char [] no matter what, whereas the 
std.ctype functions all take dchars as inputs.
What's the recommended type to use (char, wchar, dchar)?
What's the recommended way to convert the char [] to the best type?

Lars Ivar Igesund
Aug 05 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...
I don't really know much about utf strings, but doing some processing of 
D files (e.g. ddepcheck), I suspect that I should know at least 
something before I proceed. lex.html states that identifiers can contain 
Universal Alphas. What is an universal alpha,

Google "ISO/IEC 9899:1999 (E)" or "ISO-C-FDIS.1999-04.pdf" and then head to Annex D on page 438. (Obvious really! :))
and is it possible to 
check if a character is an universal alpha using some function currently 
in Phobos (or will it come with std.utype)?

No and no. But it would be easy for me to add such a function to etc.unicode if you want. It would end up being called isUniversalAlpha(dchar).
Also, I use File : Stream's toString to get the content of the file (I 
don't care whether this is the most efficient way to do it or not, since 
it makes the processing itself much simpler compared to reading line by 
line). File's toString returns a char [] no matter what, whereas the 
std.ctype functions all take dchars as inputs.
What's the recommended type to use (char, wchar, dchar)?
What's the recommended way to convert the char [] to the best type?

That's an application-dependent question, but personally I'd just do std.utf.toUTF32(char[]).

Arcane Jill
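A minimal sketch of that conversion, assuming a current Phobos in which std.utf.toUTF32 accepts a char[] and std.uni.isAlpha takes a dchar. The helper name countAlpha is made up for illustration and is not part of any library.

```d
import std.stdio : writeln;
import std.uni : isAlpha;
import std.utf : toUTF32;

// Hypothetical helper (not part of Phobos): counts the code points in
// UTF-8 text that have the Unicode alphabetic property.
size_t countAlpha(const(char)[] utf8)
{
    // Decode once, so that each array element is one whole code point.
    auto utf32 = toUTF32(utf8);
    size_t n = 0;
    foreach (dchar c; utf32)
    {
        if (isAlpha(c))
            ++n;
    }
    return n;
}

void main()
{
    const(char)[] text = "e\u00F1e 42";  // "eñe 42"
    writeln(text.length);                // 7: UTF-8 code units, ñ takes two
    writeln(toUTF32(text).length);       // 6: code points
    writeln(countAlpha(text));           // 3: e, ñ, e
}
```

Working on the decoded array means indexing and slicing operate on whole characters, which is what per-character classification functions expect. If only a single pass is needed, iterating the char[] directly with foreach (dchar c; text) also decodes on the fly without allocating a second array.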
Aug 05 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...

Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...
No and no. But it would be easy for me to add such a function to etc.unicode if
you want. It would end up being called isUniversalAlpha(dchar).

Thinking about this logically, isUniversalAlpha() would be a custom property, not actually part of the Unicode standard, so it probably doesn't really belong in etc.unicode. In fact, it probably belongs in std.compiler. I could invent etc.compiler and put it there, where it could stay until (if) Walter moves it. Would that make more sense?

Jill (trying to stay organized)
Aug 05 2004
Lars Ivar Igesund <larsivar igesund.net> writes:
Arcane Jill wrote:

 In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
 
In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...

Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

That is indeed curious; my system clock is only 5 minutes fast, but the post is in my sent folder with the 9:41 time. OK, this post should get the time 16:50 (GMT+1) if everything's correct.
 
No and no. But it would be easy for me to add such a function to etc.unicode if
you want. It would end up being called isUniversalAlpha(dchar).

Thinking about this logically, isUniversalAlpha() would be a custom property, not actually part of the Unicode standard, so it probably doesn't really belong in etc.unicode. In fact, it probably belongs in std.compiler. I could invent etc.compiler and put it there, where it could stay until (if) Walter moves it. Would that make more sense?

Well, maybe I can write that one myself and just keep it in my project until it is included in Phobos (unless it is very complicated). Currently I have no need for (other) external libraries. Anyway, thanks for the answers.

Lars Ivar Igesund
Aug 05 2004
Lars Ivar Igesund <larsivar igesund.net> writes:
Lars Ivar Igesund wrote:

 Arcane Jill wrote:
 
 In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...

 In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...

Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

That is indeed curious, my system clock is only 5 minutes quick, but the post is in my sent folder with the 9:41 time. Ok, this post should get the time 16:50 (GMT+1) if everythings correct.

Shows up correctly here, so something is strange somewhere (or was when I posted that message this morning). Many posts on the newsgroup have strange times (IMO), possibly because time zones are handled differently in different clients (and maybe on the server). Also, your message shows up with the time 9:06 (GMT+1, that is) in my Thunderbird. Maybe discrepancies pop up when some time passes between a message being sent and being accepted at the server, combined with time zone differences.

Well, I don't know, the *real* answer is probably that you have a time machine and went back in time to answer my questions.

Lars Ivar Igesund
Aug 05 2004
Lars Ivar Igesund <larsivar igesund.net> writes:
Arcane Jill wrote:

 
 Thinking about this logically, isUniversalAlpha() would be a custom property,
 not actually part of the Unicode standard, so it probably doesn't really belong
 in etc.unicode. In fact, it probably belongs in std.compiler. I could invent
 etc.compiler and put it there, where it could stay until (if) Walter moves it.
 Would that make more sense?
 
 Jill (trying to stay organized)

I looked at the document you pointed out (together with the docs...), but the ranges there include Digits, and digits aren't among the Universal Alphas allowed as an IdentifierStart. Sorry for acting stupid here, but are Digits and Special Characters from Annex D part of the Universal Alphas, or are there unmentioned exceptions?

Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.

Lars Ivar Igesund
Aug 05 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...

I looked at the document you pointed out (together with the docs...), 
but the ranges there include Digits, and digits aren't part of the 
Universal Alphas allowed to use as an IdentifierStart.

Well, digits are allowed in identifiers - just not at the start.
Sorry for acting 
stupid here, but are Digits and Special Characters from Annex D part of 
the Universal Alphas,

Yes, they are.
or are there unmentioned exceptions?

Not so far as I am aware. In http://www.digitalmars.com/d/lex.html, it says:

"Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.) Identifiers can be arbitrarily long, and are case sensitive. Identifiers starting with __ (two underscores) are reserved."

So it looks like "universal alphas" are not actually permitted as the /first/ character of a D identifier, only as the second or subsequent char. The /first/ character is apparently allowed to be a "unicode alpha", not a "universal alpha".

Of course, this begs the question "what is a unicode alpha"? The docs don't define this. Almost certainly, Walter means "a Unicode character which has the Alphabetic property", but I don't know that for sure. I would suggest this needs to be clarified in the documentation.

The definition should also state /which version/ of Unicode the D compiler uses, since new "unicode alphas" will be added with each new version of Unicode. (Otherwise you could end up in the curious state whereby new Unicode letters would be allowed at the start of an identifier but not in the middle or end!) Moreover, you probably wouldn't /want/ the definition of an identifier to change with each new release of Unicode.

Suggestion to Walter: you could redefine an identifier start to be: "an ASCII letter, underscore, or any universal alpha which has the Unicode Alphabetic property" (and, obviously, ensure that this definition is met, which it probably is already). Now you don't need to state a Unicode version number, because you're dealing only with a fixed and stable subset of Unicode.
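The suggested identifier-start rule can be sketched roughly as follows. This is only an illustration under stated assumptions: the range table is a tiny excerpt recalled from C99 Annex D and should be checked against the standard before any real use, and every name here (CodeRange, annexDExcerpt, isUniversalAlpha, isIdentifierStart, isIdentifierChar, isIdentifier) is hypothetical rather than an existing Phobos or etc.unicode function. It also relies on std.uni.isAlpha from a current Phobos to stand in for "has the Unicode Alphabetic property".

```d
import std.ascii : isAsciiAlpha = isAlpha, isAsciiDigit = isDigit;
import std.uni : isUniAlpha = isAlpha;

// One inclusive range of code points.
struct CodeRange { dchar lo; dchar hi; }

// ASSUMPTION: a small, incomplete excerpt of the Annex D ranges, recalled
// from memory for illustration only. Verify against ISO/IEC 9899:1999.
immutable CodeRange[] annexDExcerpt = [
    CodeRange('\u00C0', '\u00D6'), // Latin
    CodeRange('\u00D8', '\u00F6'), // Latin (contains ñ, U+00F1)
    CodeRange('\u0660', '\u0669'), // Digits (Arabic-Indic)
    CodeRange('\u4E00', '\u9FA5'), // CJK unified ideographs
];

// Hypothetical: true if c falls in one of the listed ranges.
bool isUniversalAlpha(dchar c)
{
    foreach (r; annexDExcerpt)
    {
        if (c >= r.lo && c <= r.hi)
            return true;
    }
    return false;
}

// The suggested rule: ASCII letter, underscore, or a universal alpha
// that also has the Unicode Alphabetic property.
bool isIdentifierStart(dchar c)
{
    return c == '_' || isAsciiAlpha(c)
        || (isUniversalAlpha(c) && isUniAlpha(c));
}

// Later characters may additionally be ASCII digits or any universal alpha.
bool isIdentifierChar(dchar c)
{
    return c == '_' || isAsciiAlpha(c) || isAsciiDigit(c)
        || isUniversalAlpha(c);
}

bool isIdentifier(const(char)[] s)
{
    bool first = true;
    foreach (dchar c; s) // decodes the UTF-8 input one code point at a time
    {
        immutable ok = first ? isIdentifierStart(c) : isIdentifierChar(c);
        if (!ok)
            return false;
        first = false;
    }
    return !first; // an empty string is not an identifier
}

void main()
{
    import std.stdio : writeln;
    writeln(isIdentifier("e\u00F1e")); // true
    writeln(isIdentifier("x\u0661"));  // true: Annex D digit after the start
    writeln(isIdentifier("\u0661x"));  // false: digit lacks the Alphabetic property
    writeln(isIdentifier("9lives"));   // false
}
```

Because the start rule intersects a fixed table with the Alphabetic property, a non-ASCII digit listed in the table is accepted inside an identifier but rejected at the front, which is the behaviour the suggestion is aiming for.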
Also, trying 
to add any of these characters to my identifier names using Vim's 
hexadecimal mode fails miserably, but that's probably an error on my side.

I know nothing about vim.
Aug 05 2004
Lars Ivar Igesund <larsivar igesund.net> writes:
Arcane Jill wrote:

 In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...
 
 
I looked at the document you pointed out (together with the docs...), 
but the ranges there include Digits, and digits aren't part of the 
Universal Alphas allowed to use as an IdentifierStart.

Well, digits are allowed in identifiers - just not at the start.
Sorry for acting 
stupid here, but are Digits and Special Characters from Annex D part of 
the Universal Alphas,

Yes, they are.
or are there unmentioned exceptions?

Not so far as I am aware. In http://www.digitalmars.com/d/lex.html, it says: "Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.) Identifiers can be arbitrarily long, and are case sensitive. Identifiers starting with __ (two underscores) are reserved." So it looks like "universal alphas" are not actually permitted as the /first/ character of a D identifier, only as the second or subsequent char. The /first/ character is apparently allowed to be a "unicode alpha", not a "universal alpha". Of course, this begs the question "what is a unicode alpha"? The docs don't define this. Almost certainly, Walter means "a Unicode character which has the Alphabetic property", but I don't know that for sure. I would suggest this needs to be clarified in the documentation. The definition should also state /which version/ of Unicode the D compiler uses, since new "unicode alphas" will be added with each new version of Unicode. (Otherwise you could end up in the curious state whereby new Unicode letters would be allowed at the start of an identifier but not in the middle or end!) Moreover, you probably wouldn't /want/ the definition of an identifier to change with each new release of Unicode. Suggestion to Walter: you could redefine an identifier start to be: "an ASCII letter, underscore, or any universal alpha which has the Unicode Alphabetic property" (and, obviously, ensure that this definition is met, which it probably is already). Now you don't need to state a Unicode version number, because you're dealing only with a fixed and stable subset of Unicode.

Hmm, that didn't answer anything, you just came to the same conclusion as me regarding the somewhat lacking documentation :) Another point, above the text you quoted,
"
IdentifierStart:
    _
    Letter
    UniversalAlpha
"

Note that there is no mention of Unicode Alpha, neither there nor elsewhere, except in the excerpt you mentioned.

Walter, an obvious bug in the documentation has been found. A fix would be received with joyous celebrations across the globe (or at least in an axis between me and Jill). And a clarification in this thread :)
Also, trying 
to add any of these characters to my identifier names using Vim's 
hexadecimal mode fails miserably, but that's probably an error on my side.

I know nothing about vim.

Well, what I did was add characters from the list to identifiers, but dmd didn't accept them. It is possible that the file wasn't saved in the correct format, but that didn't look like the problem. I might look at it again later when I have something to test.

Lars Ivar Igesund
Aug 05 2004
J C Calvarese <jcc7 cox.net> writes:

Lars Ivar Igesund wrote:
 Arcane Jill wrote:
 
 In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...


 Hmm, that didn't answer anything, you just came to the same conclusion 
 as me regarding the somewhat lacking documentation :) Another point, 
 above the text you quoted,
 "
 IdentifierStart:
     _
     Letter
     UniversalAlpha
 "
 
 Note that there is no mention of Unicode Alpha, neither there or 
 elsewhere except in the excerpt you mentioned.

I think when he mentioned "Unicode Alpha", he meant "UniversalAlpha".
 
 Walter, obvious bug in documentation has been found. A fix would be 
 received with joyous celebrations across the globe (or at least in an 
 axis between me and Jill). And a clarification in this thread :)
 
 Also, trying to add any of these characters to my identifier names 
 using Vim's hexadecimal mode fails miserably, but that's probably an 
 error on my side.

I know nothing about vim.

Well, what I did, was to add characters from the list as part of identifiers, but dmd didn't accept them. It is possible that the file wasn't saved in the correct format, but it didn't look like the problem. I might look at again later when I have something to test. Lars Ivar Igesund

I don't know about your test, but I got D to accept a Spanish letter (ñ) and a Chinese character (義) as identifiers. In case this stuff gets garbled in transmission, I attached a .zip.

const char[] ñ = "eñe";
const char[] 義 = "justice";

import std.stdio;

void main()
{
    writefln("Feliz Cumpleaños.");
    writefln(ñ);
    writefln(義);
    /* It doesn't print right, but that's probably DOS's fault. */
}

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Aug 05 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceu2eq$21vs$1 digitaldaemon.com>, Lars Ivar Igesund says...

Well, what I did, was to add characters from the list as part of 
identifiers, but dmd didn't accept them. It is possible that the file 
wasn't saved in the correct format, but it didn't look like the problem. 
I might look at again later when I have something to test.

I know a little about Unicode, but far less about compilers, and pretty much nothing at all about the D compiler. I'm afraid this question will have to be answered by someone else.

J C Calvarese said 'I think when he [Walter] mentioned "Unicode Alpha", he meant "UniversalAlpha".' If this is true, it would seem strange (to me). It would imply that non-ASCII-digits are allowed as the first character of an identifier. Not that that's necessarily a /bad/ thing - just unexpected.

Jill
Aug 06 2004
J C Calvarese <jcc7 cox.net> writes:
In article <cevaqt$2u57$1 digitaldaemon.com>, Arcane Jill says...
In article <ceu2eq$21vs$1 digitaldaemon.com>, Lars Ivar Igesund says...

Well, what I did, was to add characters from the list as part of 
identifiers, but dmd didn't accept them. It is possible that the file 
wasn't saved in the correct format, but it didn't look like the problem. 
I might look at again later when I have something to test.

I know a little about Unicode, but far less about compilers, and pretty much nothing at all about the D compiler. I'm afraid this question will have to be answered by someone else. J C Calvarses said 'I think when he [Walter] mentioned "Unicode Alpha", he meant "UniversalAlpha".' If this is true, it would seem strange (to me). It would imply that non-ASCII-digits are allowed as the first character of an identifier. Not that that's necessarily a /bad/ thing - just unexpected. Jill

(Did you happen to scroll down to my example in digitalmars.D/8333?)

It's definitely allowed. I was able to use a _Chinese character_ as an identifier. It was the first character. It was the last character. It was the only character.

I can upload the example to my web site if you can't get the example from the post.

jcc7
Aug 06 2004
"Martin M. Pedersen" <martin moeller-pedersen.dk> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cevaqt$2u57$1 digitaldaemon.com...
 If this is true, it would seem strange (to me). It would imply that
 non-ASCII-digits are allowed as the first character of an identifier. Not
 that that's necessarily a /bad/ thing - just unexpected.

C allows it, and so must D to be link-compatible. The relevant C grammar is:

identifier:
    identifier-nondigit
    identifier identifier-nondigit
    identifier digit

identifier-nondigit:
    nondigit
    universal-character-name
    other implementation-defined characters

universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad

hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

Constraints

A universal character name shall not specify a character short identifier in the range 00000000 through 00000020, 0000007F through 0000009F, or 0000D800 through 0000DFFF inclusive. A universal character name shall not designate a character in the required character set.

Description

Universal character names may be used in identifiers, character constants, and string literals to designate characters that are not in the required character set.

Semantics

The universal character name \Unnnnnnnn designates the character whose character short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal character name \unnnn designates the character whose character short identifier is 0000nnnn.

Regards,
Martin M. Pedersen
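As a rough illustration of the grammar and the range constraint as quoted in this post, the following sketch parses a universal character name and checks its value. The function names parseUcn and passesQuotedRangeConstraint are made up for this example, it assumes std.conv.parse with a radix argument from a current Phobos, and it does not attempt the second constraint (characters in the required character set).

```d
import std.conv : parse;

// Hypothetical helper: turns the spelling of a universal character name,
// \u hex-quad or \U hex-quad hex-quad, into its numeric value.
uint parseUcn(string ucn)
{
    immutable isShort = ucn.length == 6 && ucn[0 .. 2] == r"\u";
    immutable isLong = ucn.length == 10 && ucn[0 .. 2] == r"\U";
    if (!isShort && !isLong)
        throw new Exception("not a universal character name: " ~ ucn);

    auto digits = ucn[2 .. $];
    // parse consumes hexadecimal digits from the front of `digits`.
    immutable cp = parse!uint(digits, 16);
    if (digits.length != 0) // something other than a hex digit was left over
        throw new Exception("bad hex digit in: " ~ ucn);
    return cp;
}

// Hypothetical helper: applies only the three excluded ranges exactly as
// they are quoted above.
bool passesQuotedRangeConstraint(uint cp)
{
    if (cp <= 0x20)
        return false;
    if (cp >= 0x7F && cp <= 0x9F)
        return false;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;
    return true;
}

void main()
{
    import std.stdio : writeln;
    writeln(parseUcn(r"\u00F1"));                                // 241 (ñ)
    writeln(parseUcn(r"\U00007FA9"));                            // 32681 (義)
    writeln(passesQuotedRangeConstraint(parseUcn(r"\u00F1")));   // true
    writeln(passesQuotedRangeConstraint(parseUcn(r"\uD800")));   // false: surrogate
}
```

Keeping the parse step separate from the range check mirrors the split between the grammar and the Constraints paragraph in the quoted text.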
Aug 06 2004
"Walter" <newshound digitalmars.com> writes:
"Lars Ivar Igesund" <larsivar igesund.net> wrote in message
news:ceu2eq$21vs$1 digitaldaemon.com...
 Walter, obvious bug in documentation has been found. A fix would be
 received with joyous celebrations across the globe (or at least in an
 axis between me and Jill). And a clarification in this thread :)

I'll fix it.
Aug 06 2004
J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
 In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
 
In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...

Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

The web interface runs on tachyons. ;)

Actually, I've seen this happen before. I think it's related to the delay in time before a message appears on the web when it's posted through the web interface. The order looks normal if you're viewing through Thunderbird. Go figure.

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Aug 05 2004