digitalmars.D - string\utf question


Lars Ivar Igesund <larsivar igesund.net> writes:

I don't really know much about utf strings, but doing some processing of 
D files (e.g. ddepcheck), I suspect that I should know at least 
something before I proceed. lex.html states that identifiers can contain 
Universal Alphas. What is a universal alpha, and is it possible to 
check if a character is a universal alpha using some function currently 
in Phobos (or will it come with std.utype)?

Also, I use File : Stream's toString to get the content of the file (I 
don't care whether this is the most efficient way to do it or not, since 
it makes the processing itself much simpler compared to reading line by 
line). File's toString returns a char [] no matter what, whereas the 
std.ctype functions all take dchars as inputs.
What's the recommended type to use (char, wchar, dchar)?
What's the recommended way to convert the char [] to the best type?

Lars Ivar Igesund

Aug 05 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...
I don't really know much about utf strings, but doing some processing of 
D files (e.g. ddepcheck), I suspect that I should know at least 
something before I proceed. lex.html states that identifiers can contain 
Universal Alphas. What is a universal alpha,

Google "ISO/IEC 9899:1999 (E)" or "ISO-C-FDIS.1999-04.pdf" and then head to
Annex D on page 438. (Obvious really! :))



and is it possible to 
check if a character is a universal alpha using some function currently 
in Phobos (or will it come with std.utype)?

No and no. But it would be easy for me to add such a function to etc.unicode if
you want. It would end up being called isUniversalAlpha(dchar).



Also, I use File : Stream's toString to get the content of the file (I 
don't care whether this is the most efficient way to do it or not, since 
it makes the processing itself much simpler compared to reading line by 
line). File's toString returns a char [] no matter what, whereas the 
std.ctype functions all take dchars as inputs.
What's the recommended type to use (char, wchar, dchar)?
What's the recommended way to convert the char [] to the best type?

That's an application-dependent question, but personally I'd just do
std.utf.toUTF32(char[]).

Arcane Jill

Aug 05 2004
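
The conversion suggested above can be sketched as follows. This is a minimal sketch that assumes a current Phobos, where std.utf.toUTF32 accepts a string and returns a dstring (the thread itself talks about char[]). The helper name toCodePoints is hypothetical, and std.uni.isAlpha is used only to show a per-character test on dchar values.

```d
import std.stdio : writeln;
import std.uni : isAlpha;
import std.utf : toUTF32;

/// Hypothetical helper: decode UTF-8 text into one dchar per code point.
dstring toCodePoints(string utf8)
{
    return toUTF32(utf8);
}

void main()
{
    string text = "ñ = 1";              // UTF-8: 'ñ' occupies two code units
    dstring wide = toCodePoints(text);  // UTF-32: one element per code point

    writeln("UTF-8 code units: ", text.length);
    writeln("code points:      ", wide.length);

    // With dchar elements, character classification works per character.
    foreach (dchar c; wide)
        writeln(c, " alphabetic: ", isAlpha(c));
}
```

Working on the dstring keeps indexing and classification simple, at the cost of four bytes per character.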

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...

Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My
post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours
says 09:41:38 + 1 (which must be wrong). Curious. Anyway...


No and no. But it would be easy for me to add such a function to etc.unicode if
you want. It would end up being called isUniversalAlpha(dchar).

Thinking about this logically, isUniversalAlpha() would be a custom property,
not actually part of the Unicode standard, so it probably doesn't really belong
in etc.unicode. In fact, it probably belongs in std.compiler. I could invent
etc.compiler and put it there, where it could stay until (if) Walter moves it.
Would that make more sense?

Jill (trying to stay organized)

Aug 05 2004

Lars Ivar Igesund <larsivar igesund.net> writes:

Arcane Jill wrote:

 In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
 
In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...

 
 
 Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My
 post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours
 says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

That is indeed curious; my system clock is only 5 minutes fast, but the 
post is in my sent folder with the 9:41 time. OK, this post should get 
the time 16:50 (GMT+1) if everything's correct.

 
No and no. But it would be easy for me to add such a function to etc.unicode if
you want. It would end up being called isUniversalAlpha(dchar).

 
 
 Thinking about this logically, isUniversalAlpha() would be a custom property,
 not actually part of the Unicode standard, so it probably doesn't really belong
 in etc.unicode. In fact, it probably belongs in std.compiler. I could invent
 etc.compiler and put it there, where it could stay until (if) Walter moves it.
 Would that make more sense?

Well, maybe I can write that one myself and just keep it in my project 
until it is included in Phobos (unless it is very complicated). 
Currently I have no need for (other) external libraries. Anyway, thanks 
for the answers.

Lars Ivar Igesund

Aug 05 2004
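
A hand-rolled isUniversalAlpha(dchar) of the kind discussed above could be a simple range lookup. The sketch below is only an illustration of that shape: the three ranges are an excerpt recalled from C99 Annex D and are an assumption to be checked against the standard, the table is deliberately incomplete, and the names Range and universalAlphaExcerpt are hypothetical.

```d
import std.stdio : writeln;

/// Inclusive range of code points.
struct Range
{
    dchar lo;
    dchar hi;
}

// ILLUSTRATIVE EXCERPT ONLY (assumption): a real table must list every
// range from ISO/IEC 9899:1999 Annex D, verified against the standard.
immutable Range[] universalAlphaExcerpt = [
    Range('\u00C0', '\u00D6'),  // part of the Latin entries
    Range('\u00D8', '\u00F6'),  // part of the Latin entries
    Range('\u4E00', '\u9FA5'),  // CJK unified ideographs
];

/// True if c lies in one of the listed ranges.
bool isUniversalAlpha(dchar c)
{
    foreach (r; universalAlphaExcerpt)
    {
        if (c >= r.lo && c <= r.hi)
            return true;
    }
    return false;
}

void main()
{
    writeln(isUniversalAlpha('ñ'));
    writeln(isUniversalAlpha('義'));
    writeln(isUniversalAlpha('a'));  // ASCII letters are not listed in Annex D
}
```

With the full table a binary search over sorted ranges would be the natural next step; a linear scan is enough to show the idea.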

Lars Ivar Igesund <larsivar igesund.net> writes:

Lars Ivar Igesund wrote:

 Arcane Jill wrote:
 
 In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...

 In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...



 Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My
 post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours
 says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

 
 
 That is indeed curious; my system clock is only 5 minutes fast, but the 
 post is in my sent folder with the 9:41 time. OK, this post should get 
 the time 16:50 (GMT+1) if everything's correct.

Shows up correctly here, so something is strange somewhere (or was 
when I posted that message this morning). Many posts on the newsgroup 
have strange times (IMO), possibly because time zones are handled 
differently in different clients (and maybe on the server). Also, your 
message shows up with the time 9:06 (GMT+1, that is) in my Thunderbird. 
Maybe discrepancies pop up when some time passes between a message being 
sent and being accepted at the server, combined with time zone 
differences.

Well, I don't know, the *real* answer is probably that you have a time 
machine and went back in time to answer my questions.

Lars Ivar Igesund

Aug 05 2004

Lars Ivar Igesund <larsivar igesund.net> writes:

Arcane Jill wrote:

 
 Thinking about this logically, isUniversalAlpha() would be a custom property,
 not actually part of the Unicode standard, so it probably doesn't really belong
 in etc.unicode. In fact, it probably belongs in std.compiler. I could invent
 etc.compiler and put it there, where it could stay until (if) Walter moves it.
 Would that make more sense?
 
 Jill (trying to stay organized)

I looked at the document you pointed out (together with the docs...), 
but the ranges there include Digits, and digits aren't part of the 
Universal Alphas allowed as an IdentifierStart. Sorry for acting 
stupid here, but are Digits and Special Characters from Annex D part of 
the Universal Alphas, or are there unmentioned exceptions? Also, trying 
to add any of these characters to my identifier names using Vim's 
hexadecimal mode fails miserably, but that's probably an error on my side.

Lars Ivar Igesund

Aug 05 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...

I looked at the document you pointed out (together with the docs...), 
but the ranges there include Digits, and digits aren't part of the 
Universal Alphas allowed to use as an IdentifierStart.

Well, digits are allowed in identifiers - just not at the start.

Sorry for acting 
stupid here, but are Digits and Special Characters from Annex D part of 
the Universal Alphas,

Yes, they are.

or are there unmentioned exceptions?

Not so far as I am aware.

In http://www.digitalmars.com/d/lex.html, it says: "Identifiers start with a
letter, _, or unicode alpha, and are followed by any number of letters, _,
digits, or universal alphas. Universal alphas are as defined in ISO/IEC
9899:1999(E) Appendix D. (This is the C99 Standard.) Identifiers can be
arbitrarily long, and are case sensitive. Identifiers starting with __ (two
underscores) are reserved."

So it looks like "universal alphas" are not actually permitted as the /first/
character of a D identifier, only as the second or subsequent char. The /first/
character is apparently allowed to be a "unicode alpha", not a "universal
alpha".

Of course, this begs the question "what is a unicode alpha"? The docs don't
define this. Almost certainly, Walter means "a Unicode character which has the
Alphabetic property", but I don't know that for sure. I would suggest this needs
to be clarified in the documentation. The definition should also state /which
version/ of Unicode the D compiler uses, since new "unicode alphas" will be
added with each new version of Unicode. (Otherwise you could end up in the
curious state whereby new Unicode letters would be allowed at the start of an
identifier but not in the middle or end!)

Moreover, you probably wouldn't /want/ the definition of an identifier to change
with each new release of Unicode. 

Suggestion to Walter: you could redefine an identifier start to be: "an ASCII
letter, underscore, or any universal alpha which has the Unicode Alphabetic
property" (and, obviously, ensure that this definition is met, which it probably
is already). Now you don't need to state a Unicode version number, because
you're dealing only with a fixed and stable subset of Unicode.


Also, trying 
to add any of these characters to my identifier names using Vim's 
hexadecimal mode fails miserably, but that's probably an error on my side.

I know nothing about vim.

Aug 05 2004

Lars Ivar Igesund <larsivar igesund.net> writes:

Arcane Jill wrote:

 In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...
 
 
I looked at the document you pointed out (together with the docs...), 
but the ranges there include Digits, and digits aren't part of the 
Universal Alphas allowed to use as an IdentifierStart.

 
 
 Well, digits are allowed in identifiers - just not at the start.
 
 
Sorry for acting 
stupid here, but are Digits and Special Characters from Annex D part of 
the Universal Alphas,

 
 
 Yes, they are.
 
 
or are there unmentioned exceptions?

 
 
 Not so far as I am aware.
 
 In http://www.digitalmars.com/d/lex.html, it says: "Identifiers start with a
 letter, _, or unicode alpha, and are followed by any number of letters, _,
 digits, or universal alphas. Universal alphas are as defined in ISO/IEC
 9899:1999(E) Appendix D. (This is the C99 Standard.) Identifiers can be
 arbitrarily long, and are case sensitive. Identifiers starting with __ (two
 underscores) are reserved."
 
 So it looks like "universal alphas" are not actually permitted as the /first/
 character of a D identifier, only as the second or subsequent char. The /first/
 character is apparently allowed to be a "unicode alpha", not a "universal
 alpha".
 
 Of course, this begs the question "what is a unicode alpha"? The docs don't
 define this. Almost certainly, Walter means "a Unicode character which has the
 Alphabetic property", but I don't know that for sure. I would suggest this needs
 to be clarified in the documentation. The definition should also state /which
 version/ of Unicode the D compiler uses, since new "unicode alphas" will be
 added with each new version of Unicode. (Otherwise you could end up in the
 curious state whereby new Unicode letters would be allowed at the start of an
 identifier but not in the middle or end!)
 
 Moreover, you probably wouldn't /want/ the definition of an identifier to change
 with each new release of Unicode. 
 
 Suggestion to Walter: you could redefine an identifier start to be: "an ASCII
 letter, underscore, or any universal alpha which has the Unicode Alphabetic
 property" (and, obviously, ensure that this definition is met, which it probably
 is already). Now you don't need to state a Unicode version number, because
 you're dealing only with a fixed and stable subset of Unicode.

Hmm, that didn't answer anything, you just came to the same conclusion 
as me regarding the somewhat lacking documentation :) Another point, 
above the text you quoted,
"
IdentifierStart:
	_
	Letter
	UniversalAlpha
"

Note that there is no mention of Unicode Alpha, neither there nor 
elsewhere except in the excerpt you mentioned.

Walter, an obvious bug in the documentation has been found. A fix would be 
received with joyous celebrations across the globe (or at least in an 
axis between me and Jill). And a clarification in this thread :)

Also, trying 
to add any of these characters to my identifier names using Vim's 
hexadecimal mode fails miserably, but that's probably an error on my side.

 
 
 I know nothing about vim.

Well, what I did was to add characters from the list as part of 
identifiers, but dmd didn't accept them. It is possible that the file 
wasn't saved in the correct format, but that didn't look like the problem. 
I might look at it again later when I have something to test.

Lars Ivar Igesund

Aug 05 2004
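
The IdentifierStart grammar quoted above can be turned into a small checker. This is a sketch under stated assumptions: the real UniversalAlpha set comes from C99 Annex D, and here it is approximated by "non-ASCII and alphabetic according to std.uni.isAlpha", which is not the same set. All function names are hypothetical.

```d
import std.ascii : isAsciiAlpha = isAlpha, isAsciiDigit = isDigit;
import std.stdio : writeln;
import std.uni : isUniAlpha = isAlpha;

/// ASSUMPTION: stand-in for the Annex D table, not the real definition.
bool isUniversalAlphaApprox(dchar c)
{
    return c >= 0x80 && isUniAlpha(c);
}

/// IdentifierStart: _, Letter, UniversalAlpha
bool isIdentifierStart(dchar c)
{
    return c == '_' || isAsciiAlpha(c) || isUniversalAlphaApprox(c);
}

/// Later characters may additionally be ASCII digits.
bool isIdentifierChar(dchar c)
{
    return isIdentifierStart(c) || isAsciiDigit(c);
}

/// Checks a whole UTF-8 string against the grammar.
bool isIdentifier(string s)
{
    if (s.length == 0)
        return false;
    bool first = true;
    foreach (dchar c; s)  // foreach with dchar decodes the UTF-8 input
    {
        if (first ? !isIdentifierStart(c) : !isIdentifierChar(c))
            return false;
        first = false;
    }
    return true;
}

void main()
{
    foreach (name; ["ñ", "義", "_x1", "9lives"])
        writeln(name, ": ", isIdentifier(name));
}
```

Iterating with foreach (dchar c; s) avoids converting the whole file up front, which may matter for a tool that scans many source files.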

J C Calvarese <jcc7 cox.net> writes:

Lars Ivar Igesund wrote:
 Arcane Jill wrote:
 
 In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...


...

 Hmm, that didn't answer anything, you just came to the same conclusion 
 as me regarding the somewhat lacking documentation :) Another point, 
 above the text you quoted,
 "
 IdentifierStart:
     _
     Letter
     UniversalAlpha
 "
 
 Note that there is no mention of Unicode Alpha, neither there or 
 elsewhere except in the excerpt you mentioned.

I think when he mentioned "Unicode Alpha", he meant "UniversalAlpha".

 
 Walter, obvious bug in documentation has been found. A fix would be 
 received with joyous celebrations across the globe (or at least in an 
 axis between me and Jill). And a clarification in this thread :)
 
 Also, trying to add any of these characters to my identifier names 
 using Vim's hexadecimal mode fails miserably, but that's probably an 
 error on my side.



 I know nothing about vim.

 
 
 Well, what I did, was to add characters from the list as part of 
 identifiers, but dmd didn't accept them. It is possible that the file 
 wasn't saved in the correct format, but it didn't look like the problem. 
 I might look at again later when I have something to test.
 
 Lars Ivar Igesund

I don't know about your test, but I got D to accept a Spanish letter (ñ) 
and a Chinese character (義) as identifiers.

In case this stuff gets garbled in transmission, I attached a .zip.


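import std.stdio;

// Non-ASCII identifiers: a Spanish letter and a Chinese character.
const char[] ñ = "eñe";
const char[] 義 = "justice";

void main()
{
     writefln("Feliz Cumpleaños.");
     // Pass the strings as arguments rather than as the format string.
     writefln("%s", ñ);
     writefln("%s", 義);

     /* It doesn't print right, but that's probably DOS's fault. */
}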


-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/

Aug 05 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <ceu2eq$21vs$1 digitaldaemon.com>, Lars Ivar Igesund says...

Well, what I did, was to add characters from the list as part of 
identifiers, but dmd didn't accept them. It is possible that the file 
wasn't saved in the correct format, but it didn't look like the problem. 
I might look at again later when I have something to test.

I know a little about Unicode, but far less about compilers, and pretty much
nothing at all about the D compiler. I'm afraid this question will have to be
answered by someone else.

J C Calvarese said 'I think when he [Walter] mentioned "Unicode Alpha", he meant
"UniversalAlpha".'

If this is true, it would seem strange (to me). It would imply that
non-ASCII-digits are allowed as the first character of an identifier. Not that
that's necessarily a /bad/ thing - just unexpected.

Jill

Aug 06 2004

J C Calvarese <jcc7 cox.net> writes:

In article <cevaqt$2u57$1 digitaldaemon.com>, Arcane Jill says...
In article <ceu2eq$21vs$1 digitaldaemon.com>, Lars Ivar Igesund says...

Well, what I did, was to add characters from the list as part of 
identifiers, but dmd didn't accept them. It is possible that the file 
wasn't saved in the correct format, but it didn't look like the problem. 
I might look at again later when I have something to test.

I know a little about Unicode, but far less about compilers, and pretty much
nothing at all about the D compiler. I'm afraid this question will have to be
answered by someone else.

J C Calvarese said 'I think when he [Walter] mentioned "Unicode Alpha", he meant
"UniversalAlpha".'

If this is true, it would seem strange (to me). It would imply that
non-ASCII-digits are allowed as the first character of an identifier. Not that
that's necessarily a /bad/ thing - just unexpected.

Jill

(Did you happen to scroll down to my example in
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/8333?)

It's definitely allowed. I was able to use a _Chinese character_ as an
identifier. It was the first character. It was the last character. It was the
only character.

I can upload the example to my web site if you can't get the example from the
post.

jcc7

Aug 06 2004

"Martin M. Pedersen" <martin moeller-pedersen.dk> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cevaqt$2u57$1 digitaldaemon.com...
 If this is true, it would seem strange (to me). It would imply that
 non-ASCII-digits are allowed as the first character of an identifier. Not that
 that's necessarily a /bad/ thing - just unexpected.

C allows it, and so must D to be link-compatible. The relevant C grammar is:

identifier:
    identifier-nondigit
    identifier identifier-nondigit
    identifier digit

identifier-nondigit:
    nondigit
    universal-character-name
    other implementation-defined characters

universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad

hex-quad:
    hexadecimal-digit hexadecimal-digit  hexadecimal-digit hexadecimal-digit

Constraints
A universal character name shall not specify a character short identifier in
the range 00000000 through 00000020, 0000007F through 0000009F, or 0000D800
through 0000DFFF inclusive. A universal character name shall not designate a
character in the required character set.

Description
Universal character names may be used in identifiers, character constants,
and string literals to designate characters that are not in the required
character set.

Semantics
The universal character name \Unnnnnnnn designates the character whose
character short identifier (as specified by ISO/IEC 10646) is nnnnnnnn.
Similarly, the universal character name \unnnn designates the character
whose character short identifier is 0000nnnn.


Regards,
Martin M. Pedersen

Aug 06 2004
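
The range constraint quoted above is easy to express as a predicate. The sketch below encodes only the three ranges exactly as they appear in the quoted text; it does not model the further rule that a universal character name must not designate a character in the required character set, and the function name is hypothetical.

```d
import std.stdio : writeln;

/// True if `code` lies outside the three ranges that the quoted
/// constraint forbids for a universal character name.
/// NOT modelled: the "required character set" exclusion.
bool allowedByQuotedConstraint(uint code)
{
    if (code <= 0x20)
        return false;                       // 00000000 through 00000020
    if (code >= 0x7F && code <= 0x9F)
        return false;                       // 0000007F through 0000009F
    if (code >= 0xD800 && code <= 0xDFFF)
        return false;                       // surrogate range
    return true;
}

void main()
{
    writeln(allowedByQuotedConstraint(0x00F1));  // code point of ñ
    writeln(allowedByQuotedConstraint(0x7FA9));  // code point of 義
    writeln(allowedByQuotedConstraint(0xD800));  // first surrogate
}
```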

"Walter" <newshound digitalmars.com> writes:

"Lars Ivar Igesund" <larsivar igesund.net> wrote in message
news:ceu2eq$21vs$1 digitaldaemon.com...
 Walter, obvious bug in documentation has been found. A fix would be
 received with joyous celebrations across the globe (or at least in an
 axis between me and Jill). And a clarification in this thread :)

I'll fix it.

Aug 06 2004

J C Calvarese <jcc7 cox.net> writes:

Arcane Jill wrote:
 In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
 
In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...

 
 
 Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My
 post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours
 says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

The web interface runs on tachyons. ;)

Actually, I've seen this happen before. I think it's related to the 
delay in time before a message appears on the web when it's posted 
through the web interface. The order looks normal if you're viewing 
through Thunderbird.

Go figure.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/

Aug 05 2004
