
digitalmars.D - identifiers & "unialpha"

reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:

http://www.digitalmars.com/d/lex.html#identifier





Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" to define
"universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" doesn't list
"universal alpha".

Example:
\u00B7 (MIDDLE DOT, Other_Punctuation) isn't a "universal alpha" but
is allowed in identifiers by Appendix D.

"ISO/IEC 9899:1999 (E) Appendix D" itself references
"ISO/IEC TR 10176:1998" for the character data. I strongly suggest
dropping the redirection via "Appendix D" and using
"ISO/IEC TR 10176 (current)" instead of the dated version
"ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
few of the CJK and math characters that can be found in the current version.

Thomas


Sep 22 2006
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Thomas Kuehne wrote:
 
 http://www.digitalmars.com/d/lex.html#identifier




 
 Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
 "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
 "universal alpha".
 
 Sample:
 \u00B7 (MIDDLE DOT, Other_Punctuation) isn't an "universal alpha" but
 allowed by Appendix D in identifiers.
 
 "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
 "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
 drop the redirection via "Appendix D" and use
 "ISO/IEC TR 10176 (current)" instead of the dated version
 "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
 chunk of CJK and Math characters that can be found in the current version.
Agreed. Incidentally, the 2003 revision to the C++ standard ("ISO/IEC 14882:2003(E)"), Appendix E, contains a revised copy of the character table (which is likely from "ISO/IEC TR 10176:2003") and appears to have done away with the "special characters" section entirely. So I suspect your suggestion would eliminate the problem you mention above as well? Sean
Sep 22 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Sean Kelly schrieb am 2006-09-22:
 Thomas Kuehne wrote:
 
 http://www.digitalmars.com/d/lex.html#identifier




 
 Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
 "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
 "universal alpha".
 
 Sample:
 \u00B7 (MIDDLE DOT, Other_Punctuation) isn't an "universal alpha" but
 allowed by Appendix D in identifiers.
 
 "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
 "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
 drop the redirection via "Appendix D" and use
 "ISO/IEC TR 10176 (current)" instead of the dated version
 "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
 chunk of CJK and Math characters that can be found in the current version.
Agreed. Incidentally, the 2003 revision to the C++ standard ("ISO/IEC 14882:2003(E)"), Appendix E, contains a revised copy of the character table (which is likely from "ISO/IEC TR 10176:2003") and appears to have done away with the "special characters" section entirely. So I suspect your suggestion would eliminate the problem you mention above as well?
Yes. How about this rewrite:

Accessing ISO standards can be complicated. Here are the cross-references for Unicode's UnicodeData.txt. For the relation between Unicode and ISO 10646 see
http://en.wikipedia.org/wiki/ISO/IEC_10646#Differences_between_ISO_10646_and_Unicode

Letters:
    Uppercase_Letter (Lu)
    Lowercase_Letter (Ll)
    Titlecase_Letter (Lt)
    Modifier_Letter (Lm)
    Other_Letter (Lo)

NonspacingMarks:
    Nonspacing_Mark (Mn)

Numbers:
    Decimal_Number (Nd)
    Letter_Number (Nl)
    Other_Number (No)

Thomas
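
A minimal sketch of the category grouping listed above, assuming Python's standard `unicodedata` module as a stand-in for looking up general categories in UnicodeData.txt. The names `PROPOSED_GROUPS` and `classify` are made up for illustration, results depend on the Unicode version bundled with the interpreter, and none of this is how D's lexer or std.uni is implemented.

```python
import unicodedata

# General categories grouped as in the proposed rewrite above.
PROPOSED_GROUPS = {
    "letter": {"Lu", "Ll", "Lt", "Lm", "Lo"},
    "nonspacing mark": {"Mn"},
    "number": {"Nd", "Nl", "No"},
}


def classify(ch):
    """Return the proposed group name for one character, or None."""
    category = unicodedata.category(ch)
    for group, categories in PROPOSED_GROUPS.items():
        if category in categories:
            return group
    return None


if __name__ == "__main__":
    # 'A', MIDDLE DOT, COMBINING ACUTE ACCENT, DIGIT SEVEN
    for ch in ["A", "\u00B7", "\u0301", "7"]:
        print("U+%04X" % ord(ch), unicodedata.category(ch), classify(ch))
```

Under this grouping U+00B7 (general category Po) falls into none of the three groups.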
Sep 22 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Thomas Kuehne wrote:
 
 http://www.digitalmars.com/d/lex.html#identifier




 
 Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
 "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
 "universal alpha".
 
 Sample:
 \u00B7 (MIDDLE DOT, Other_Punctuation) isn't an "universal alpha" but
 allowed by Appendix D in identifiers.
 
 "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
 "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
 drop the redirection via "Appendix D" and use
 "ISO/IEC TR 10176 (current)" instead of the dated version
 "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
 chunk of CJK and Math characters that can be found in the current version.
I'd like to leave things as they are for 1.0. I don't think that anyone's code will be adversely affected by not having the latest alpha character additions to identifiers, and I also don't think math characters should be part of identifiers. What is CJK? As it is now, it matches standard C's definition of identifiers, which is the intent of the reference. I haven't checked, but I think it matches Java's idea of an identifier character, too. P.S. It also bugs me that the unicode people can't seem to make up their minds. Do character sets really need to change every 2 or 3 years?
Sep 22 2006
next sibling parent reply Pragma <ericanderton yahoo.removeme.com> writes:
 Thomas Kuehne wrote:
 What is CJK?
Just a guess: "Chinese, Japanese & Korean"? - Eric
Sep 22 2006
parent nobody <nobody mailinator.com> writes:
Pragma wrote:
 Thomas Kuehne wrote:
 What is CJK?
Just a guess: "Chinese, Japanese & Korean"? - Eric
Your guess is correct. Wikipedia does a great job explaining CJK: http://en.wikipedia.org/wiki/CJK
Sep 22 2006
prev sibling parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Walter Bright schrieb am 2006-09-22:
 Thomas Kuehne wrote:
 
 http://www.digitalmars.com/d/lex.html#identifier




 
 Why is D referencing "ISO/IEC 9899:1999 (E) Appendix D" for defining
 "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" isn't listing
 "universal alpha".
 
 Sample:
 \u00B7 (MIDDLE DOT, Other_Punctuation) isn't an "universal alpha" but
 allowed by Appendix D in identifiers.
 
 "ISO/IEC 9899:1999 (E) Appendix D" itself is referencing
 "ISO/IEC TR 10176:1998" for the character data. I strongly suggest to
 drop the redirection via "Appendix D" and use
 "ISO/IEC TR 10176 (current)" instead of the dated version
 "ISO/IEC TR 10176:1998". The 1998 version didn't yet include quite a
 chunk of CJK and Math characters that can be found in the current version.
I'd like to leave things as they are for 1.0. I don't think that anyone's code will be adversely affected by not having the latest alpha character additions to identifiers, and I also don't think math characters should be part of identifiers. What is CJK?
CJK: Chinese, Japanese & Korean

0x20000 .. 0x2A6D6 CJK Ideograph Extension B
0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS
 As it is now, it matches standard C's definition of identifiers, which 
 is the intent of the reference. I haven't checked, but I think it 
 matches Java's idea of an identifier character, too.
ISO/IEC 9899:1999 (E) Appendix D:

Whereas Appendix D defines valid characters in identifiers, D uses it as a source for "universal alpha". As a consequence std.uni.isUniAlpha claims that \u00B7 (MIDDLE DOT) is a letter...
 P.S. It also bugs me that the unicode people can't seem to make up their 
 minds. Do character sets really need to change every 2 or 3 years?
Task at hand: Create a table of all characters used by humans all over the world and minimize friction due to political issues (e.g. characters' names). Except for bug fixes (typos...) the unicode people usually only extend previous versions of the standard.

Thomas
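
The split between "allowed in an identifier" and "is a letter" can be illustrated with a small sketch, again assuming Python's standard `unicodedata` module; `is_letter` and `describe` are hypothetical helper names. The last check uses Python's own identifier rules (XID_Start / XID_Continue), not C99 Appendix D or D's lexer, and is shown only as another place where the same split appears.

```python
import unicodedata

MIDDLE_DOT = "\u00B7"


def is_letter(ch):
    """True if the general category is one of the letter categories (L*)."""
    return unicodedata.category(ch).startswith("L")


def describe(ch):
    """Code point, character name and general category as one string."""
    return "U+%04X %s (%s)" % (ord(ch), unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    print(describe(MIDDLE_DOT))
    print("letter:", is_letter(MIDDLE_DOT))
    # Python's identifier syntax, not Appendix D: MIDDLE DOT may continue
    # an identifier even though it is punctuation by general category.
    print("inside a Python identifier:", ("a" + MIDDLE_DOT + "b").isidentifier())
```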
Sep 22 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Thomas Kuehne wrote:
 Walter Bright schrieb am 2006-09-22:
 What is CJK?
CJK: Chinese, Japanese & Korean 0x20000 .. 0x2A6D6 CJK Ideograph Extension B 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS
Thank-you.
 As it is now, it matches standard C's definition of identifiers, which 
 is the intent of the reference. I haven't checked, but I think it 
 matches Java's idea of an identifier character, too.
ISO/IEC 9899:1999 (E) Appendix D Whereas Appendix D defines valid characters in identifiers, D uses it as a source for "universal alpha". As a consequence std.uni.isUniAlpha claims that \u00B7 (MIDDLE DOT) is a letter...
I guess I don't see why C99 would say · is a valid identifier character if it isn't an alpha. It's all confusing to me, and I think needlessly complicated. Is \u00B7 the only difference?
 
 P.S. It also bugs me that the unicode people can't seem to make up their 
 minds. Do character sets really need to change every 2 or 3 years?
Task at hand: Create a table of all characters used by humans all over the world and minimize friction due to political issues (e.g. characters' names). Except for bug fixes (typos...) the unicode people usually only extend previous versions of the standard.
Chinese, Japanese, and Korean are hardly obscure so I don't see why the character sets for them seem to need large numbers of additions this late in the game.
Sep 22 2006
next sibling parent Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 Thomas Kuehne wrote:
 ISO/IEC 9899:1999 (E) Appendix D



 Whereas Appendix D defines valid characters in identifiers, D uses it
 as a source for "universal alpha". As a consequence std.uni.isUniAlpha
 claims that \u00B7 (MIDDLE DOT) is a letter...
I guess I don't see why C99 would say · is a valid identifier character if it isn't an alpha. It's all confusing to me, and I think needlessly complicated. Is \u00B7 the only difference?
No, there are other differences as well. I think C99 was simply referring to the latest version of the document available in 1999, and it has since been revised (in 2003, apparently). But I have no idea why characters present in the 1999 doc are not present in the 2003 doc.

To pass the buck even further, "ISO/IEC TR 10176:2003" Annex A says the following:

"This list comprises the letters (combining or not), syllables, and ideographs from ISO/IEC 10646-1, together with the modifier letters and marks conventionally used as parts of words."

So their list of characters is copied from the Unicode standard (ISO/IEC 10646). I can only conclude that the Unicode standard changed between 1999 and 2003 and ISO/IEC 10176 simply incorporated the new list. But who knows why the list was changed.

This does raise an interesting point, however. Since the C and C++ standards separately refer to ISO/IEC 10176 for their character list, the identifiers a compliant C99 and C++2003 compiler should accept are different. This seems contrary to the usual C++ practice of deferring to the C standard on semantic issues.
 P.S. It also bugs me that the unicode people can't seem to make up 
 their minds. Do character sets really need to change every 2 or 3 years?
Task at hand: Create a table of all characters used by humans all over the world and minimize friction due to political issues (e.g. characters' names). Except for bug fixes (typos...) the unicode people usually only extend previous versions of the standard.
Chinese, Japanese, and Korean are hardly obscure so I don't see why the character sets for them seem to need large numbers of additions this late in the game.
Me either. But then I'm not terribly inclined to read the Unicode standards committee minutes to find out either :-) Sean
Sep 22 2006
prev sibling next sibling parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Walter Bright schrieb am 2006-09-22:
 Thomas Kuehne wrote:
 Walter Bright schrieb am 2006-09-22:
 What is CJK?
CJK: Chinese, Japanese & Korean 0x20000 .. 0x2A6D6 CJK Ideograph Extension B 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS
Thank-you.
 As it is now, it matches standard C's definition of identifiers, which 
 is the intent of the reference. I haven't checked, but I think it 
 matches Java's idea of an identifier character, too.
ISO/IEC 9899:1999 (E) Appendix D Whereas Appendix D defines valid characters in identifiers, D uses it as a source for "universal alpha". As a consequence std.uni.isUniAlpha claims that \u00B7 (MIDDLE DOT) is a letter...
I guess I don't see why C99 would say · is a valid identifier character if it isn't an alpha. It's all confusing to me, and I think needlessly complicated. Is \u00B7 the only difference?
No, see attachment. Format: "[first_in_range, last_in_range],"

Thomas

[attachment: isalpha.zip (uuencoded binary omitted)]
Sep 22 2006
prev sibling parent reply Kevin Bealer <kevinbealer gmail.com> writes:
Walter Bright wrote:
 Thomas Kuehne wrote:
 Walter Bright schrieb am 2006-09-22:
 What is CJK?
CJK: Chinese, Japanese & Korean 0x20000 .. 0x2A6D6 CJK Ideograph Extension B 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS
Thank-you.
...
 Task at hand: Create a table of all characters used by humans all over
 the world and minimize friction due to political issues
 (e.g. characters' names). Except for bug fixes (typos...) the unicode 
 people
 usually only extend previous versions of the standard.
Chinese, Japanese, and Korean are hardly obscure so I don't see why the character sets for them seem to need large numbers of additions this late in the game.
I think the big-alphabet languages tend to coin new letters somewhat like other languages do words (but maybe less frequently), but I'm not sure about that. I have heard, though, that Chinese was simplified to a smaller set with different appearances during the revolution and the various political upheavals since. They have been adding letters back since as they discover they are really needed -- so these get put into Unicode. If you've read "1984" by Orwell, it's something like the motivation for NewSpeak. Old literature is written in the old letters, and is disappearing because the public can't read it. It's a kind of history censorship - you can't translate the old Chinese literature because they want to destroy the old culture as it competes philosophically with Communism. Essentially, they didn't have to burn all the old books -- they just burned all the old printing presses. Kevin
Sep 23 2006
parent reply Kristian <kjkilpi gmail.com> writes:
On Sat, 23 Sep 2006 11:40:08 +0300, Kevin Bealer <kevinbealer gmail.com>  
wrote:

 Walter Bright wrote:
 Thomas Kuehne wrote:
 Walter Bright schrieb am 2006-09-22:
 What is CJK?
CJK: Chinese, Japanese & Korean 0x20000 .. 0x2A6D6 CJK Ideograph Extension B 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS
Thank-you.
...
 Task at hand: Create a table of all characters used by humans all over
 the world and minimize friction due to political issues
 (e.g. characters' names). Except for bug fixes (typos...) the unicode  
 people
 usually only extend previous versions of the standard.
Chinese, Japanese, and Korean are hardly obscure so I don't see why the character sets for them seem to need large numbers of additions this late in the game.
I think the big-alphabet languages tend to coin new letters somewhat like other languages do words (but maybe less frequently), but I'm not sure about that. I have heard, though, that Chinese was simplified to a smaller set with different appearances during the revolution and the various political upheavals since. They have been adding letters back since as they discover they are really needed -- so these get put into Unicode. If you've read "1984" by Orwell, it's something like the motivation for NewSpeak. Old literature is written in the old letters, and is disappearing because the public can't read it.
 It's a kind of history censorship - you can't translate the old Chinese  
 literature because they want to destroy the old culture as it competes  
 philosophically with Communism.

 Essentially, they didn't have to burn all the old books -- they just  
 burned all the old printing presses.

 Kevin
If that's the case, I'm very sorry to hear that! :(
Sep 23 2006
parent reply Sean Kelly <sean f4.ca> writes:
Kristian wrote:
 On Sat, 23 Sep 2006 11:40:08 +0300, Kevin Bealer <kevinbealer gmail.com> 
 wrote:
 
 It's a kind of history censorship - you can't translate the old 
 Chinese literature because they want to destroy the old culture as it 
 competes philosophically with Communism.

 Essentially, they didn't have to burn all the old books -- they just 
 burned all the old printing presses.
If that's the case, I'm very sorry to hear that! :(
This is completely off-topic, but if you're interested in learning a bit about the Communist Revolution in China the fun way, go find the movie "To Live" in the foreign film section of your favorite video store. It's an excellent film that spans maybe 30 years of Chinese history, including the Communist Revolution. Sean
Sep 23 2006
parent reply Kristian <kjkilpi gmail.com> writes:
On Sat, 23 Sep 2006 19:01:36 +0300, Sean Kelly <sean f4.ca> wrote:

 Kristian wrote:
 On Sat, 23 Sep 2006 11:40:08 +0300, Kevin Bealer  
 <kevinbealer gmail.com> wrote:

 It's a kind of history censorship - you can't translate the old  
 Chinese literature because they want to destroy the old culture as it  
 competes philosophically with Communism.

 Essentially, they didn't have to burn all the old books -- they just  
 burned all the old printing presses.
If that's the case, I'm very sorry to hear that! :(
This is completely off-topic, but if you're interested in learning a bit about the Communist Revolution in China the fun way, go find the movie "To Live" in the foreign film section of your favorite video store. It's an excellent film that spans maybe 30 years of Chinese history, including the Communist Revolution. Sean
Thanks for the tip.
Sep 25 2006
parent reply Kevin Bealer <kevinbealer gmail.com> writes:
Kristian wrote:
 On Sat, 23 Sep 2006 19:01:36 +0300, Sean Kelly <sean f4.ca> wrote:
 
 Kristian wrote:
 On Sat, 23 Sep 2006 11:40:08 +0300, Kevin Bealer 
 <kevinbealer gmail.com> wrote:

 It's a kind of history censorship - you can't translate the old 
 Chinese literature because they want to destroy the old culture as 
 it competes philosophically with Communism.

 Essentially, they didn't have to burn all the old books -- they just 
 burned all the old printing presses.
If that's the case, I'm very sorry to hear that! :(
This is completely off-topic, but if you're interested in learning a bit about the Communist Revolution in China the fun way, go find the movie "To Live" in the foreign film section of your favorite video store. It's an excellent film that spans maybe 30 years of Chinese history, including the Communist Revolution. Sean
Thanks for the tip.
Yes - I really enjoyed that movie. The site where I got the history of this, which I tried to summarize above, was a unicode related article. What I wrote above is somewhat negative (intentionally) toward the PRC -- I don't take any of that back, but I thought I should post the link as well. It also has some interesting unicode related info (which is maybe marginally on-topic?) but the technical stuff might be out-dated. http://www.hastingsresearch.com/net/04-unicode-limitations.shtml Kevin
Sep 26 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Kevin Bealer schrieb am 2006-09-26:
 Kristian wrote:
<snip>
 It also has some interesting unicode related info (which is maybe 
 marginally on-topic?) but the technical stuff might be out-dated.

 http://www.hastingsresearch.com/net/04-unicode-limitations.shtml
The technical stuff is way outdated. The article is based on version 3; the current one is 5. Version 4 did fix most of the CJK issues; however, the compatibility ideographs and variant selectors might turn out to be monsters like the infamous tags (0xE0001, 0xE0020 - 0xE007F).

Thomas
Sep 26 2006
prev sibling parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Thomas Kuehne schrieb am 2006-09-22:
 Walter Bright schrieb am 2006-09-22:
 Thomas Kuehne wrote:
<snip>
 I'd like to leave things as they are for 1.0. I don't think that 
 anyone's code will be adversely affected by not having the latest alpha 
 character additions to identifiers, and I also don't think math 
 characters should be part of identifiers. What is CJK?
CJK: Chinese, Japanese & Korean 0x20000 .. 0x2A6D6 CJK Ideograph Extension B 0x2F800 .. 0x2FA1D CJK COMPATIBILITY IDEOGRAPHS
A closer look reveals that Appendix D is also missing (among many others):

0x0712 .. 0x072F SYRIAC LETTER
0x1200 .. 0x1248 ETHIOPIC SYLLABLE
0x13A0 .. 0x13F4 CHEROKEE LETTER
0x3400 .. 0x4DB5 CJK Ideograph Extension A
0xA016 .. 0xA48C YI SYLLABLE
0xF900 .. 0xFAD9 CJK COMPATIBILITY IDEOGRAPH
0xFB46 .. 0xFBB1 HEBREW / ARABIC LETTER
0xFF21 .. 0xFF3A FULLWIDTH LATIN CAPITAL LETTER
0xFF41 .. 0xFF5A FULLWIDTH LATIN SMALL LETTER

Thomas
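
As a rough cross-check of the ranges above, the following sketch looks up the general category of each range's first and last code point, assuming Python's standard `unicodedata` module. `RANGES` is copied from the list above; `endpoints_are_letters` is a hypothetical helper. It only inspects the two endpoints, so it says nothing about gaps inside a range, and the exact categories depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# (first, last, label) as listed in this thread.
RANGES = [
    (0x0712, 0x072F, "SYRIAC LETTER"),
    (0x1200, 0x1248, "ETHIOPIC SYLLABLE"),
    (0x13A0, 0x13F4, "CHEROKEE LETTER"),
    (0x3400, 0x4DB5, "CJK Ideograph Extension A"),
    (0xA016, 0xA48C, "YI SYLLABLE"),
    (0xF900, 0xFAD9, "CJK COMPATIBILITY IDEOGRAPH"),
    (0xFB46, 0xFBB1, "HEBREW / ARABIC LETTER"),
    (0xFF21, 0xFF3A, "FULLWIDTH LATIN CAPITAL LETTER"),
    (0xFF41, 0xFF5A, "FULLWIDTH LATIN SMALL LETTER"),
]


def endpoints_are_letters(first, last):
    """True if both endpoints have a letter general category (L*)."""
    return all(
        unicodedata.category(chr(cp)).startswith("L") for cp in (first, last)
    )


if __name__ == "__main__":
    for first, last, label in RANGES:
        cats = (unicodedata.category(chr(first)), unicodedata.category(chr(last)))
        print("0x%04X .. 0x%04X %-32s %s" % (first, last, label, cats))
```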
Sep 22 2006