digitalmars.D.learn - Chars ASCII 128+ in Literals?
- AEon (12/12) Mar 29 2005 For a short moment I had feared all my char[] based code would not work
- Regan Heath (7/18) Mar 29 2005 Are you saving this source file as UTF-8?
- AEon (8/10) Mar 29 2005 Hmmm...UltraEdit has:
- Regan Heath (26/36) Mar 30 2005 UTF-8, UTF-16, and UTF-32 are encodings which can encode any unicode
- AEon (16/62) Mar 30 2005 Thanx for explaining that. I had skimmed over it in the manual and did
- Carlos Santander B. (8/15) Mar 30 2005 I think linux supports it natively (at least I haven't had any problems
- Chris Sauls (8/14) Mar 29 2005 Try this:
- psychotic (7/7) Apr 04 2005 Maybe you will find this example i posted on dsource.org interesting, al...
For a short moment I had feared all my char[] based code would not work with ASCII characters > 127. But that seems to work... But when trying to test my function, via:

    char[] t = "Æmoo";

-> aepar.d(69): invalid UTF-8 sequence

I then tried this, since it may be a mapping issue:

    ubyte[] t = "Æmoo";
    char[] tc = cast(char[]) t;
    writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");

But that does not help either. It seems that string literals cannot contain characters that are > ASCII 127? Hmmm...

AEon
Mar 29 2005
On Tue, 29 Mar 2005 23:06:29 +0200, AEon <aeon2001 lycos.de> wrote:
> For a short moment I had feared all my char[] based code would not work
> with ASCII characters > 127. But that seems to work... But when trying
> to test my function, via:
>
>     char[] t = "Æmoo";
>
> -> aepar.d(69): invalid UTF-8 sequence
> [...]
> But that does not help either. It seems that string literals cannot
> contain characters that are > ASCII 127?

Are you saving this source file as UTF-8? D source files *must* be saved as one of the UTF variants.

Regan

p.s. I would use , instead of ~ in the writefln above, as ~ will append the strings together, forming extra temporary strings, whereas , will simply print them one at a time, creating no extra temporary strings.
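For instance, a minimal sketch of the difference (assuming the file is saved as UTF-8 and Phobos' std.stdio):

    import std.stdio;

    void main()
    {
        char[] tc = "Æmoo";

        // ~ builds a temporary string first, then prints it:
        writefln("\"" ~ tc ~ "\"");

        // , passes the pieces separately; writefln prints them one
        // after another, creating no temporary strings:
        writefln("\"", tc, "\"");
    }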
Mar 29 2005
Regan Heath wrote:
> Are you saving this source file as UTF-8? D source files *must* be
> saved as one of the UTF variants.

Hmmm...UltraEdit has:

    Auto detect UTF-8 files (On)
    Write UTF-8 BOM header to all UTF-8 files when saved (On)
    Write UTF-8 on new files created with UltraEdit (if above is not set) (On)

But since I don't even know what UTF-8 is supposed to be, and why that should matter... hmmm

AEon
Mar 29 2005
On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001 lycos.de> wrote:
>> Are you saving this source file as UTF-8? D source files *must* be
>> saved as one of the UTF variants.
>
> Hmmm...UltraEdit has:
>
>     Auto detect UTF-8 files (On)
>     Write UTF-8 BOM header to all UTF-8 files when saved (On)
>     Write UTF-8 on new files created with UltraEdit (if above is not set) (On)
>
> But since I don't even know what UTF-8 is supposed to be, and why that
> should matter... hmmm

UTF-8, UTF-16, and UTF-32 are encodings which can encode any Unicode character. Unicode ("universal encoding") has an encoding for every known existing character in every language in the world (or at least that's the idea).

UTF-8 uses 8-bit code units; 1 or more code units are used to represent a single character.
UTF-16 uses 16-bit code units; 1 or more code units are used to represent a single character.
UTF-32 uses 32-bit code units; a single code unit represents a single character.

UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form. In short, BE means the MSB (most significant bits) of the code unit appear first, followed by the LSB (least significant bits). LE is the opposite of BE.

It's my understanding that if you save your source file as UTF-8, then if it contains "Æmoo" the literal will be saved as a valid UTF-8 sequence, so you should be able to say:

    char[] t = "Æmoo";

and it should compile and run.

The next trick/problem is that the console you write it to must support UTF-8 and be in UTF-8 mode. By default the Windows and, I believe, Unix consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I am not sure how this is achieved; hopefully someone else will fill us both in :)

Regan
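As a small sketch of how the encodings relate (assuming the source file is saved in one of the UTF encodings), the same literal can be stored in any of D's three string types, and the compiler transcodes it at compile time:

    char[]  u8  = "Æmoo";   // UTF-8  code units (the Æ takes two bytes)
    wchar[] u16 = "Æmoo";   // UTF-16 code units
    dchar[] u32 = "Æmoo";   // UTF-32 code units, one per character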
Mar 30 2005
Regan Heath wrote:
> UTF-8, UTF-16, and UTF-32 are encodings which can encode any Unicode
> character. [...]

Thanx for explaining that. I had skimmed over it in the manual and did not really find it relevant. I am probably too much of an old-timer, who thinks 8-bit is good enough ;)

> It's my understanding that if you save your source file as UTF-8, then
> if it contains "Æmoo" the literal will be saved as a valid UTF-8
> sequence, so you should be able to say: char[] t = "Æmoo"; and it
> should compile and run.

I will do some tests. Possibly I did not create a "new" file, and thus inherited some DOS or default Windows txt format that is not UTF-8. In that case UltraEdit would not change it to UTF-8.

> The next trick/problem is that the console you write it to must support
> UTF-8 and be in UTF-8 mode. By default the Windows and, I believe, Unix
> consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I
> am not sure how this is achieved; hopefully someone else will fill us
> both in :)

I noted this when running my ASCII 0-255 test string through my conversion function. The DOS console will not properly show the Æ (possibly a font issue); it will show some weird-looking "F" character. When you copy/paste the console output into UltraEdit, the Æ is presented as |, but if you divert the exe's output to a text file (e.g. aepar -q3a > .tmp) and then load .tmp in UltraEdit, the Æ is properly shown. So, as you pointed out, there are a few issues between the DOS console and chars > ASCII 127.

AEon
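A minimal sketch of that file-based workaround (assuming Phobos' std.file; writing the raw UTF-8 bytes straight to a file sidesteps the console's code page, and an editor in UTF-8 mode then shows the Æ correctly):

    import std.file;

    void main()
    {
        char[] t = "Æmoo";
        // Write the UTF-8 bytes directly to a file instead of the console.
        std.file.write("out.tmp", t);
    }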
Mar 30 2005
Regan Heath wrote:
> The next trick/problem is that the console you write it to must support
> UTF-8 and be in UTF-8 mode. By default the Windows and, I believe, Unix
> consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I
> am not sure how this is achieved; hopefully someone else will fill us
> both in :)

I think Linux supports it natively (at least I haven't had any problems with UTF-8 output, even if I had to configure it on Ubuntu). On Windows (I don't know if all Windows versions), you can do it like this: "chcp 65001". By default, it isn't set. I don't know how to make it the default codepage.

_______________________
Carlos Santander Bernal
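For what it's worth, a hedged sketch of making the switch from inside a D program (SetConsoleOutputCP is the Win32 call behind the output half of "chcp"; untested, and whether the characters then display correctly still depends on the console font):

    import std.stdio;

    // Win32 declaration, assumed to be linked from kernel32:
    extern (Windows) int SetConsoleOutputCP(uint wCodePageID);

    void main()
    {
        SetConsoleOutputCP(65001);   // roughly "chcp 65001" for output
        writefln("Æmoo");
    }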
Mar 30 2005
AEon wrote:
> For a short moment I had feared all my char[] based code would not work
> with ASCII characters > 127. But that seems to work... But when trying
> to test my function, via:
>
>     char[] t = "Æmoo";
>
> -> aepar.d(69): invalid UTF-8 sequence

Try this:

This is an example of "Named Character Entities," a new feature as of DMD 0.116.

Read more here:
http://www.digitalmars.com/d/lex.html#EscapeSequence
http://www.digitalmars.com/d/entity.html

-- Chris Sauls
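An illustrative sketch of what such a named-entity literal looks like (not necessarily the exact snippet intended above; \&AElig; is the named entity for Æ):

    // The \&name; escape inserts a named character entity, so the
    // source file itself needs no characters above ASCII 127:
    char[] t = "\&AElig;moo";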
Mar 29 2005
Maybe you will find this example I posted on dsource.org interesting, although it's not cross-platform: [ http://www.dsource.org/tutorials/index.php?show_example=147 ]. To quote the description: "This Windows specific code, allows you to print international characters (Greek for instance) on the non UTF-8 Windows console".

Best Regards
~psychotic
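One way this can be done (only a guess at the approach the linked example takes, and untested): convert the string to UTF-16 and hand it to the wide-character console API, which does not go through the console's code page:

    import std.utf;

    // Win32 declarations, assumed to be linked from kernel32:
    extern (Windows)
    {
        void* GetStdHandle(uint nStdHandle);
        int WriteConsoleW(void* hConsoleOutput, void* lpBuffer,
                          uint nNumberOfCharsToWrite,
                          uint* lpNumberOfCharsWritten, void* lpReserved);
    }

    const uint STD_OUTPUT_HANDLE = cast(uint) -11;

    void main()
    {
        // Convert the UTF-8 literal to UTF-16 and write it with the
        // wide console API, bypassing the current code page.
        wchar[] w = std.utf.toUTF16("Æmoo\r\n");
        uint written;
        WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), w.ptr, w.length,
                      &written, null);
    }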
Apr 04 2005