
digitalmars.D.learn - Chars ASCII 128+ in Literals?

AEon <aeon2001 lycos.de> writes:
For a short moment I had feared all my char[] based code would not work 
with ASCII characters > 127. But that seems to work...

But when trying to test my function, via:

char[]  t = "Æmoo";   ->  aepar.d(69): invalid UTF-8 sequence


I then tried this, since it may be a mapping issue:

ubyte[] t = "Æmoo";
char[] tc = cast(char[]) t;
writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");

But that does not help either.

It seems that string literals cannot contain characters that are > ASCII 127?

Hmmm...

AEon
Mar 29 2005
"Regan Heath" <regan netwin.co.nz> writes:
On Tue, 29 Mar 2005 23:06:29 +0200, AEon <aeon2001 lycos.de> wrote:
 For a short moment I had feared all my char[] based code would not work  
 with ASCII characters > 127. But that seems to work...

 But when trying to test my function, via:

 char[]  t = "Æmoo";   ->  aepar.d(69): invalid UTF-8 sequence


 I then tried this, since it may be a mapping issue:

 ubyte[] t = "Æmoo";
 char[] tc = cast(char[]) t;
 writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");

 But that does not help either.

 It seems that string literals cannot contain characters that are > ASCII  
 127?

Are you saving this source file as UTF-8? D source files *must* be saved as one of the UTF variants.

Regan

p.s. I would use , instead of ~ in the writefln above: ~ appends the strings together, forming extra temporary strings, whereas , simply prints one argument at a time and creates no extra temporary strings.
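A minimal sketch of the comma-versus-~ point, written for current D (D2) rather than the 2005 compiler used in this thread. In D2 string literals are `string`, and `writefln` expects a format string first, so `writeln` is used for the plain argument list. `stripMarkers` and `describe` are hypothetical stand-ins for the poster's `remove_q1_Color_Names`, which is not shown in the thread; the "^1" marker is made up for the example.

```d
import std.array : replace;
import std.format : format;
import std.stdio : writefln, writeln;

// Hypothetical stand-in for remove_q1_Color_Names: it only strips
// an invented "^1" marker so that the example has something to do.
string stripMarkers(string s)
{
    return s.replace("^1", "");
}

// Builds the same text the output lines below print.
string describe(string s)
{
    return format("\"%s\" -> \"%s\"", s, stripMarkers(s));
}

void main()
{
    string tc = "^1moo";

    // Concatenation: the whole line is built in memory, then printed.
    writeln("\"" ~ tc ~ "\" -> \"" ~ stripMarkers(tc) ~ "\"");

    // Separate arguments: each piece is written in turn.
    writeln("\"", tc, "\" -> \"", stripMarkers(tc), "\"");

    // Format string: same output, usually the easiest to read.
    writefln("\"%s\" -> \"%s\"", tc, stripMarkers(tc));

    assert(describe(tc) == "\"^1moo\" -> \"moo\"");
}
```

All three output lines are identical; the difference is only in how the text is assembled before it reaches the console.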
Mar 29 2005
AEon <aeon2001 lycos.de> writes:
Regan Heath wrote:

 Are you saving this source file as UTF-8?
 D source files *must* be saved as one of the UTF variants.

Hmmm... UltraEdit has:

Auto detect UTF-8 files (On)
Write UTF-8 BOM header to all UTF-8 files when saved (On)
Write UTF-8 on new files created with UltraEdit (if above is not set) (On)

But since I don't even know what UTF-8 is supposed to be, and why that should matter... hmmm

AEon
Mar 29 2005
"Regan Heath" <regan netwin.co.nz> writes:
On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001 lycos.de> wrote:
 Regan Heath wrote:

 Are you saving this source file as UTF-8?
 D source files *must* be saved as one of the UTF variants.

 Hmmm... UltraEdit has:

 Auto detect UTF-8 files (On)
 Write UTF-8 BOM header to all UTF-8 files when saved (On)
 Write UTF-8 on new files created with UltraEdit (if above is not set) (On)

 But since I don't even know what UTF-8 is supposed to be, and why that should matter... hmmm

UTF-8, UTF-16, and UTF-32 are encodings which can encode any Unicode character. Unicode aims to assign a code point to every known character in every language in the world (or at least that's the idea).

UTF-8 uses 8-bit code units; 1 to 4 code units represent a single character.
UTF-16 uses 16-bit code units; 1 or 2 code units represent a single character.
UTF-32 uses 32-bit code units; exactly 1 code unit represents a single character.

UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form. In short, BE means the MSB (most significant byte) of each code unit appears first, followed by the LSB (least significant byte). LE is the opposite of BE.

It's my understanding that if you save your source file as UTF-8, then a literal such as "Æmoo" will be saved as a valid UTF-8 sequence, so you should be able to say:

char[] t = "Æmoo";

and it should compile and run.

The next trick/problem is that the console you write to must support UTF-8 and be in UTF-8 mode. By default the Windows and, I believe, Unix consoles are not in UTF-8 mode, and you need to switch them to UTF-8 mode. I am not sure how this is achieved; hopefully someone else will fill us both in :)

Regan
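To make the code-unit counts concrete, here is a small sketch in current D (D2), where the three encodings map to `string`, `wstring` and `dstring`. The Æ is written as the escape `\u00C6` so the example does not depend on how the file itself is saved. `isValidUtf8` is a helper name invented for this example; it wraps `std.utf.validate`, and the last check imitates what happens when an editor stores Æ as the single Latin-1 byte 0xC6 instead of UTF-8.

```d
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical helper: true if the bytes form a valid UTF-8 sequence.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string  s8  = "\u00C6moo";   // UTF-8
    wstring s16 = "\u00C6moo"w;  // UTF-16
    dstring s32 = "\u00C6moo"d;  // UTF-32

    assert(s8.length == 5);   // Æ needs two 8-bit code units (0xC3 0x86)
    assert(s16.length == 4);  // one 16-bit code unit per character here
    assert(s32.length == 4);  // always one 32-bit code unit per character

    // The UTF-8 bytes of the literal are valid...
    assert(isValidUtf8(s8.representation));

    // ...but the Latin-1 bytes for the same text are not: 0xC6 announces
    // a two-byte sequence and 'm' (0x6D) is not a continuation byte.
    ubyte[] latin1 = [0xC6, 0x6D, 0x6F, 0x6F];
    assert(!isValidUtf8(latin1));
}
```

The failing case is consistent with the "invalid UTF-8 sequence" error reported at the top of the thread, assuming the source file was saved in a single-byte code page.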
Mar 30 2005
AEon <aeon2001 lycos.de> writes:
Regan Heath wrote:
 On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001 lycos.de> wrote:
 
 Regan Heath wrote:

 Are you saving this source file as UTF-8?
 D source files *must* be saved as one of the UTF variants.

 Hmmm... UltraEdit has:

 Auto detect UTF-8 files (On)
 Write UTF-8 BOM header to all UTF-8 files when saved (On)
 Write UTF-8 on new files created with UltraEdit (if above is not set) (On)

 But since I don't even know what UTF-8 is supposed to be, and why that should matter... hmmm

 UTF-8, UTF-16, and UTF-32 are encodings which can encode any Unicode character. Unicode aims to assign a code point to every known character in every language in the world (or at least that's the idea).

 UTF-8 uses 8-bit code units; 1 to 4 code units represent a single character.
 UTF-16 uses 16-bit code units; 1 or 2 code units represent a single character.
 UTF-32 uses 32-bit code units; exactly 1 code unit represents a single character.

 UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form. In short, BE means the MSB (most significant byte) of each code unit appears first, followed by the LSB (least significant byte). LE is the opposite of BE.

Thanks for explaining that. I had skimmed over it in the manual and did not really find it relevant. I am probably too much of an old-timer, who thinks 8-bit is good enough ;)

 It's my understanding that if you save your source file as UTF-8 then 
 if it contains "Æmoo" the literal will be saved as a valid UTF-8 
 sequence so you should be able to say:
 
 char[] t = "Æmoo";
 
 and it should compile and run.

I will do some tests. Possibly I did not create a "new" file, and thus inherited some DOS or default windows txt format, that is not UTF-8. In that case UltraEdit would not change it to UTF-8.
 The next trick/problem is that the console you write it to must support  
 UTF-8 and be in UTF-8 mode. By default the windows and I believe unix  
 consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I 
 am  not sure how this is achieved, hopefully someone else will fill us 
 both in  :)

I noted this when running my ASCII 0-255 test string through my conversion function. The DOS console will not properly show the Æ (possibly a font issue); it will show some weird-looking "F" character. When you copy/paste the console output to UltraEdit, the Æ is presented as |, but if you redirect the exe's output to a text file (e.g. aepar -q3a > .tmp) and then load .tmp in UltraEdit, the Æ is properly shown.

So, as you pointed out, there are a few issues between the DOS console and chars > ASCII 127.

AEon
Mar 30 2005
"Carlos Santander B." <csantander619 gmail.com> writes:
Regan Heath wrote:
 The next trick/problem is that the console you write it to must support  
 UTF-8 and be in UTF-8 mode. By default the windows and I believe unix  
 consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I 
 am  not sure how this is achieved, hopefully someone else will fill us 
 both in  :)
 

 Regan

I think Linux supports it natively (at least I haven't had any problems with UTF-8 output, even if I had to configure it on Ubuntu). On Windows (I don't know if this applies to all versions), you can do it like this: "chcp 65001". By default, it isn't set. I don't know how to make it the default code page.
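The same switch can also be attempted from inside the program. The sketch below is for current D (D2) and assumes that druntime's Windows bindings (`core.sys.windows.windows`) expose the Win32 function `SetConsoleOutputCP`; `useUtf8Console` is a name invented for this example. On other systems it does nothing, on the assumption that the terminal is already in UTF-8 mode.

```d
import std.stdio : writeln;

version (Windows)
{
    import core.sys.windows.windows : SetConsoleOutputCP;
}

// Hypothetical helper: tries to put console output into UTF-8 mode.
// Returns true on success; on non-Windows systems it is a no-op.
bool useUtf8Console()
{
    version (Windows)
    {
        // 65001 is the UTF-8 code page, the same number "chcp 65001" uses.
        return SetConsoleOutputCP(65001) != 0;
    }
    else
    {
        return true;
    }
}

void main()
{
    if (!useUtf8Console())
        writeln("could not switch the console to UTF-8");

    writeln("\u00C6moo"); // \u00C6 is Æ
}
```

Whether the character then displays correctly still depends on the console font, as noted earlier in the thread.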

_______________________ Carlos Santander Bernal
Mar 30 2005
Chris Sauls <ibisbasenji gmail.com> writes:
AEon wrote:
 For a short moment I had feared all my char[] based code would not work 
 with ASCII characters > 127. But that seems to work...
 
 But when trying to test my function, via:
 
 char[]  t = "Æmoo";   ->    aepar.d(69): invalid UTF-8 sequence

Try this:

char[] t = "\&AElig;moo";

This is an example of "Named Character Entities", a new feature as of DMD 0.116. Read more here:
http://www.digitalmars.com/d/lex.html#EscapeSequence
http://www.digitalmars.com/d/entity.html

-- Chris Sauls
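A quick check of the named-entity form, sketched for current D (D2), where the literal is a `string` (a mutable `char[]` would need `.dup`). It compares the entity with the equivalent Unicode escape, so the source file can stay pure ASCII either way.

```d
void main()
{
    string viaEntity = "\&AElig;moo"; // named character entity
    string viaEscape = "\u00C6moo";   // Unicode escape for the same character

    assert(viaEntity == viaEscape);
    assert(viaEntity.length == 5);    // Æ is two UTF-8 code units, plus "moo"
}
```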
Mar 29 2005
psychotic <psychotic_member pathlink.com> writes:
Maybe you will find this example I posted on dsource.org interesting, although it is
not cross-platform:
http://www.dsource.org/tutorials/index.php?show_example=147
To quote the description: "This Windows specific code, allows you to print international
characters (Greek for instance) on the non UTF-8 Windows console".

Best Regards
~psychotic 
Apr 04 2005