digitalmars.D.learn - Chars ASCII 128+ in Literals?

AEon (12/12) Mar 29 2005 For a short moment I had feared all my char[] based code would not work

Regan Heath (7/18) Mar 29 2005 Are you saving this source file as UTF-8?

AEon (8/10) Mar 29 2005 Hmmm...UltraEdit has:

Regan Heath (26/36) Mar 30 2005 UTF-8, UTF-16, and UTF-32 are encodings which can encode any unicode

AEon (16/62) Mar 30 2005 Thanx for explaining that. I had skimmed over it in the manual and did
Carlos Santander B. (8/15) Mar 30 2005 I think linux supports it natively (at least I haven't had any problems

Chris Sauls (8/14) Mar 29 2005 Try this:

psychotic (7/7) Apr 04 2005 Maybe you will find this example i posted on dsource.org interesting, al...

AEon <aeon2001 lycos.de> writes:

For a short moment I had feared all my char[] based code would not work 
with ASCII characters > 127. But that seems to work...

But when trying to test my function, via:

char[]  t = "�Æmoo";   ->  aepar.d(69): invalid UTF-8 sequence


I then tried this, since it may be a mapping issue:

ubyte[] t = "�Æmoo";
char[] tc = cast(char[]) t;
writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");

But that does not help either.

It seems that string literals cannot contain characters that are > ASCII 127?

Hmmm...

AEon

Mar 29 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Tue, 29 Mar 2005 23:06:29 +0200, AEon <aeon2001 lycos.de> wrote:
 For a short moment I had feared all my char[] based code would not work  
 with ASCII characters > 127. But that seems to work...

 But when trying to test my function, via:

 char[]  t = "�Æmoo";   ->	aepar.d(69): invalid UTF-8 sequence


 I then tried this, since it may be a mapping issue:

 ubyte[] t = "�Æmoo";
 char[] tc = cast(char[]) t;
 writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");

 But that does not help either.

 It seems that string literals cannot contain characters that are > ASCII
 127?

Are you saving this source file as UTF-8?
D source files *must* be saved as one of the UTF variants.

Regan

p.s. I would use , instead of ~ in the writefln above as ~ will append the  
strings together forming extra temporary strings whereas , will simply  
print one at a time creating no extra temporary strings.
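
The difference between the two call styles can be sketched as follows. This is a minimal example assuming a current D2 compiler, where literals are `string` rather than `char[]`; `stripColors` is a hypothetical stand-in for `remove_q1_Color_Names`, and `joined` is a helper invented for illustration. In present-day Phobos `writefln` expects a format string as its first argument, so the sketch uses `writeln`.

```d
import std.stdio : writeln;

// Hypothetical stand-in for the poster's remove_q1_Color_Names.
string stripColors(string s)
{
    return s;
}

// Concatenation with ~ builds one temporary string before anything is printed.
string joined(string s)
{
    return "\"" ~ s ~ "\" -> \"" ~ stripColors(s) ~ "\"";
}

void main()
{
    string tc = "\u00C6moo";

    // One argument, assembled with ~ first.
    writeln(joined(tc));

    // Several arguments, written one after another without concatenation.
    writeln("\"", tc, "\" -> \"", stripColors(tc), "\"");
}
```

Both calls print the same line; only the second avoids building the intermediate string.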

Mar 29 2005

AEon <aeon2001 lycos.de> writes:

Regan Heath wrote:

 Are you saving this source file as UTF-8?
 D source files *must* be saved as one of the UTF variants.

Hmmm...UltraEdit has:

Auto detect UTF-8 files (On)
Write UTF-8 BOM header to all UTF-8 files when saved (On)
Write UTF-8 on new files created with UltraEdit (if above is not set) (On)

But since I don't even know what UTF-8 is supposed to be, and why that 
should matter... hmmm

AEon

Mar 29 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001 lycos.de> wrote:
 Regan Heath wrote:

 Are you saving this source file as UTF-8?
 D source files *must* be saved as one of the UTF variants.

 Hmmm...UltraEdit has:

 Auto detect UTF-8 files (On)
 Write UTF-8 BOM header to all UTF-8 files when saved (On)
 Write UTF-8 on new files created with UltraEdit (if above is not set)
 (On)

 But since I don't even know what UTF-8 is supposed to be, and why that
 should matter... hmmm

UTF-8, UTF-16, and UTF-32 are encodings, each of which can encode any Unicode
character. Unicode ("universal encoding") assigns a code to every known
character in every language in the world (or at least that's the idea).

UTF-8 uses 8-bit code units; 1 to 4 code units are used to represent a
single character.
UTF-16 uses 16-bit code units; 1 or 2 code units are used to represent a
single character.
UTF-32 uses 32-bit code units; exactly 1 code unit is used to represent a
single character.

UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form. In
short, BE means the most significant byte of each code unit appears first,
followed by the less significant bytes. LE is the opposite of BE.
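
The code unit counts can be checked directly in D, since the `.length` of a `string`, `wstring`, or `dstring` is its number of UTF-8, UTF-16, or UTF-32 code units. This is a small sketch assuming a current D2 compiler; the helper names `units8`, `units16`, and `units32` are invented for illustration.

```d
import std.stdio : writeln;

// .length counts code units, not characters.
size_t units8(string s)   { return s.length; } // UTF-8
size_t units16(wstring s) { return s.length; } // UTF-16
size_t units32(dstring s) { return s.length; } // UTF-32

void main()
{
    // U+00C6 (Æ) is written as an escape so the result does not depend
    // on how this source file itself was saved.
    writeln(units8("\u00C6moo"), " ", units16("\u00C6moo"w), " ",
            units32("\u00C6moo"d)); // prints "5 4 4"

    // U+1D11E (musical G clef) does not fit in a single 16-bit code unit.
    writeln(units8("\U0001D11E"), " ", units16("\U0001D11E"w), " ",
            units32("\U0001D11E"d)); // prints "4 2 1"
}
```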

It's my understanding that if you save your source file as UTF-8 then if  
it contains "�Æmoo" the literal will be saved as a valid UTF-8 sequence so  
you should be able to say:

char[] t = "�Æmoo";

and it should compile and run.
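
Why the file's encoding matters can be illustrated with the bytes involved. The sketch below assumes the original file was saved in a single-byte code page such as Latin-1 or Windows-1252 (the thread does not confirm which), where Æ is the lone byte C6. It uses `std.utf.validate` from current Phobos; `isValidUtf8` is a helper invented for illustration.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

// Returns true when the bytes form a well-formed UTF-8 sequence.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "Æmoo" as a single-byte code page would store it (assumed Latin-1):
    // C6 announces a two-byte UTF-8 sequence, but 'm' is not a continuation byte.
    ubyte[] latin1 = [0xC6, 0x6D, 0x6F, 0x6F];

    // The same text stored as UTF-8: Æ (U+00C6) becomes the bytes C3 86.
    ubyte[] utf8 = [0xC3, 0x86, 0x6D, 0x6F, 0x6F];

    writeln(isValidUtf8(latin1)); // false
    writeln(isValidUtf8(utf8));   // true
}
```

The first byte sequence is the kind of input that would produce an "invalid UTF-8 sequence" complaint.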

The next trick/problem is that the console you write it to must support
UTF-8 and be in UTF-8 mode. By default the Windows and, I believe, Unix
consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I am
not sure how this is achieved; hopefully someone else will fill us both in
:)

Regan

Mar 30 2005

AEon <aeon2001 lycos.de> writes:

Regan Heath wrote:
 On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001 lycos.de> wrote:
 
 Regan Heath wrote:

 Are you saving this source file as UTF-8?
 D source files *must* be saved as one of the UTF variants.


 Hmmm...UltraEdit has:

 Auto detect UTF-8 files (On)
 Write UTF-8 BOM header to all UTF-8 files when saved (On)
 Write UTF-8 on new files created with UltraEdit (if above is not set)
 (On)

 But since I don't even know what UTF-8 is supposed to be, and why that
 should matter... hmmm

 
 UTF-8, UTF-16, and UTF-32 are encodings, each of which can encode any Unicode
 character. Unicode ("universal encoding") assigns a code to every known
 character in every language in the world (or at least that's the idea).
 
 UTF-8 uses 8-bit code units; 1 to 4 code units are used to represent a
 single character.
 UTF-16 uses 16-bit code units; 1 or 2 code units are used to represent a
 single character.
 UTF-32 uses 32-bit code units; exactly 1 code unit is used to represent a
 single character.
 
 UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form.
 In short, BE means the most significant byte of each code unit appears
 first, followed by the less significant bytes. LE is the opposite of BE.

Thanks for explaining that. I had skimmed over it in the manual and did 
not really find it relevant. I am probably too much of an old-timer, 
who thinks 8-bit is good enough ;)

 It's my understanding that if you save your source file as UTF-8 then 
 if  it contains "�Æmoo" the literal will be saved as a valid UTF-8 
 sequence so  you should be able to say:
 
 char[] t = "�Æmoo";
 
 and it should compile and run.

I will do some tests. Possibly I did not create a "new" file, and thus 
inherited some DOS or default windows txt format, that is not UTF-8. In 
that case UltraEdit would not change it to UTF-8.

 The next trick/problem is that the console you write it to must support
 UTF-8 and be in UTF-8 mode. By default the Windows and, I believe, Unix
 consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I
 am not sure how this is achieved; hopefully someone else will fill us
 both in :)

I noticed this when running my ASCII 0-255 test string through my conversion 
function. The DOS console will not properly show the Æ (possibly a font 
issue); it will show some weird-looking "F" character. When you 
copy/paste the console output to UltraEdit, the Æ is presented as |, but 
if you redirect the exe's output to a text file (e.g. aepar -q3a > .tmp) 
and then load .tmp in UltraEdit, the Æ is shown properly.

So, as you pointed out, there are a few issues between the DOS console 
and chars > ASCII 127.

AEon

Mar 30 2005

"Carlos Santander B." <csantander619 gmail.com> writes:

Regan Heath wrote:
 The next trick/problem is that the console you write it to must support
 UTF-8 and be in UTF-8 mode. By default the Windows and, I believe, Unix
 consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I
 am not sure how this is achieved; hopefully someone else will fill us
 both in :)
 

I think Linux supports it natively (at least I haven't had any problems 
with UTF-8 output, even if I had to configure it on Ubuntu).

On Windows (I don't know if this applies to all Windows versions), you can 
do it like this: "chcp 65001". By default, it isn't set. I don't know how 
to make it the default codepage.
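
A program can also request the same code page for its own console at startup instead of relying on "chcp". The following is a hedged sketch, not a tested recipe: it assumes a current D2 compiler whose `core.sys.windows.windows` binding exposes the Win32 function `SetConsoleOutputCP`, and `useUtf8Console` is a name invented for illustration. Whether the glyphs then display correctly still depends on the console font.

```d
import std.stdio : writeln;

version (Windows)
{
    import core.sys.windows.windows : SetConsoleOutputCP;

    // Ask the console to interpret this program's output as UTF-8
    // (code page 65001, the same value "chcp 65001" sets).
    bool useUtf8Console()
    {
        return SetConsoleOutputCP(65001) != 0;
    }
}
else
{
    // Assumption: on other systems the terminal is already in UTF-8 mode.
    bool useUtf8Console()
    {
        return true;
    }
}

void main()
{
    if (!useUtf8Console())
        writeln("could not switch the console code page");
    writeln("\u00C6moo");
}
```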

 Regan

_______________________
Carlos Santander Bernal

Mar 30 2005

Chris Sauls <ibisbasenji gmail.com> writes:

AEon wrote:
 For a short moment I had feared all my char[] based code would not work 
 with ASCII characters > 127. But that seems to work...
 
 But when trying to test my function, via:
 
 char[]  t = "�Æmoo";   ->    aepar.d(69): invalid UTF-8 sequence

Try this:


This is an example of "Named Character Entities," a new feature as of 
DMD 0.116 -- Read more here:
http://www.digitalmars.com/d/lex.html#EscapeSequence
http://www.digitalmars.com/d/entity.html
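
As an illustration of the feature (not necessarily the snippet originally posted), a named character entity lets a literal contain Æ without the source file holding any byte above 127. The sketch assumes a current D2 compiler; `byEntity` and `byEscape` are names invented for illustration.

```d
import std.stdio : writeln;

// Two spellings of the same text; neither depends on the file's encoding.
string byEntity() { return "\&AElig;moo"; } // named character entity
string byEscape() { return "\u00C6moo"; }   // Unicode code point escape

void main()
{
    writeln(byEntity() == byEscape()); // true
    writeln(byEntity().length);        // 5: Æ takes two UTF-8 code units
}
```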

-- Chris Sauls

Mar 29 2005

psychotic <psychotic_member pathlink.com> writes:

Maybe you will find this example I posted on dsource.org interesting, although
it is not cross-platform. [
http://www.dsource.org/tutorials/index.php?show_example=147 ]. To quote the
description: "This Windows specific code, allows you to print international
characters (Greek for instance) on the non UTF-8 Windows console".

Best Regards
~psychotic

Apr 04 2005
