
digitalmars.D - Chars and Strs

reply Anders F Björklund <afb algonet.se> writes:
Here's another long documentation essay,
on the other "missing" D type: strings...

http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs


I'll add some D sample code on how to convert
to and from legacy encodings (manually) later.
http://www.algonet.se/~afb/d/mapping.zip
(using ftp://ftp.unicode.org/Public/MAPPINGS/)
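
As a rough idea of what such a manual conversion looks like, here is a minimal sketch in C (not the promised D sample, and not taken from mapping.zip) for the simplest case, ISO-8859-1 to UTF-8. It relies on Latin-1 bytes having the same values as code points U+0000-U+00FF. The function name latin1_to_utf8 is made up for illustration. Other legacy encodings would need a lookup table such as the ones in the Unicode MAPPINGS directory.

```c
#include <assert.h>
#include <stddef.h>

/* Convert ISO-8859-1 bytes to UTF-8 (hypothetical helper, for illustration).
 * Each Latin-1 byte equals its code point (U+0000..U+00FF), so bytes below
 * 0x80 are copied as-is and the rest become two-byte UTF-8 sequences.
 * dst must have room for 2 * len bytes; returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;                                  /* ASCII range */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte   */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F)); /* trail byte  */
        }
    }
    return out;
}
```

For example, the 9 Latin-1 bytes of "Björklund" (with ö as 0xF6) come out as 10 bytes, with ö encoded as 0xC3 0xB6.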

And some character tables for US-ASCII and Latin-1,
http://www.algonet.se/~afb/d/latin1/iso-8859-1.html
Also needed is how to talk to the Windows console,
http://www.digitalmars.com/techtips/windows_utf.html


But that can wait until after I get back from vacation :-)
Any comments can only make it better, here or on Wiki4D...

Share and Enjoy,
--anders
Feb 11 2005
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
Hi, Anders,

I am looking at:

"All UTF-16 code units from 0xD800-0xDFFF are similarly just "surrogates" 
for a real code point, and *must be occur in pairs that can then be combined 
to form the real Unicode code unit*. The lower byte of the code units 
0x0000-0x00FF are exactly the same as the ISO-8859-1 encoding, and 0x00-0x7F 
is the same as ASCII. They are also called "wide characters", by some 
operating systems."

Stuff in *...* (my mark) technically speaking is not the case.

UTF-16 corresponds to UCS-2 (Basic Multilingual Plane - BMP) and does not 
need to "occur in pairs".
It depends on the use case: program A supports only UCS-2 and program B 
supports UCS-4.

(BMP) The first plane defined in Unicode/ISO 10646, designed to include all 
scripts in active modern use. The BMP currently includes the Latin, Greek, 
Cyrillic, Devanagari, hiragana, katakana, and Cherokee scripts, among others, 
and a large body of mathematical, APL-related, and other miscellaneous 
characters. Most of the Han ideographs in current use are present in the 
BMP, but due to the large number of ideographs, many were placed in the 
Supplementary Ideographic Plane.

Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of 
functions there (AFAIK) can treat widechars as sequences of UTF-16 codes.

All modern browsers have UCS-2 as their internal representation. JavaScript 
and Java are also UCS-2-only languages (by their specs).

--------------------------------------------------------------------
I've found the D way of treating strings as char[], wchar[] and dchar[] pretty 
reasonable, as it allows working with text in the most optimal way. The only 
thing I am not sure about yet: a string as an entity has its own methods. It is 
pretty traditional these days to use strings as objects: s.substr(1,4). But in 
fact strings are atomic types, so they should be handled like any other native 
type, e.g. int.
Personally I think that substr(s,1,4) is more "honest" than s.substr(1,4), 
some aesthetic concerns aside. The deeper I look into the "string 
problem", the more I think that strings are not objects in the Java/C# 
sense. E.g. the inability to work with strings as sequences (arrays) of 
characters is a source of many bottlenecks in these languages.
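
A minimal sketch in C of the free-function style, treating a string as a plain (pointer, length) array view. The str_slice type and this substr are hypothetical, and the half-open [start, end) range is an assumption borrowed from D slicing (s[1..4]); the text above does not say what the two arguments mean.

```c
#include <assert.h>
#include <stddef.h>

/* A string seen as a plain (pointer, length) pair, roughly what a D char[]
 * is. Hypothetical type, for illustration only. */
typedef struct {
    const char *ptr;
    size_t len;
} str_slice;

/* Free-function substring over the half-open range [start, end).
 * Nothing is copied; the result is just a narrower view of the same chars. */
str_slice substr(str_slice s, size_t start, size_t end)
{
    assert(start <= end && end <= s.len);
    str_slice r = { s.ptr + start, end - start };
    return r;
}
```

With s viewing "hello", substr(s, 1, 4) is a 3-char view of "ell" that shares storage with the original.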

Andrew Fedoniouk.
http://terrainformatica.com





Feb 11 2005
next sibling parent Anders F Björklund <afb algonet.se> writes:
Andrew Fedoniouk wrote:

 "All UTF-16 code units from 0xD800-0xDFFF are similarly just "surrogates" 
 for a real code point, and *must be occur in pairs that can then be combined 
 to form the real Unicode code unit*. The lower byte of the code units 
 0x0000-0x00FF are exactly the same as the ISO-8859-1 encoding, and 0x00-0x7F 
 is the same as ASCII. They are also called "wide characters", by some 
 operating systems."
 
 Stuff in *...* (my mark) technically speaking is not the case.

Hmm, doesn't even seem to be a real sentence :-) "must be occur"

 UTF-16 corresponds to UCS-2 (Basic Multilingual Plane - BMP) and does not 
 need to "occur in pairs".
 It depends on the use case: program A supports only UCS-2 and program B 
 supports UCS-4.

What I meant to say was that *surrogates* need to be in pairs... (0xD800-0xDFFF)
Not all the other individual UTF-16 code units.

Got it from http://www.unicode.org/faq/utf_bom.html#UTF16
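
To illustrate the pairing rule, here is a minimal sketch in C of how a lead/trail surrogate pair combines into one code point, using the standard UTF-16 arithmetic. The function names are hypothetical, and returning U+FFFD for an unpaired surrogate is one common choice, not the only one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogates are 0xD800..0xDBFF, low (trail) are 0xDC00..0xDFFF. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid surrogate pair into one code point (U+10000..U+10FFFF). */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Decode the code point starting at units[*i] and advance *i past it.
 * Only units in the surrogate range need a partner; every other UTF-16
 * code unit is a complete code point on its own. */
uint32_t utf16_next(const uint16_t *units, size_t len, size_t *i)
{
    uint16_t u = units[(*i)++];
    if (is_high_surrogate(u) && *i < len && is_low_surrogate(units[*i]))
        return combine_surrogates(u, units[(*i)++]);
    if (is_high_surrogate(u) || is_low_surrogate(u))
        return 0xFFFD; /* unpaired surrogate: substitute U+FFFD */
    return u;
}
```

For example, the pair 0xD83D 0xDE00 combines to U+1F600, while 0x0041 decodes to U+0041 by itself.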

 Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of 
 functions there (AFAIK) can treat widechars as sequences of UTF-16 codes.
 
 All modern browsers have UCS-2 as their internal representation. JavaScript 
 and Java are also UCS-2-only languages (by their specs).

Right, the "wide characters" should be mentioned down by the Z stuff...

--anders
Feb 11 2005
prev sibling parent Roald Ribe <rr.nospam nospam.teikom.no> writes:
Andrew Fedoniouk wrote:

 Windows natively uses UCS-2 (win32::wchar_t, "widechar") and only a couple of 
 functions there (AFAIK) can treat widechars as sequences of UTF-16 codes.

Most of the WIN32 API has two entries for each function. The 8-bit
character API functions have A appended to their names, and the 16-bit
character functions have W appended. This is how each application can
choose which API to use.

* On NT-based kernels the 16-bit char is what is used natively, and
  the A APIs just convert from the currently selected codepage into
  Unicode before calling the W API.
* On 9x/Me kernels, the W APIs come as redistributable DLLs (apps
  can include them in their installer). On these systems the W API
  just converts the Unicode strings/chars into the current codepage
  (where possible) and then calls the native A APIs.

So to conclude: most (all?) of the WIN32 API is available in both 8-bit
and 16-bit versions. The only exception may be WIN32s, but I do not think
anyone uses that for new software releases (if they ever did).

Roald
Feb 12 2005