
digitalmars.D - ICU (International Components for Unicode)

reply Arcane Jill <Arcane_member pathlink.com> writes:
Following Ben's mention of ICU, I've been checking out what it does and doesn't
do. Basically, it does EVERYTHING. There is the work of years there. Not just
Unicode stuff, but a whole swathe of classes for internationalization and
transcoding. It would take me a very long time to duplicate that. It's also
free, open source, and with a license which basically says there's no problem
with our using it.

So I'm thinking seriously about ditching the whole etc.unicode project and
replacing it with a wrapper around ICU.

It's not completely straightforward. ICU is written in C++ (not C), and so we
can't link against it directly. It uses classes, not raw functions. So, I'd have
to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have
to write a D wrapper to call the C wrapper - at which point we could get the
classes back again (and our own choice of architecture, so plugging into std or
mango streams won't suffer).

But the outermost (D) wrapper can, at least, be composed of D classes.
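The C-wrapper approach described above can be sketched in miniature. This is not ICU's actual API - the class and function names below are invented purely for illustration - but it shows the pattern: the C++ object is hidden behind an opaque handle, and extern "C" free functions give a flat C ABI that a D extern (C) binding could call.

```cpp
#include <string>

// A miniature stand-in for an ICU-style C++ class. The names here are
// invented for illustration; they are NOT ICU's real API.
class Transliterator {
    std::string id_;
public:
    explicit Transliterator(const char* id) : id_(id) {}
    const std::string& name() const { return id_; }
};

// The C wrapper layer: an opaque handle plus free functions with C linkage,
// which a D `extern (C)` declaration could bind to directly.
extern "C" {
    typedef void* DicuHandle;

    DicuHandle dicu_open(const char* id) {
        return new Transliterator(id);
    }
    const char* dicu_name(DicuHandle h) {
        return static_cast<Transliterator*>(h)->name().c_str();
    }
    void dicu_close(DicuHandle h) {
        delete static_cast<Transliterator*>(h);
    }
}
```

The D-side wrapper would then rebuild classes on top of these handles, restoring the object-oriented shape of the C++/Java APIs.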

If we want D to be the language of choice for Unicode, we would need all this
functionality. So, if we went the ICU route, we'd need to bundle ICU (currently
a ten megabyte zip file) with DMD, along with whatever wrapper I come up with.
(etc.unicode is not likely to be smaller).

I'd like to see some discussion on this. Read this page to inform yourself:
http://oss.software.ibm.com/icu/userguide/index.html


Finally, back to strings. Ben was right. The ICU says:

"In order to take advantage of Unicode with its large character repertoire and
its well-defined properties, there must be types with consistent definitions and
semantics. The Unicode standard defines a default encoding based on 16-bit code
units. This is supported in ICU by the definition of the UChar to be an unsigned
16-bit integer type. This is the base type for character arrays for strings in
ICU."

Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no special
string class - a string is just an array of wchars. So obviously, if we go the
ICU route, I withdraw my suggestion to ditch the wchar.

What I now recommend is:

(1) Ditch "etc.unicode" in favor of - let's call it "etc.icu" (a D wrapper
around ICU). Eventually I hope for this to change into "std.icu" (as I
originally hoped that "etc.unicode" would turn into "std.unicode").

(2) Ditch the char. 8-bits is really too small for a character these days,
honestly, and all previous arguments still apply. The existence of char only
encourages ASCII and discourages Unicode anyway.

(3) Native D strings shall be arrays of wchars. This means that
Object.toString() must return a wchar[], and string literals in D source must
compile to wchar[]s. ICU's type UChar would map directly to wchar. To reinforce
this, there should probably be an



in object.d.

(4) We retain dchar (so that we can get character properties), but all string
code is based on wchar[]s, not dchar[]s. ICU's type UChar32 would map directly
to dchar.

(5) Transcoding/streams/etc. go ahead as planned, but based around wchar[]s
instead of dchar[]s.

(6) That's pretty much it, although once "char" is gone, we could rename "wchar"
as "char" (a la Java).


Discussion please? And I really do want this talked through because it affects D
work I'm currently involved in.

Input is also requested from Walter - in particular the request that
Object.toString() be re-jigged to return wchar[] instead of char[].

Okay, let's chew this one over.

Jill
Aug 23 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgcv4n$2fsf$1 digitaldaemon.com>, Arcane Jill says...

It's not completely straightforward. ICU is written in C++ (not C), and so we
can't link against it directly. It uses classes, not raw functions. So, I'd have
to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have
to write a D wrapper to call the C wrapper - at which point we could get the
classes back again (and our own choice of architecture, so plugging into std or
mango streams won't suffer).
Or I could port it to D! Or...

In article <cgd0h9$2ggf$1 digitaldaemon.com>, Juanjo Álvarez says...
AJ, I don't know a shit^D^D^D^D too much about Unicode but your excitement
about ICU is really contagious. Only one question: are the C wrappers at
the same level as the C++/Java ones? If so it seems that with a little
easy and boring (compared to writing etc.unicode) wrapping we're going to
have a first-class Unicode lib :) => (i18n version of <g>)
I don't know, as I've only just started looking into it.

But either way (port or wrap C) we move the time spent on development from a
year or so down to only a few months. It really is worth thinking about, but it
/does/ mean that D really should standardize on wchar[] strings, and this has
consequences for (a) parsing string literals, and (b) Object.toString() - and
probably a few other things too, not to mention all the code it would break,
and the future (or not) of the char type.

It's all this that I'd be concerned about, and it should really be discussed by
all of us in the D community, and Walter as its architect - not just those of
us interested in Unicode.

Arcane Jill
Aug 23 2004
prev sibling next sibling parent reply Jörg Rüppel <joerg sharky-x.de> writes:
Arcane Jill wrote:
 
 So I'm thinking seriously about ditching the whole etc.unicode project and
 replacing it with a wrapper around ICU.
 
 It's not completely straightforward. ICU is written in C++ (not C), and so we
 can't link against it directly. It uses classes, not raw functions. So, I'd
 have to write a C wrapper around ICU which gave me a C interface, and /then/
 I'd have to write a D wrapper to call the C wrapper - at which point we could
 get the classes back again (and our own choice of architecture, so plugging
 into std or mango streams won't suffer).
According to the API docs at http://oss.software.ibm.com/icu/apiref/index.html
there is a C API. Didn't you see that, or is there a reason why it can't be
used?

Regards,
Jörg
Aug 23 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgd5go$2j1k$1 digitaldaemon.com>, Jörg Rüppel says...

According to the API docs at 
http://oss.software.ibm.com/icu/apiref/index.html there is a C API. 
Didn't you see that or is there a reason why that can't be used?
I just didn't see it, that's all.

I think now that the best approach would be something which is part port and
part wrapper around the C API. I would want the D interface to maintain the
classes and so forth which are present in the C++ and Java APIs, so the D API's
C wrappers would have to be part port anyway, even if only to recreate the
class hierarchy and put things back into member functions.

Jill
Aug 23 2004
prev sibling next sibling parent Julio César Carrascal Urquijo writes:
Most of ICU is written in C, with C++ wrappers.



Aug 23 2004
prev sibling next sibling parent "antiAlias" <fu bar.com> writes:
This would be a great thing for D to adopt. Just a few things to note:

1) This would be best served as a DLL (given its size). In fact, the team
apparently like to compile the string-resource files into DLLs (which makes
a lot of sense IMO). If D treated DLLs as first-class citizens, this would
be a no-brainer. Right now, that's not the case.

2) There's a rather nice String class (C++). That's a perfect candidate for
porting directly to D.

3) From what I've seen, the lib is mostly C. Even better, it eschews the
traditional morass of header files. Building an ICU.d import will be much
easier because of this. The project on dsource.org might handle that part
without issue? Wrapping those library functions with D shells would be nice,
if only to take advantage of D arrays.

4) The transcoders deal with arrays of buffered data, so they're efficient.
ICU has transcoders and code-page tables up the wazoo.

For those who haven't looked through the lib, it's far more than just
Unicode transcoders (as Jill notes): you get sophisticated and flexible date
& number parsing/formatting; I18N message ordering; BiDi support;
collation/sorting support; text-break algorithms; a text layout engine;
unicode regex; and much more.

It's a first-class suite of libraries, and an awesome resource to leverage.



Aug 23 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgcv4n$2fsf$1 digitaldaemon.com...
 Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no
 special string class - a string is just an array of wchars. So obviously,
 if we go the ICU route, I withdraw my suggestion to ditch the wchar.
Ok, but suppose we ditch char[]. Then, we find some great library we want to
bring into D, or build a D interface to, that is in char[].
 Input is also requested from Walter - in particular the request that
 Object.toString() be re-jigged to return wchar[] instead of char[].
My experience with all-wchar is that its performance is not the best. It'll
also become a nuisance interfacing with C. I'd rather explore perhaps making
implicit conversions between the 3 utf types more seamless.
Aug 23 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 14:06:57 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgcv4n$2fsf$1 digitaldaemon.com...
 Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no
 special string class - a string is just an array of wchars. So obviously,
 if we go the ICU route, I withdraw my suggestion to ditch the wchar.
Ok, but suppose we ditch char[]. Then, we find some great library we want to
bring into D, or build a D interface to, that is in char[].
 Input is also requested from Walter - in particular the request that
 Object.toString() be re-jigged to return wchar[] instead of char[].
My experience with all-wchar is that its performance is not the best. It'll
also become a nuisance interfacing with C. I'd rather explore perhaps making
implicit conversions between the 3 utf types more seamless.
YAY! .. Sorry, I can't help myself - I think _this_ is the way to go. See my
post here: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/9494 for
my arguments.

I would add one additional argument: I can't imagine the suggested change
breaking existing code.. more likely it will fix existing bugs.

Regan.

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 23 2004
parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Regan Heath wrote:

 Ok, but suppose we ditch char[]. Then, we find some great library we want to
 bring into D, or build a D interface to, that is in char[].
Excuse me if I'm saying something stupid, but wouldn't byte[] do the job of
interfacing with C char[]?
Aug 23 2004
next sibling parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 01:18:09 +0200, Juanjo Álvarez 
<juanjuxNO SPAMyahoo.es> wrote:
 Regan Heath wrote:
I didn't write this! :)
 Ok, but suppose we ditch char[]. Then, we find some great library we want to
 bring into D, or build a D interface to, that is in char[].
 Excuse me if I'm saying something stupid but byte[] would not do the job of
 interfacing with C char[]?
I almost made that same comment in reply to Walter (who made the comment
above).

I think you're right: you could use byte[]. In fact it'd be more correct to use
byte[], as the C 'char' type is a byte with no specified encoding (whereas D's
char[] is UTF-8 encoded).

If we had no char[] you'd have to transcode the byte[] to dchar[]. This is why
I disagree with removing char[]; I think char[] has a place in D, I just want
to see implicit transcoding of char[] to wchar[] to dchar[].

Regan
Aug 23 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message
news:cgdu4b$2ed$1 digitaldaemon.com...
 Regan Heath wrote:

 Ok, but suppose we ditch char[]. Then, we find some great library we want to
 bring into D, or build a D interface to, that is in char[].

 Excuse me if I'm saying something stupid but byte[] would not do the job of
 interfacing with C char[]?
That could work, but it just wouldn't look right.
Aug 23 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 16:41:10 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message
 news:cgdu4b$2ed$1 digitaldaemon.com...
 Regan Heath wrote:

 Ok, but suppose we ditch char[]. Then, we find some great library we want to
 bring into D, or build a D interface to, that is in char[].

 Excuse me if I'm saying something stupid but byte[] would not do the job of
 interfacing with C char[]?
That could work, but it just wouldn't look right.
http://www.digitalmars.com/d/htomodule.html specifically states that C's 'char'
should be represented by a 'byte' in D. So when building an interface to a C
lib that uses char[], you'd use byte[].

Regan.
Aug 23 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc7lqdnp5a2sq9 digitalmars.com...
 http://www.digitalmars.com/d/htomodule.html

 Specifically states that C's 'char' should be represented by a 'byte' in D.
 So when building an interface to the C lib that uses char[] you'd use
 byte[].
I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
Aug 23 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 23:35:07 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsc7lqdnp5a2sq9 digitalmars.com...
 http://www.digitalmars.com/d/htomodule.html

 Specifically states that C's 'char' should be represented by a 'byte' in D.
 So when building an interface to the C lib that uses char[] you'd use
 byte[].
I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
However, an old C lib might return Latin-1 (or any other encoding) encoded
data, in which case you also have to use ubyte, then transcode to UTF-8 and
store in char[] (if that is the desired result). Right?

Regan
Aug 24 2004
parent "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc9o90n65a2sq9 digitalmars.com...
 On Mon, 23 Aug 2004 23:35:07 -0700, Walter <newshound digitalmars.com>
 wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsc7lqdnp5a2sq9 digitalmars.com...
 http://www.digitalmars.com/d/htomodule.html

 Specifically states that C's 'char' should be represented by a 'byte' in D.
 So when building an interface to the C lib that uses char[] you'd use
 byte[].
I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed
char' in C are used not as text, but as very small integers, the corresponding
D types should be ubyte and byte.
However, an old C lib might return latin-1 (or any other encoding) encoded data, in which case you also have to use ubyte then transcode to utf-8 and store in char[] (if that is the desired result). Right?
Yup. You'll have to understand what the C code is using the char type for, in order to select the best equivalent D type.
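The transcoding step described here can be sketched concretely (a hypothetical helper, not part of Phobos or ICU): every Latin-1 byte value is its own Unicode code point, so conversion to UTF-8 is mechanical - bytes below 0x80 copy through, and bytes from 0x80 upward become exactly one two-byte sequence.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: transcode Latin-1 bytes (as returned by an old C
// library) into UTF-8. A Latin-1 byte IS its Unicode code point, so bytes
// < 0x80 copy through and bytes >= 0x80 become a two-byte sequence.
std::string latin1ToUtf8(const std::vector<uint8_t>& in) {
    std::string out;
    for (uint8_t b : in) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));    // leading byte 110xxxxx
            out += static_cast<char>(0x80 | (b & 0x3F));  // continuation 10xxxxxx
        }
    }
    return out;
}
```

Going the other way (or handling Shift-JIS, etc.) is what makes a general transcoding layer like ICU's worth having - each legacy charset needs its own table or algorithm.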
Aug 25 2004
prev sibling next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...

My experience with all-wchar is that its performance is not the best. It'll
also become a nuisance interfacing with C. I'd rather explore perhaps making
implict conversions between the 3 utf types more seamless.
Implicit conversions are absolutely fine by me. In fact, that was the first
suggestion (then the discussion wandered, as things do, along the lines of "if
they're interchangeable, why not have just the one type").

But sure, I'd be more than happy with implicit conversions between the three
UTF types. Since such conversions lose no information in any direction, they
are always guaranteed to be harmless.

Jill
Aug 24 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...

My experience with all-wchar is that its performance is not the best.
It's usually regarded as the best by most other sources, however. Converting
between wchar[] and dchar[] is /almost/ as fast as doing a memcpy(), because
UTF-16 encoding is very, very simple. UTF-16 is as efficient for the codepoint
range U+0000 to U+FFFF as UTF-8 is for the codepoint range U+0000 to U+007F.

Outside of these ranges, UTF-16 is still very fast, since all remaining
characters consist of /precisely/ two wchars, whereas UTF-8 conversion outside
of the ASCII range is always going to be inefficient, what with its variable
number of required bytes, variable-width bitmasks, and the additional
requirements of validation and rejection of non-shortest sequences. If you're
arguing on the basis of performance, UTF-8 loses hands down.
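The arithmetic behind this claim is easy to show (a minimal sketch with a hypothetical helper, not ICU code): encoding a code point to UTF-16 is a couple of shifts and masks, and any character above U+FFFF is exactly one surrogate pair.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units, illustrating the
// point above: code points up to U+FFFF are a single 16-bit unit, and
// everything above is precisely one surrogate pair - no variable-length
// scanning as in UTF-8.
std::vector<uint16_t> toUtf16(uint32_t cp) {
    if (cp <= 0xFFFF) {
        return { static_cast<uint16_t>(cp) };             // BMP: one code unit
    }
    cp -= 0x10000;                                        // 20 bits remain
    return {
        static_cast<uint16_t>(0xD800 | (cp >> 10)),       // high surrogate
        static_cast<uint16_t>(0xDC00 | (cp & 0x3FF)),     // low surrogate
    };
}
```

For example, U+1D11E (the musical G clef) encodes as the pair D834 DD1E.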
It'll
also become a nuisance interfacing with C.
Actually, it's D's char (which C doesn't have) which is a nuisance interfacing
with C. As others have pointed out, C has no type which enforces UTF-8
encoding, and in fact on the Windows PC on which I am typing right now, every C
char is going to be storing characters from Windows code page 1252 unless I
take special action to do otherwise. That is /not/ interchangeable with D's
chars. Beyond U+007F, one C char corresponds to two (or more) D chars. You
don't regard that as a nuisance?

In fact, I believe that comment of yours which I just quoted above actually
adds further weight to my argument. I argue that the existence of the char type
*causes confusion*. People /think/ (erroneously) that it does the same job as
C's char, and is interchangeable therewith. Now if you, the architect of D, can
fall prey to that confusion, I would take that as clear evidence that such
confusion exists.
I'd rather explore perhaps making
implict conversions between the 3 utf types more seamless.
Yes. If all three D string types were implicitly convertible, then there would
be nothing for me to complain about. (The char confusion would still exist, but
that's just education)

Arcane Jill
Aug 24 2004
next sibling parent reply Julio César Carrascal Urquijo writes:
Arcane Jill wrote:
 (...)
 adds further weight to my argument. I argue that the existence of the char type
 *causes confusion*. People /think/ (erroneously) that it does the same job as
 C's char, and is interchangable therewith. Now if you, the architect of D, can
 fall prey to that confusion, I would take that as clear evidence that such
 confusion exists.
 (...)
I'm sorry to interrupt. From your reply I would say that you are arguing more
about the name of the type being used as the *representation of a character*
(which it is not).

Maybe we should ask Walter to change the names of the types to utf8, utf16 and
utf32. I read somewhere that those were the original names in earlier DMD
implementations.

If that's what you are asking, you have my vote.

--
Julio César Carrascal Urquijo
Aug 24 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgfmdm$v3t$1 digitaldaemon.com>, 

Maybe we should ask Walter to change the names of the types to utf8, 
utf16 and utf32. I read somewhere that those where the original names in 
earlier DMD implementations.

If that's what you are asking, you have my vote.
No, I wasn't asking that, and I don't really care what things are called. I do
/try/ to stay on the topic of the thread title, and in this thread the
discussion is about how/whether to make use of ICU for our internationalization
and Unicode needs.

I've been looking more closely at ICU, and I keep being (pleasantly) surprised
to discover that it already has zillions of other goodies not previously
mentioned in this thread, but which we've been talking about in this forum in
the past. For example - Locales, ResourceBundles, everything you need for text
internationalization/localization.

It's relevant to D's character types because ICU has only two character types -
a type equivalent to wchar that is used to make UTF-16 strings, and a type
equivalent to dchar that is used to access character properties. The important
detail here is that ICU strings are wchar[]s, but D's basic "string" concept is
char[]. So, calling lots of ICU routines would result in lots of explicit
toUTF8() and toUTF16() calls all over your code, /unless/ either:

(1) D adopted wchar[] as the basic string type, or
(2) D implicitly auto-converted between its various string types as required

I'm trying to suggest that (1) is the best option. That's all. Walter prefers
(2), but that's acceptable too.

As a corollary to (1), if we start using wchar[]s as the default native string
used by Phobos and the compiler, it would then follow that the char type would
be superfluous and could be dropped. Or at least, it seems that way to me.
Opinions differ. However, the "should we ditch the char or not?" discussion is
over on another thread.

Renaming the character types is kind of irrelevant to this, although it is
pertinent to Walter's reply. No - I'm not asking that they be renamed (except
insofar as, if "char" is ditched, then "wchar" could be renamed "char", but
that again is for the other thread).

Arcane Jill
Aug 24 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgf8d1$os0$1 digitaldaemon.com...
 In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...

My experience with all-wchar is that its performance is not the best.
 It's usually regarded as the best by most other sources, however. Converting
 between wchar[] and dchar[] is /almost/ as fast as doing a memcpy(), because
 UTF-16 encoding is very, very simple. UTF-16 is as efficient for the codepoint
 range U+0000 to U+FFFF as UTF-8 is for the codepoint range U+0000 to U+007F.
 Outside of these ranges, UTF-16 is still very fast, since all remaining
 characters consist of /precisely/ two wchars, whereas UTF-8 conversion outside
 of the ASCII range is always going to be inefficient, what with its variable
 number of required bytes, variable-width bitmasks, and the additional
 requirements of validation and rejection of non-shortest sequences. If you're
 arguing on the basis of performance, UTF-8 loses hands down.
Converting would be faster, sure, but if the bulk of your app is char[], there is little conversion happening.
It'll
also become a nuisance interfacing with C.
 Actually, it's D's char (which C doesn't have) which is a nuisance interfacing
 with C. As others have pointed out, C has no type which enforces UTF-8
 encoding, and in fact on the Windows PC on which I am typing right now, every
 C char is going to be storing characters from Windows code page 1252 unless I
 take special action to do otherwise. That is /not/ interchangeable with D's
 chars. Beyond U+007F, one C char corresponds to two (or more) D chars. You
 don't regard that as a nuisance?
I've been dealing with multibyte charsets in C for decades - it's not just UTF-8 that's multibyte, there are also the Shift-JIS, Korean, and Taiwan code pages. You can also set up your windows machine so UTF-8 *is* the charset used by the "A" APIs. I've written UTF-8 apps in C, and D would map onto them directly. There is no way to avoid, when interfacing with C, dealing with whatever charset it might be in. It can't happen automatically. And that is the source of the nuisance.
In fact, I believe that comment of yours which I just quoted above actually adds further weight to my argument. I argue that the existence of the char type *causes confusion*. People /think/ (erroneously) that it does the same job as C's char, and is interchangeable therewith. Now if you, the architect of D, can fall prey to that confusion, I would take that as clear evidence that such confusion exists.

I'd rather explore perhaps making implicit conversions between the 3 utf types more seamless.
Yes. If all three D string types were implicitly convertible, then there would be nothing for me to complain about. (The char confusion would still exist, but that's just education.)
Aug 24 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgcv4n$2fsf$1 digitaldaemon.com...
 Following Ben's mention of ICU, I've been checking out what it does and doesn't do. Basically, it does EVERYTHING. There is the work of years there. Not just Unicode stuff, but a whole swathe of classes for internationalization and transcoding. It would take me a very long time to duplicate that. It's also free, open source, and with a license which basically says there's no problem with our using it.

 So I'm thinking seriously about ditching the whole etc.unicode project and replacing it with a wrapper around ICU.

 It's not completely straightforward. ICU is written in C++ (not C), and so we can't link against it directly. It uses classes, not raw functions. So, I'd have to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have to write a D wrapper to call the C wrapper - at which point we could get the classes back again (and our own choice of architecture, so plugging into std or mango streams won't suffer).

 But the outermost (D) wrapper can, at least, be composed of D classes.

 If we want D to be the language of choice for Unicode, we would need all this functionality. So, if we went the ICU route, we'd need to bundle ICU (currently a ten megabyte zip file) with DMD, along with whatever wrapper I come up with. (etc.unicode is not likely to be smaller).

 I'd like to see some discussion on this. Read this page to inform yourself:
 http://oss.software.ibm.com/icu/userguide/index.html
It sounds pretty cool. It being a very large library, is only what you use linked in? Or does using anything tend to pull in the whole shebang? I hope it's the former!
Aug 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cghofr$202k$3 digitaldaemon.com>, Walter says...

It sounds pretty cool. It being a very large library, is only what you use linked in? Or does using anything tend to pull in the whole shebang? I hope it's the former!
No idea (yet). I worried about that myself, but it's a library, not an indivisible object file, so it must have /some/ granularity.

But the functionality of ICU goes way beyond what I was even planning to achieve - there's even stuff for font rendering in there, for feck's sake! (Unicode allows you to put any accent over any glyph, ligate any two glyphs into one, etc. Font rendering engines have a fair bit of work to do.) So ICU is hard to beat. Therefore, I'm inclining to the view that we'd be better off trying to plumb ICU into D than to reject it because it fails to meet any one single D requirement.

So, if it won't split into sufficiently small chunks on linking, I'd say that was as good an argument as any for improving D's DLL support. ICU would clearly be an excellent candidate for a DLL (assuming we can have classes in DLLs).

Jill
Aug 25 2004
parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cghrn3$21dg$1 digitaldaemon.com...
 In article <cghofr$202k$3 digitaldaemon.com>, Walter says...

It sounds pretty cool. It being a very large library, is only what you use linked in? Or does using anything tend to pull in the whole shebang? I hope it's the former!
No idea (yet). I worried about that myself, but it's a library, not an indivisible object file, so it must have /some/ granularity.

 But the functionality of ICU goes way beyond what I was even planning to achieve - there's even stuff for font rendering in there, for feck's sake! (Unicode allows you to put any accent over any glyph, ligate any two glyphs into one, etc. Font rendering engines have a fair bit of work to do.) So ICU is hard to beat. Therefore, I'm inclining to the view that we'd be better off trying to plumb ICU into D than to reject it because it fails to meet any one single D requirement.

 So, if it won't split into sufficiently small chunks on linking, I'd say that was as good an argument as any for improving D's DLL support. ICU would clearly be an excellent candidate for a DLL (assuming we can have classes in DLLs).

The ICU has a C API. You want to port the C++ API to D, so why not compile a DLL with the C API, and a lib with the ported D API? This would also be good for minimum disturbance between the ICU source and the D API.

Roald
Aug 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgi7f6$25qf$1 digitaldaemon.com>, Roald Ribe says...

The ICU has a C API, you want to port the C++ API to D, why not compile a DLL with the C API, and a lib with the ported D API? This would also be good for minimum disturbance between the ICU source and the D API.
This is possibly the best idea yet, but I'm going to have to study the ICU source before I know how feasible that is.

Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment operator. This isn't possible with D.

The above two points mean that the D API is probably going to look more like the Java API than the C++ API. Exactly how much of it can be done via a wrapper around C I'm not sure (yet). But yes - if the C-based core can be put into a DLL, and a D API built which calls it, I guess that would work a treat.

Arcane Jill
Aug 25 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D
programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment
operator. This isn't possible with D.
What are those two operators used for? Could the functionality be worked around somehow?

Sean
Aug 25 2004
next sibling parent pragma <pragma_member pathlink.com> writes:
In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D
programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment
operator. This isn't possible with D.
What are those two operators used for? Could the functionality be worked around somehow?
I'm sure it can be worked around, as there is a Java interface to ICU already available. ( http://oss.software.ibm.com/icu4j/download/ ) ;)

This table ( http://oss.software.ibm.com/icu4j/comparison/index.html ) is about as close an indicator as we're going to get for what D would need to do to compete with what's already out there.

-Pragma
[[ EricAnderton at (code it, and they will come) yahoo.com ]]
Aug 25 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D
programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment
operator. This isn't possible with D.
What are those two operators used for?
I don't know. I'm still looking into it all. Fortunately, I don't think it matters.
Could the functionality be worked around somehow?
Obviously yes, since there is a Java API.
Aug 25 2004
parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgihb5$2aub$1 digitaldaemon.com...
 In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment operator. This isn't possible with D.
What are those two operators used for?
I don't know. I'm still looking into it all. Fortunately, I don't think it matters.
Could the functionality be worked around somehow?
Obviously yes, since there is a Java API.
The Java API would likely be the best starting point. IBM is no slouch in its support of Java, and the Java interface will have solved the issues with interfacing to a GC language.
Aug 25 2004