
digitalmars.D - ICU (International Components for Unicode)

reply Arcane Jill <Arcane_member pathlink.com> writes:
Following Ben's mention of ICU, I've been checking out what it does and doesn't
do. Basically, it does EVERYTHING. There is the work of years there. Not just
Unicode stuff, but a whole swathe of classes for internationalization and
transcoding. It would take me a very long time to duplicate that. It's also
free, open source, and with a license which basically says there's no problem
with our using it.

So I'm thinking seriously about ditching the whole etc.unicode project and
replacing it with a wrapper around ICU.

It's not completely straightforward. ICU is written in C++ (not C), and so we
can't link against it directly. It uses classes, not raw functions. So, I'd have
to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have
to write a D wrapper to call the C wrapper - at which point we could get the
classes back again (and our own choice of architecture, so plugging into std or
mango streams won't suffer).

But the outermost (D) wrapper can, at least, be composed of D classes.
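The C-wrapper approach described above can be sketched in miniature. This is not ICU's actual API - the class and function names below are invented purely for illustration - but it shows the pattern: the C++ object is hidden behind an opaque handle, and extern "C" free functions give a flat C ABI that a D extern (C) binding could call.

```cpp
#include <string>

// A miniature stand-in for an ICU-style C++ class. The names here are
// invented for illustration; they are NOT ICU's real API.
class Transliterator {
    std::string id_;
public:
    explicit Transliterator(const char* id) : id_(id) {}
    const std::string& name() const { return id_; }
};

// The C wrapper layer: an opaque handle plus free functions with C linkage,
// which a D `extern (C)` declaration could bind to directly.
extern "C" {
    typedef void* DicuHandle;

    DicuHandle dicu_open(const char* id) {
        return new Transliterator(id);
    }
    const char* dicu_name(DicuHandle h) {
        return static_cast<Transliterator*>(h)->name().c_str();
    }
    void dicu_close(DicuHandle h) {
        delete static_cast<Transliterator*>(h);
    }
}
```

The D-side wrapper would then rebuild classes on top of these handles, restoring the object-oriented shape of the C++/Java APIs.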

If we want D to be the language of choice for Unicode, we would need all this
functionality. So, if we went the ICU route, we'd need to bundle ICU (currently
a ten megabyte zip file) with DMD, along with whatever wrapper I come up with.
(etc.unicode is not likely to be smaller).

I'd like to see some discussion on this. Read this page to inform yourself:
http://oss.software.ibm.com/icu/userguide/index.html


Finally, back to strings. Ben was right. The ICU says:

"In order to take advantage of Unicode with its large character repertoire and
its well-defined properties, there must be types with consistent definitions and
semantics. The Unicode standard defines a default encoding based on 16-bit code
units. This is supported in ICU by the definition of the UChar to be an unsigned
16-bit integer type. This is the base type for character arrays for strings in
ICU."

Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no special
string class - a string is just an array of wchars. So obviously, if we go the
ICU route, I withdraw my suggestion to ditch the wchar.

What I now recommend is:

(1) Ditch "etc.unicode" in favor of - let's call it "etc.icu" (a D wrapper
around ICU). Eventually I hope for this to change into "std.icu" (as I
originally hoped that "etc.unicode" would turn into "std.unicode").

(2) Ditch the char. 8-bits is really too small for a character these days,
honestly, and all previous arguments still apply. The existence of char only
encourages ASCII and discourages Unicode anyway.

(3) Native D strings shall be arrays of wchars. This means that
Object.toString() must return a wchar[], and string literals in D source must
compile to wchar[]s. ICU's type UChar would map directly to wchar. To reinforce
this, there should probably be an



in object.d.

(4) We retain dchar (so that we can get character properties), but all string
code is based on wchar[]s, not dchar[]s. ICU's type UChar32 would map directly
to dchar.

(5) Transcoding/streams/etc. go ahead as planned, but based around wchar[]s
instead of dchar[]s.

(6) That's pretty much it, although once "char" is gone, we could rename "wchar"
as "char" (a la Java).


Discussion please? And I really do want this talked through because it affects D
work I'm currently involved in.

Input is also requested from Walter - in particular the request that
Object.toString() be re-jigged to return wchar[] instead of char[].

Okay, let's chew this one over.

Jill
Aug 23 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgcv4n$2fsf$1 digitaldaemon.com>, Arcane Jill says...

It's not completely straightforward. ICU is written in C++ (not C), and so we
can't link against it directly. It uses classes, not raw functions. So, I'd have
to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have
to write a D wrapper to call the C wrapper - at which point we could get the
classes back again (and our own choice of architecture, so plugging into std or
mango streams won't suffer).
Or I could port it to D! Or...

In article <cgd0h9$2ggf$1 digitaldaemon.com>, Juanjo Álvarez says...
AJ, I don't know a shit^D^D^D^D too much about Unicode but your excitement
about ICU is really contagious. Only one question: are the C wrappers at
the same level as the C++/Java ones? If so it seems that with a little
easy and boring (compared to writing etc.unicode) wrapping we're going to
have a first-class Unicode lib :) => (i18n version of <g>)
I don't know, as I've only just started looking into it.

But either way (port or wrap C) we move the time spent on development from a
year or so down to only a few months. It really is worth thinking about, but it
/does/ mean that D really should standardize on wchar[] strings, and this has
consequences for (a) parsing string literals, and (b) Object.toString() - and
probably a few other things too, not to mention all the code it would break,
and the future (or not) of the char type.

It's all this that I'd be concerned about, and it should really be discussed by
all of us in the D community, and Walter as its architect - not just those of
us interested in Unicode.

Arcane Jill
Aug 23 2004
prev sibling next sibling parent reply Jörg Rüppel <joerg sharky-x.de> writes:
Arcane Jill wrote:
 
 So I'm thinking seriously about ditching the whole etc.unicode project and
 replacing it with a wrapper around ICU.
 
 It's not completely straightforward. ICU is written in C++ (not C), and so we
 can't link against it directly. It uses classes, not raw functions. So, I'd
 have to write a C wrapper around ICU which gave me a C interface, and /then/
 I'd have to write a D wrapper to call the C wrapper - at which point we could
 get the classes back again (and our own choice of architecture, so plugging
 into std or mango streams won't suffer).
According to the API docs at http://oss.software.ibm.com/icu/apiref/index.html
there is a C API. Didn't you see that, or is there a reason why it can't be
used?

Regards,
Jörg
Aug 23 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgd5go$2j1k$1 digitaldaemon.com>, Jörg Rüppel says...

According to the API docs at 
http://oss.software.ibm.com/icu/apiref/index.html there is a C API. 
Didn't you see that or is there a reason why that can't be used?
I just didn't see it, that's all.

I think now that the best approach would be something which is part port and
part wrapper around the C API. I would want the D interface to maintain the
classes and so forth which are present in the C++ and Java APIs, so the D API's
C wrappers would have to be part port anyway, even if only to recreate the
class hierarchy and put things back into member functions.

Jill
Aug 23 2004
prev sibling next sibling parent Julio César Carrascal Urquijo writes:
Most of ICU is written in C, with C++ wrappers.



Aug 23 2004
prev sibling next sibling parent "antiAlias" <fu bar.com> writes:
This would be a great thing for D to adopt. Just a few things to note:

1) This would be best served as a DLL (given its size). In fact, the team
apparently like to compile the string-resource files into DLLs (which makes
a lot of sense IMO). If D treated DLLs as first-class citizens, this would
be a no-brainer. Right now, that's not the case.

2) There's a rather nice String class (C++). That's a perfect candidate for
porting directly to D.

3) From what I've seen, the lib is mostly C. Even better, it eschews the
traditional morass of header files. Building an ICU.d import will be much
easier because of this. The project on dsource.org might handle that part
without issue? Wrapping those library functions with D shells would be nice,
if only to take advantage of D arrays.

4) The transcoders deal with arrays of buffered data, so they're efficient.
ICU has transcoders and code-page tables up the wazoo.

For those who haven't looked through the lib, it's far more than just
Unicode transcoders (as Jill notes): you get sophisticated and flexible date
& number parsing/formatting; I18N message ordering; BiDi support;
collation/sorting support; text-break algorithms; a text layout engine;
unicode regex; and much more.

It's a first-class suite of libraries, and an awesome resource to leverage.



Aug 23 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgcv4n$2fsf$1 digitaldaemon.com...
 Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no
 special string class - a string is just an array of wchars. So obviously,
 if we go the ICU route, I withdraw my suggestion to ditch the wchar.
Ok, but suppose we ditch char[]. Then, we find some great library we want to
bring into D, or build a D interface to, that is in char[].
 Input is also requested from Walter - in particular the request that
 Object.toString() be re-jigged to return wchar[] instead of char[].
My experience with all-wchar is that its performance is not the best. It'll
also become a nuisance interfacing with C. I'd rather explore perhaps making
implicit conversions between the 3 utf types more seamless.
Aug 23 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 14:06:57 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgcv4n$2fsf$1 digitaldaemon.com...
 Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no
 special string class - a string is just an array of wchars. So obviously,
 if we go the ICU route, I withdraw my suggestion to ditch the wchar.
Ok, but suppose we ditch char[]. Then, we find some great library we want to
bring into D, or build a D interface to, that is in char[].
 Input is also requested from Walter - in particular the request that
 Object.toString() be re-jigged to return wchar[] instead of char[].
My experience with all-wchar is that its performance is not the best. It'll
also become a nuisance interfacing with C. I'd rather explore perhaps making
implicit conversions between the 3 utf types more seamless.
YAY! .. Sorry, I can't help myself - I think _this_ is the way to go. See my
post here: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/9494 for
my arguments.

I would add one additional argument: I can't imagine the suggested change
breaking existing code.. more likely it will fix existing bugs.

Regan.

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 23 2004
parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Regan Heath wrote:

 Ok, but suppose we ditch char[]. Then, we find some great library we want to
 bring into D, or build a D interface to, that is in char[].
Excuse me if I'm saying something stupid, but wouldn't byte[] do the job of
interfacing with C char[]?
Aug 23 2004
next sibling parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 01:18:09 +0200, Juanjo Álvarez 
<juanjuxNO SPAMyahoo.es> wrote:
 Regan Heath wrote:
I didn't write this! :)
 Ok, but suppose we ditch char[]. Then, we find some great library we want to
 bring into D, or build a D interface to, that is in char[].
 Excuse me if I'm saying something stupid but byte[] would not do the job of
 interfacing with C char[]?
I almost made that same comment in reply to Walter (who made the comment
above).

I think you're right: you could use byte[]. In fact it'd be more correct to use
byte[], as the C 'char' type is a byte with no specified encoding (whereas D's
char[] is UTF-8 encoded).

If we had no char[] you'd have to transcode the byte[] to dchar[]. This is why
I disagree with removing char[]; I think char[] has a place in D, I just want
to see implicit transcoding of char[] to wchar[] to dchar[].

Regan
Aug 23 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message
news:cgdu4b$2ed$1 digitaldaemon.com...
 Regan Heath wrote:

 Ok, but suppose we ditch char[]. Then, we find some great library we want to
 bring into D, or build a D interface to, that is in char[].

 Excuse me if I'm saying something stupid but byte[] would not do the job of
 interfacing with C char[]?
That could work, but it just wouldn't look right.
Aug 23 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 16:41:10 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message
 news:cgdu4b$2ed$1 digitaldaemon.com...
 Regan Heath wrote:

 Ok, but suppose we ditch char[]. Then, we find some great library we want to
 bring into D, or build a D interface to, that is in char[].

 Excuse me if I'm saying something stupid but byte[] would not do the job of
 interfacing with C char[]?
That could work, but it just wouldn't look right.
http://www.digitalmars.com/d/htomodule.html specifically states that C's 'char'
should be represented by a 'byte' in D. So when building an interface to a C
lib that uses char[], you'd use byte[].

Regan.
Aug 23 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc7lqdnp5a2sq9 digitalmars.com...
 http://www.digitalmars.com/d/htomodule.html

 Specifically states that C's 'char' should be represented by a 'byte' in D.
 So when building an interface to the C lib that uses char[] you'd use
 byte[].
I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
Aug 23 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 23:35:07 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsc7lqdnp5a2sq9 digitalmars.com...
 http://www.digitalmars.com/d/htomodule.html

 Specifically states that C's 'char' should be represented by a 'byte' in D.
 So when building an interface to the C lib that uses char[] you'd use
 byte[].
I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
However, an old C lib might return Latin-1 (or any other encoding) encoded
data, in which case you also have to use ubyte, then transcode to UTF-8 and
store in char[] (if that is the desired result). Right?

Regan
Aug 24 2004
parent "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc9o90n65a2sq9 digitalmars.com...
 On Mon, 23 Aug 2004 23:35:07 -0700, Walter <newshound digitalmars.com>
 wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsc7lqdnp5a2sq9 digitalmars.com...
 http://www.digitalmars.com/d/htomodule.html

 Specifically states that C's 'char' should be represented by a 'byte' in D.
 So when building an interface to the C lib that uses char[] you'd use
 byte[].
I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed
char' in C are used not as text, but as very small integers, the corresponding
D types should be ubyte and byte.
However, an old C lib might return latin-1 (or any other encoding) encoded data, in which case you also have to use ubyte then transcode to utf-8 and store in char[] (if that is the desired result). Right?
Yup. You'll have to understand what the C code is using the char type for, in order to select the best equivalent D type.
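The transcoding step described here can be sketched concretely (a hypothetical helper, not part of Phobos or ICU): every Latin-1 byte value is its own Unicode code point, so conversion to UTF-8 is mechanical - bytes below 0x80 copy through, and bytes from 0x80 upward become exactly one two-byte sequence.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: transcode Latin-1 bytes (as returned by an old C
// library) into UTF-8. A Latin-1 byte IS its Unicode code point, so bytes
// < 0x80 copy through and bytes >= 0x80 become a two-byte sequence.
std::string latin1ToUtf8(const std::vector<uint8_t>& in) {
    std::string out;
    for (uint8_t b : in) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));    // leading byte 110xxxxx
            out += static_cast<char>(0x80 | (b & 0x3F));  // continuation 10xxxxxx
        }
    }
    return out;
}
```

Going the other way (or handling Shift-JIS, etc.) is what makes a general transcoding layer like ICU's worth having - each legacy charset needs its own table or algorithm.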
Aug 25 2004
prev sibling next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...

My experience with all-wchar is that its performance is not the best. It'll
also become a nuisance interfacing with C. I'd rather explore perhaps making
implict conversions between the 3 utf types more seamless.
Implicit conversions are absolutely fine by me. In fact, that was the first
suggestion (then the discussion wandered, as things do, along the lines of "if
they're interchangeable, why not have just the one type").

But sure, I'd be more than happy with implicit conversions between the three
UTF types. Since such conversions lose no information in any direction, they
are always guaranteed to be harmless.

Jill
Aug 24 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...

My experience with all-wchar is that its performance is not the best.
It's usually regarded as the best by most other sources, however. Converting
between wchar[] and dchar[] is /almost/ as fast as doing a memcpy(), because
UTF-16 encoding is very, very simple. UTF-16 is as efficient for the codepoint
range U+0000 to U+FFFF as UTF-8 is for the codepoint range U+0000 to U+007F.

Outside of these ranges, UTF-16 is still very fast, since all remaining
characters consist of /precisely/ two wchars, whereas UTF-8 conversion outside
of the ASCII range is always going to be inefficient, what with its variable
number of required bytes, variable-width bitmasks, and the additional
requirements of validation and rejection of non-shortest sequences. If you're
arguing on the basis of performance, UTF-8 loses hands down.
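The arithmetic behind this claim is easy to show (a minimal sketch with a hypothetical helper, not ICU code): encoding a code point to UTF-16 is a couple of shifts and masks, and any character above U+FFFF is exactly one surrogate pair.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units, illustrating the
// point above: code points up to U+FFFF are a single 16-bit unit, and
// everything above is precisely one surrogate pair - no variable-length
// scanning as in UTF-8.
std::vector<uint16_t> toUtf16(uint32_t cp) {
    if (cp <= 0xFFFF) {
        return { static_cast<uint16_t>(cp) };             // BMP: one code unit
    }
    cp -= 0x10000;                                        // 20 bits remain
    return {
        static_cast<uint16_t>(0xD800 | (cp >> 10)),       // high surrogate
        static_cast<uint16_t>(0xDC00 | (cp & 0x3FF)),     // low surrogate
    };
}
```

For example, U+1D11E (the musical G clef) encodes as the pair D834 DD1E.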
It'll
also become a nuisance interfacing with C.
Actually, it's D's char (which C doesn't have) which is a nuisance interfacing
with C. As others have pointed out, C has no type which enforces UTF-8
encoding, and in fact on the Windows PC on which I am typing right now, every C
char is going to be storing characters from Windows code page 1252 unless I
take special action to do otherwise. That is /not/ interchangeable with D's
chars. Beyond U+007F, one C char corresponds to two (or more) D chars. You
don't regard that as a nuisance?

In fact, I believe that comment of yours which I just quoted above actually
adds further weight to my argument. I argue that the existence of the char type
*causes confusion*. People /think/ (erroneously) that it does the same job as
C's char, and is interchangeable therewith. Now if you, the architect of D, can
fall prey to that confusion, I would take that as clear evidence that such
confusion exists.
I'd rather explore perhaps making
implict conversions between the 3 utf types more seamless.
Yes. If all three D string types were implicitly convertible, then there would
be nothing for me to complain about. (The char confusion would still exist, but
that's just education)

Arcane Jill
Aug 24 2004
next sibling parent reply Julio César Carrascal Urquijo writes:
Arcane Jill wrote:
 (...)
 adds further weight to my argument. I argue that the existence of the char type
 *causes confusion*. People /think/ (erroneously) that it does the same job as
 C's char, and is interchangable therewith. Now if you, the architect of D, can
 fall prey to that confusion, I would take that as clear evidence that such
 confusion exists.
 (...)
I'm sorry to interrupt. From your reply I would say that you are arguing more
about the name of the type being used as the *representation of a character*
(which it is not).

Maybe we should ask Walter to change the names of the types to utf8, utf16 and
utf32. I read somewhere that those were the original names in earlier DMD
implementations.

If that's what you are asking, you have my vote.

--
Julio César Carrascal Urquijo
Aug 24 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgfmdm$v3t$1 digitaldaemon.com>, 

Maybe we should ask Walter to change the names of the types to utf8, 
utf16 and utf32. I read somewhere that those where the original names in 
earlier DMD implementations.

If that's what you are asking, you have my vote.
No, I wasn't asking that, and I don't really care what things are called. I do
/try/ to stay on the topic of the thread title, and in this thread the
discussion is about how/whether to make use of ICU for our internationalization
and Unicode needs.

I've been looking more closely at ICU, and I keep being (pleasantly) surprised
to discover that it already has zillions of other goodies not previously
mentioned in this thread, but which we've been talking about in this forum in
the past. For example - Locales, ResourceBundles, everything you need for text
internationalization/localization.

It's relevant to D's character types because ICU has only two character types -
a type equivalent to wchar that is used to make UTF-16 strings, and a type
equivalent to dchar that is used to access character properties. The important
detail here is that ICU strings are wchar[]s, but D's basic "string" concept is
char[]. So, calling lots of ICU routines would result in lots of explicit
toUTF8() and toUTF16() calls all over your code, /unless/ either:

(1) D adopted wchar[] as the basic string type, or
(2) D implicitly auto-converted between its various string types as required

I'm trying to suggest that (1) is the best option. That's all. Walter prefers
(2), but that's acceptable too.

As a corollary to (1), if we start using wchar[]s as the default native string
used by Phobos and the compiler, it would then follow that the char type would
be superfluous and could be dropped. Or at least, it seems that way to me.
Opinions differ. However, the "should we ditch the char or not?" discussion is
over on another thread.

Renaming the character types is kind of irrelevant to this, although it is
pertinent to Walter's reply. No - I'm not asking that they be renamed (except
insofar as, if "char" is ditched, then "wchar" could be renamed "char", but
that again is for the other thread).

Arcane Jill
Aug 24 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgf8d1$os0$1 digitaldaemon.com...
 In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...

My experience with all-wchar is that its performance is not the best.
 It's usually regarded as the best by most other sources, however. Converting
 between wchar[] and dchar[] is /almost/ as fast as doing a memcpy(), because
 UTF-16 encoding is very, very simple. UTF-16 is as efficient for the codepoint
 range U+0000 to U+FFFF as UTF-8 is for the codepoint range U+0000 to U+007F.
 Outside of these ranges, UTF-16 is still very fast, since all remaining
 characters consist of /precisely/ two wchars, whereas UTF-8 conversion outside
 of the ASCII range is always going to be inefficient, what with its variable
 number of required bytes, variable-width bitmasks, and the additional
 requirements of validation and rejection of non-shortest sequences. If you're
 arguing on the basis of performance, UTF-8 loses hands down.
Converting would be faster, sure, but if the bulk of your app is char[], there is little conversion happening.
It'll
also become a nuisance interfacing with C.
 Actually, it's D's char (which C doesn't have) which is a nuisance interfacing
 with C. As others have pointed out, C has no type which enforces UTF-8
 encoding, and in fact on the Windows PC on which I am typing right now, every
 C char is going to be storing characters from Windows code page 1252 unless I
 take special action to do otherwise. That is /not/ interchangeable with D's
 chars. Beyond U+007F, one C char corresponds to two (or more) D chars. You
 don't regard that as a nuisance?
I've been dealing with multibyte charsets in C for decades - it's not just UTF-8 that's multibyte, there are also the Shift-JIS, Korean, and Taiwan code pages. You can also set up your windows machine so UTF-8 *is* the charset used by the "A" APIs. I've written UTF-8 apps in C, and D would map onto them directly. There is no way to avoid, when interfacing with C, dealing with whatever charset it might be in. It can't happen automatically. And that is the source of the nuisance.
In fact, I believe that comment of yours which I just quoted above actually adds further weight to my argument. I argue that the existence of the char type *causes confusion*. People /think/ (erroneously) that it does the same job as C's char, and is interchangeable therewith. Now if you, the architect of D, can fall prey to that confusion, I would take that as clear evidence that such confusion exists.

I'd rather explore perhaps making implicit conversions between the 3 utf types more seamless.
Yes. If all three D string types were implicitly convertible, then there would be nothing for me to complain about. (The char confusion would still exist, but that's just education.)
Aug 24 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgcv4n$2fsf$1 digitaldaemon.com...
 Following Ben's mention of ICU, I've been checking out what it does and doesn't do. Basically, it does EVERYTHING. There is the work of years there. Not just Unicode stuff, but a whole swathe of classes for internationalization and transcoding. It would take me a very long time to duplicate that. It's also free, open source, and with a license which basically says there's no problem with our using it.

 So I'm thinking seriously about ditching the whole etc.unicode project and replacing it with a wrapper around ICU.

 It's not completely straightforward. ICU is written in C++ (not C), and so we can't link against it directly. It uses classes, not raw functions. So, I'd have to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have to write a D wrapper to call the C wrapper - at which point we could get the classes back again (and our own choice of architecture, so plugging into std or mango streams won't suffer).

 But the outermost (D) wrapper can, at least, be composed of D classes.

 If we want D to be the language of choice for Unicode, we would need all this functionality. So, if we went the ICU route, we'd need to bundle ICU (currently a ten megabyte zip file) with DMD, along with whatever wrapper I come up with. (etc.unicode is not likely to be smaller).

 I'd like to see some discussion on this. Read this page to inform yourself:
 http://oss.software.ibm.com/icu/userguide/index.html
It sounds pretty cool. It being a very large library, is only what you use linked in? Or does using anything tend to pull in the whole shebang? I hope it's the former!
Aug 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cghofr$202k$3 digitaldaemon.com>, Walter says...

It sounds pretty cool. It being a very large library, is only what you use linked in? Or does using anything tend to pull in the whole shebang? I hope it's the former!
No idea (yet). I worried about that myself, but it's a library, not an indivisible object file, so it must have /some/ granularity.

But the functionality of ICU goes way beyond what I was even planning to achieve - there's even stuff for font rendering in there, for feck's sake! (Unicode allows you to put any accent over any glyph, ligate any two glyphs into one, etc. Font rendering engines have a fair bit of work to do.) So ICU is hard to beat. Therefore, I'm inclining to the view that we'd be better off trying to plumb ICU into D than to reject it because it fails to meet any one single D requirement.

So, if it won't split into sufficiently small chunks on linking, I'd say that was as good an argument as any for improving D's DLL support. ICU would clearly be an excellent candidate for a DLL (assuming we can have classes in DLLs).

Jill
Aug 25 2004
parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cghrn3$21dg$1 digitaldaemon.com...
 In article <cghofr$202k$3 digitaldaemon.com>, Walter says...

It sounds pretty cool. It being a very large library, is only what you use linked in? Or does using anything tend to pull in the whole shebang? I hope it's the former!
No idea (yet). I worried about that myself, but it's a library, not an indivisible object file, so it must have /some/ granularity.

 But the functionality of ICU goes way beyond what I was even planning to achieve - there's even stuff for font rendering in there, for feck's sake! (Unicode allows you to put any accent over any glyph, ligate any two glyphs into one, etc. Font rendering engines have a fair bit of work to do.) So ICU is hard to beat. Therefore, I'm inclining to the view that we'd be better off trying to plumb ICU into D than to reject it because it fails to meet any one single D requirement.

 So, if it won't split into sufficiently small chunks on linking, I'd say that was as good an argument as any for improving D's DLL support. ICU would clearly be an excellent candidate for a DLL (assuming we can have classes in DLLs).

The ICU has a C API. You want to port the C++ API to D, so why not compile a DLL with the C API, and a lib with the ported D API? This would also be good for minimum disturbance between the ICU source and the D API.

Roald
Aug 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgi7f6$25qf$1 digitaldaemon.com>, Roald Ribe says...

The ICU has a C API, you want to port the C++ API to D, why not compile a DLL with the C API, and a lib with the ported D API? This would also be good for minimum disturbance between the ICU source and the D API.
This is possibly the best idea yet, but I'm going to have to study the ICU source before I know how feasible that is.

Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment operator. This isn't possible with D.

The above two points mean that the D API is probably going to look more like the Java API than the C++ API. Exactly how much of it can be done via a wrapper around C I'm not sure (yet). But yes - if the C-based core can be put into a DLL, and a D API built which calls it, I guess that would work a treat.

Arcane Jill
Aug 25 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D
programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment
operator. This isn't possible with D.
What are those two operators used for? Could the functionality be worked around somehow?

Sean
Aug 25 2004
next sibling parent pragma <pragma_member pathlink.com> writes:
In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D
programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment
operator. This isn't possible with D.
What are those two operators used for? Could the functionality be worked around somehow?
I'm sure it can be worked around, as there is a Java interface to ICU already available. ( http://oss.software.ibm.com/icu4j/download/ ) ;)

This table ( http://oss.software.ibm.com/icu4j/comparison/index.html ) is about as close an indicator as we're going to get for what D would need to do to compete with what's already out there.

-Pragma
[[ EricAnderton at (code it, and they will come) yahoo.com ]]
Aug 25 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D
programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment
operator. This isn't possible with D.
What are those two operators used for?
I don't know. I'm still looking into it all. Fortunately, I don't think it matters.
Could the functionality be worked around somehow?
Obviously yes, since there is a Java API.
Aug 25 2004
parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgihb5$2aub$1 digitaldaemon.com...
 In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment operator. This isn't possible with D.
What are those two operators used for?
I don't know. I'm still looking into it all. Fortunately, I don't think it matters.
Could the functionality be worked around somehow?
Obviously yes, since there is a Java API.
The Java API would likely be the best starting point. IBM is no slouch in its support of Java, and the Java interface will have solved the issues with interfacing to a GC language.
Aug 25 2004