
D - Why is there no String?

reply Helmut Leitner <leitner hls.via.at> writes:
I tried something like:

   alias char [] string;
  
   int main(string [] args) 
   {
       printf("%s\n",(char *)args[0]);
       return 0;
   }

hey, and it worked!

But only until I added

   import string;

   :-(

Ok, no problem. But the question is: why not get rid of this "a string is a
char array" way of thinking? Even though it is true underneath, there is no
need to reflect it in all interfaces and in the daily work.

E. g. the call
    char [][] files = DirFindFile("c:\\dmd", "*.d", DFF_FILES|DFF_RECURSIVE);
I just posted in a separate thread would look more beautiful
written as
    String [] files = DirFindFile("c:\\dmd", "*.d", DFF_FILES|DFF_RECURSIVE);
wouldn't it?

-- 
Helmut Leitner    leitner hls.via.at
Graz, Austria   www.hls-software.com
Apr 04 2003
next sibling parent "Luna Kid" <lunakid neuropolis.org> writes:
Huhh, now this is going to be a long thread... :)

I think I completely agree with Helmut, and with Matthew W.,
and Mark E. and probably many others. And I let them talk
-- they can do it better.

So the battle has begun... Good luck, folks! :)

Cheers,
Sab


Apr 04 2003
prev sibling parent reply Jonathan Andrew <Jonathan_member pathlink.com> writes:
Hello,

I think that an array of chars was probably appropriate, but now that Unicode is
being considered for the language, I think a primitive string type might be
necessary: an array of unevenly sized chars would be very awkward when
talking about indexing (i.e. is mystr[6] talking about the 6th byte, or the 6th
character?) and when declaring new strings ("char [40] mystr;" -- is this 40
bytes, or 40 characters long?). Basically, stuff that has already been talked
about in here.

A dedicated string type might resolve some of this ambiguity by providing, for
example, both a .length property (characters) and a .size property (bytes) --
stuff that is important for strings, but not really appropriate for other array
types. I don't really care too much either way, and if we are stuck with good
old ASCII, it really doesn't matter. But if Unicode is put in, then some
mechanism should be put in place to take care of these issues, whether it's a
string type or not. And yes, this is probably going to be the start of a long,
painful thread. =)
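A quick illustration of the byte-vs-character ambiguity, sketched in Python only because it exposes both views side by side (D itself is the language under discussion):

```python
# The same text, viewed as characters vs as raw UTF-8 bytes.
text = "naïve"                 # 5 characters
raw = text.encode("utf-8")     # "ï" needs 2 bytes in UTF-8

print(len(text))   # 5 -- character count
print(len(raw))    # 6 -- byte count

# Indexing diverges too: text[2] is the character 'ï',
# while raw[2] is only the first byte of its 2-byte encoding.
print(text[2])     # ï
print(raw[2])      # 195 (0xC3, the UTF-8 lead byte of 'ï')
```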

-Jon


Apr 04 2003
next sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
Not long; not painful. Simply agree.



Apr 04 2003
prev sibling next sibling parent reply Benji Smith <Benji_member pathlink.com> writes:
Would it be appropriate to think of a String class within a template? You could
declare your string to be UTF-8 or UTF-16 or EBCDIC or whatever simply by
instantiating the String template.

I know that this isn't what templates are technically designed to do, but it
seems like a good idea to me.
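A minimal sketch of the idea in Python (the String class, its encoding parameter, and the .length/.size properties are hypothetical illustrations of the proposal, not an existing API):

```python
# One String abstraction parameterized by encoding, exposing length in
# characters and size in bytes -- the encoding plays the "template parameter".
class String:
    def __init__(self, text, encoding="utf-8"):
        self.encoding = encoding
        self._bytes = text.encode(encoding)

    @property
    def length(self):          # length in characters
        return len(self._bytes.decode(self.encoding))

    @property
    def size(self):            # storage size in bytes
        return len(self._bytes)

s8 = String("naïve", "utf-8")
s16 = String("naïve", "utf-16-be")
print(s8.length, s8.size)      # 5 6
print(s16.length, s16.size)    # 5 10
```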

--Benji


Apr 04 2003
next sibling parent reply "Luna Kid" <lunakid neuropolis.org> writes:
Sounds like quite a good idea to me.

Well, but first, it would be good to have that
generic string "something". ;)

Currently, calling an array of bytes a string
is like calling an Otto engine a car... (It's
not even a class inheritance thing.)

Cheers,
Lunar Sab



Apr 04 2003
parent reply Helmut Leitner <leitner hls.via.at> writes:
Luna Kid wrote:
 Currently, calling an array of bytes a string
 is like calling an Otto engine a car... (It's
 not even a class inheritance thing.)

Sorry, I can't agree.

In programming we deal with real world objects and give them names (like File,
String or System). These names may be thought to represent virtual objects,
which may be handled
 - by handles (like MS Windows HWND, file handles)
 - by names (like a customer in a database, file delete)
 - by OO object references
 - implicitly without a handle ( SystemRestart(); )
 - ...

I think the OO feeling is wrong that only a class is a good abstraction and
that anything of importance must become a class. This feeling developed because
OO is overhyped to a point where the hype effectively reduces the quality of
resulting code.

There can be no better API functions than
    FileDelete("test.txt");
    String [] = UrlGetStringArray("http://walterbright.com");
because they express what the programmer wants to do in a single "sentence"
*and* do it. Notice that there is not a single formal object involved.

So IMHO, whether something is a formal object or not should be considered an
implementation detail.

If Unicode strings are a topic (I know little about them; I always thought we
would just switch to 16-bit characters sometime in the future), then someone
could create a class to wrap their complexities. I would vote for "class
Struni" and suggest that a team forms to prototype it to the point where
Walter can just take it and make it part of Phobos.

-- 
Helmut Leitner    leitner hls.via.at
Graz, Austria   www.hls-software.com
Apr 05 2003
parent reply "J. Daniel Smith" <J_Daniel_Smith HoTMaiL.com> writes:
For better or worse, UNICODE strings are no longer as simple as "16-bit
characters".  That's just one example of why this is a rather thorny issue.

   Dan


Apr 05 2003
parent Helmut Leitner <helmut.leitner chello.at> writes:
"J. Daniel Smith" wrote:
 
 For better or worse, UNICODE strings are no longer as simple as "16-bit
 characters".  That's just one example of why this is a rather thorny issue.

I understand that, but the existence of certain encodings - e. g. in a web
page - doesn't immediately mean to me that there is a need for a native type
or a class handling these encodings (maybe having transforming functions is
enough). I suppose this comes from my lack of knowledge in this area.

I haven't yet had time to read the existing older threads about this topic.
Is there a summary of all the issues somewhere?

-- 
Helmut Leitner    leitner hls.via.at
Graz, Austria   www.hls-software.com
Apr 06 2003
prev sibling parent Jonathan Andrew <Jonathan_member pathlink.com> writes:
Howdy,

I think that the templated way of doing strings probably adds more
abstraction and complication to the string concept, which at this point should
be a primitive type in the language, whether it is a completely new type or
still an array of chars with special properties. That's just my opinion though,
and I confess to being afraid of templates, coming mostly from a C/Java
background.

-Jon



Apr 04 2003
prev sibling parent reply Ilya Minkov <midiclub 8ung.at> writes:
Hello.

While char[] is a good and native thing for working with the console, simple 
textfiles, and such, it is not a solution for applications processing 
any data subject to internationalisation. And that is almost every piece of 
text currently out there.

However, I keep thinking that one dedicated string class is not at all 
enough. I propose - not one - not two - but *at least* three of them.

First and the basic one -
--- a String type - should be an array of 4-byte characters. It is used 
inside functions for processing strings. With modern processors, 
handling 4-byte values may be cheaper than 2-byte and not much costlier 
than 1-byte. As to space considerations - forget them, this type is 
for local chewing only. If you want to keep this string in memory or 
some database, consider the second one -
--- a CompactString type - should consist of 2 arrays, the first one for raw 
characters, the second one for mode changes. The second one is the key. It 
should store a list of events like "at character x change to codepage y, 
encoding z" or "at character x make an exceptional 4-byte value", which 
could be swizzled into a few bytes each. It should also be quite fast to 
handle, since unlike UTF-7/8/16, the raw string need not be scanned to 
determine its length; this can be done by scanning the mode changes, which 
have to be an order of magnitude or two shorter. And it can adapt itself 
to whatever takes the least space - 8-bit with an explicit codepage for 
e.g. European and Russian, 16-bit for Japanese kanji and suchlike, or 
even 32-bit in the rare case you mix all languages evenly. But this type 
would not be directly standards-compliant. There should obviously also be -
--- another type which corresponds to the underlying system's preferred 
encoding.

A set of functions also has to be provided to convert any of these types 
to and from any of the other standard Unicode types. As to templates - I 
don't think much of them for these purposes. There is a limited number of 
types - you don't want to create a string of floats, do you? And 
besides, their handling differs in some ways. But making them into 
classes could give further flexibility, at the price of an (8-byte IIRC) 
space overhead per instance.
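A rough sketch of the CompactString idea, in Python as a stand-in (all names and the run representation are illustrative; it assumes each "mode" run is fixed-width, so the character count falls out of the short run list without scanning the raw bytes):

```python
# Two-array CompactString sketch: raw bytes plus a short list of mode
# changes. Each run records (byte_start, bytes_per_char, encoding); within
# a run the width is fixed, so the total character count is computed from
# the run list alone -- no scan of the raw character array.
class CompactString:
    def __init__(self):
        self.raw = bytearray()
        self.runs = []  # (byte_start, bytes_per_char, encoding)

    def append(self, text, encoding, bytes_per_char):
        self.runs.append((len(self.raw), bytes_per_char, encoding))
        self.raw += text.encode(encoding)

    def __len__(self):
        total = 0
        for i, (start, width, _) in enumerate(self.runs):
            end = self.runs[i + 1][0] if i + 1 < len(self.runs) else len(self.raw)
            total += (end - start) // width
        return total

s = CompactString()
s.append("grüß", "latin-1", 1)      # 4 chars, 1 byte each (8-bit codepage)
s.append("日本", "utf-32-be", 4)     # 2 chars, 4 bytes each
print(len(s), len(s.raw))           # 6 characters stored in 12 bytes
```

Real mode events would also need the "exceptional 4-byte value" case and compact swizzling; this only shows the length-without-scanning property.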

-i.

PS. I've been away for a short while... 300 new messages! I wonder how 
Walter manages to read them AND maintain two complex compilers!



Apr 07 2003
parent reply "Achilleas Margaritis" <axilmar in.gr> writes:
Why complicate our lives? D should use 16-bit Unicode and provide implicit
conversions in any I/O, according to the environment. These conversions should
be transparent to the user. 65536 characters are enough to represent most
Earth languages...
Apr 15 2003
next sibling parent Ilya Minkov <midiclub tiscali.de> writes:
Achilleas Margaritis wrote:
 
 Why complicate our lives ? D should use 16-bit unicode and provide implicit
 conversions in any I/O, according to environment. These conversions should
 be transparent to the user. 65536 characters are enough to represent most
 Earth languages...
 

It doesn't yet make life easy. IIRC you only have less than 1/4 of this set
available. Besides, how do you treat separate accents, if they cannot be
combined into the letters? This is really rare though. 32-bit is better for
speed.

And besides, you have already let everyone down for years with that endless
"we don't care, since *our* language fits in less than seven bits. Now the
rest of the world may share the leftovers, if they like." Now, aren't we
letting anyone down with 16-bit?

Besides, the second type I proposed would usually use 1 byte for European,
Cyrillic, Arabic, Hebrew, Greek and such, unlike UTF-8 and UTF-16 which *both*
require 2, so it's better for storage!

-i.
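The separate-accents point is real: the same visible letter can be one precomposed code point or a base letter plus a combining mark. A small Python illustration:

```python
import unicodedata

decomposed = "e\u0301"     # 'e' + COMBINING ACUTE ACCENT: 2 code points
precomposed = "\u00e9"     # 'é' as a single precomposed code point

print(len(decomposed), len(precomposed))   # 2 1
# NFC normalization folds the accent into the letter where a precomposed
# form exists; some accent combinations have no precomposed form at all.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```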
Apr 16 2003
prev sibling parent "J. Daniel Smith" <J_Daniel_Smith HoTMaiL.com> writes:
As UNICODE 3.0 shows, 16 bits is not enough.  It's looking like around 21
bits are needed, a value which of course isn't very practical on today's
computers.
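The limit can be checked directly: the Unicode codespace tops out at U+10FFFF, so 21 bits cover it, while any character beyond U+FFFF already overflows a single 16-bit unit and costs UTF-16 a surrogate pair. A quick Python check:

```python
clef = "\U0001D11E"            # MUSICAL SYMBOL G CLEF, outside the BMP
print(hex(ord(clef)))          # 0x1d11e -- does not fit in 16 bits
print(ord(clef) > 0xFFFF)      # True
print(len(clef.encode("utf-16-be")))   # 4 -- a surrogate pair, two 16-bit units
print((0x10FFFF).bit_length())         # 21 -- bits for the full codespace
```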

   Dan

"Achilleas Margaritis" <axilmar in.gr> wrote in message
news:b7h8tm$t5j$1 digitaldaemon.com...
 [...]
 be transparent to the user. 65536 characters are enough to represent most
 Earth languages...

Apr 16 2003