
D - Why is there no String?

reply Helmut Leitner <leitner hls.via.at> writes:
I tried something like:

   alias char [] string;
  
   int main(string [] args) 
   {
       printf("%s\n",(char *)args[0]);
       return 0;
   }

hey, and it worked!

But only until I added

   import string;

   :-(

Ok, no problem. But the question is: why not get rid of this "a string is a
char array" way of thinking? Even though it is true underneath, there is no
need to reflect it in all interfaces and in the daily work.

E. g. the call
    char [][] files = DirFindFile("c:\\dmd", "*.d", DFF_FILES|DFF_RECURSIVE);
I just posted in a separate thread would look more beautiful
written as
    String [] files = DirFindFile("c:\\dmd", "*.d", DFF_FILES|DFF_RECURSIVE);
wouldn't it?

-- 
Helmut Leitner    leitner hls.via.at
Graz, Austria   www.hls-software.com
Apr 04 2003
next sibling parent "Luna Kid" <lunakid neuropolis.org> writes:
Huhh, now this is going to be a long thread... :)

I think I completely agree with Helmut, and with Matthew W.,
and Mark E. and probably many others. And I let them talk
-- they can do it better.

So the battle has begun... Good luck, folks! :)

Cheers,
Sab


Apr 04 2003
prev sibling parent reply Jonathan Andrew <Jonathan_member pathlink.com> writes:
Hello,

I think that an array of chars was probably appropriate, but now that Unicode is
being considered for the language, I think a primitive string type might be
necessary: an array of unevenly sized chars would be very awkward when
talking about indexing (i.e. is mystr[6] talking about the 6th byte, or the 6th
character?) and when declaring new strings ("char [40] mystr;" -- is this 40
bytes, or 40 characters long?). Basically, stuff that has already been talked
about in here.

A dedicated string type might resolve some of this ambiguity by providing, for
example, both a .length property (characters) and a .size property (bytes) --
stuff that is important for strings, but not really appropriate for other array
types. I don't really care too much either way, and if we are stuck with good
old ASCII, it really doesn't matter. But if Unicode is put in, then some
mechanism should be put in place to take care of these issues, whether it's a
string type or not. And yes, this is probably going to be the start of a long,
painful thread. =)
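A quick illustration of the byte-vs-character ambiguity, sketched in Python only because it exposes both views side by side (D itself is the language under discussion):

```python
# The same text, viewed as characters vs as raw UTF-8 bytes.
text = "naïve"                 # 5 characters
raw = text.encode("utf-8")     # "ï" needs 2 bytes in UTF-8

print(len(text))   # 5 -- character count
print(len(raw))    # 6 -- byte count

# Indexing diverges too: text[2] is the character 'ï',
# while raw[2] is only the first byte of its 2-byte encoding.
print(text[2])     # ï
print(raw[2])      # 195 (0xC3, the UTF-8 lead byte of 'ï')
```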

-Jon


Apr 04 2003
next sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
Not long; not painful. Simply agree.



Apr 04 2003
prev sibling next sibling parent reply Benji Smith <Benji_member pathlink.com> writes:
Would it be appropriate to think of a String class within a template? You could
declare your string to be UTF-8 or UTF-16 or EBCDIC or whatever simply by
instantiating the String template.

I know that this isn't what templates are technically designed to do, but it
seems like a good idea to me.
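A minimal sketch of the idea in Python (the String class, its encoding parameter, and the .length/.size properties are hypothetical illustrations of the proposal, not an existing API):

```python
# One String abstraction parameterized by encoding, exposing length in
# characters and size in bytes -- the encoding plays the "template parameter".
class String:
    def __init__(self, text, encoding="utf-8"):
        self.encoding = encoding
        self._bytes = text.encode(encoding)

    @property
    def length(self):          # length in characters
        return len(self._bytes.decode(self.encoding))

    @property
    def size(self):            # storage size in bytes
        return len(self._bytes)

s8 = String("naïve", "utf-8")
s16 = String("naïve", "utf-16-be")
print(s8.length, s8.size)      # 5 6
print(s16.length, s16.size)    # 5 10
```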

--Benji


Apr 04 2003
next sibling parent reply "Luna Kid" <lunakid neuropolis.org> writes:
Sounds like quite a good idea to me.

Well, but first, it would be good to have that
generic string "something". ;)

Currently, calling an array of bytes a string
is like calling an Otto engine a car... (It's
not even a class inheritance thing.)

Cheers,
Lunar Sab



Apr 04 2003
parent reply Helmut Leitner <leitner hls.via.at> writes:
Luna Kid wrote:
 Currently, calling an array of bytes a string
 is like calling an Otto engine a car... (It's
 not even a class inheritance thing.)

Sorry, I can't agree.

In programming we deal with real world objects and give them names (like File,
String or System). These names may be thought to represent virtual objects,
which may be handled
 - by handles (like MS Windows HWND, file handles)
 - by names (like a customer in a database, file delete)
 - by OO object references
 - implicitly without a handle ( SystemRestart(); )
 - ...

I think the OO feeling is wrong that only a class is a good abstraction and
that anything of importance must become a class. This feeling developed because
OO is overhyped to a point where the hype effectively reduces the quality of
resulting code.

There can be no better API functions than
    FileDelete("test.txt");
    String [] = UrlGetStringArray("http://walterbright.com");
because they express what the programmer wants to do in a single "sentence"
*and* do it. Notice that there is not a single formal object involved.

So IMHO, whether something is a formal object or not should be considered an
implementation detail.

If Unicode strings are a topic (I know little about them; I always thought we
would just switch to 16-bit characters sometime in the future), then someone
could create a class to wrap their complexities. I would vote for "class
Struni" and suggest that a team forms to prototype it to the point where
Walter can just take it and make it part of Phobos.

-- 
Helmut Leitner    leitner hls.via.at
Graz, Austria   www.hls-software.com
Apr 05 2003
parent reply "J. Daniel Smith" <J_Daniel_Smith HoTMaiL.com> writes:
For better or worse, UNICODE strings are no longer as simple as "16-bit
characters".  That's just one example of why this is a rather thorny issue.

   Dan


Apr 05 2003
parent Helmut Leitner <helmut.leitner chello.at> writes:
"J. Daniel Smith" wrote:
 
 For better or worse, UNICODE strings are no longer as simple as "16-bit
 characters".  That's just one example of why this is a rather thorny issue.

I understand that, but the existence of certain encodings - e. g. in a web
page - doesn't immediately mean to me that there is a need for a native type
or a class handling these encodings (maybe having transforming functions is
enough). I suppose this comes from my lack of knowledge in this area.

I haven't yet had time to read the existing older threads about this topic.
Is there a summary of all the issues somewhere?

-- 
Helmut Leitner    leitner hls.via.at
Graz, Austria   www.hls-software.com
Apr 06 2003
prev sibling parent Jonathan Andrew <Jonathan_member pathlink.com> writes:
Howdy,

I think that the templated way of doing strings probably adds more
abstraction and complication to the string concept, which at this point should
be a primitive type in the language, whether it is a completely new type or
still an array of chars with special properties. That's just my opinion though,
and I confess to being afraid of templates, coming mostly from a C/Java
background.

-Jon



Apr 04 2003
prev sibling parent reply Ilya Minkov <midiclub 8ung.at> writes:
Hello.

While char[] is a good and native thing for working with the console, simple 
textfiles, and such, it is not a solution for applications processing 
any data subject to internationalisation. And that is almost every piece of 
text currently out there.

However, I keep thinking that one dedicated string class is not at all 
enough. I propose - not one - not two - but *at least* three of them.

First and the basic one -
--- a String type - should be an array of 4-byte characters. It is used 
inside functions for processing strings. With modern processors, 
handling 4-byte values may be cheaper than 2-byte and not much costlier 
than 1-byte. As to space considerations - forget them, this type is 
for local chewing only. If you want to keep this string in memory or 
some database, consider the second one -
--- a CompactString type - should consist of 2 arrays, the first one for raw 
characters, the second one for mode changes. The second one is the key. It 
should store a list of events like "at character x change to codepage y, 
encoding z" or "at character x make an exceptional 4-byte value", which 
could be swizzled into a few bytes each. It should also be quite fast to 
handle, since unlike UTF-7/8/16, the raw string need not be scanned to 
determine its length; this can be done by scanning the mode changes, which 
have to be an order of magnitude or two shorter. And it can adapt itself 
to whatever takes the least space - 8-bit with an explicit codepage for 
e.g. European and Russian, 16-bit for Japanese kanji and suchlike, or 
even 32-bit in the rare case you mix all languages evenly. But this type 
would not be directly standards-compliant. There should obviously also be -
--- another type which corresponds to the underlying system's preferred 
encoding.

A set of functions also has to be provided to convert any of these types 
to and from any of the other standard Unicode types. As to templates - I 
don't think much of them for these purposes. There is a limited number of 
types - you don't want to create a string of floats, do you? And 
besides, their handling differs in some ways. But making them into 
classes could give further flexibility, at the price of an (8-byte IIRC) 
space overhead per instance.
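A rough sketch of the CompactString idea, in Python as a stand-in (all names and the run representation are illustrative; it assumes each "mode" run is fixed-width, so the character count falls out of the short run list without scanning the raw bytes):

```python
# Two-array CompactString sketch: raw bytes plus a short list of mode
# changes. Each run records (byte_start, bytes_per_char, encoding); within
# a run the width is fixed, so the total character count is computed from
# the run list alone -- no scan of the raw character array.
class CompactString:
    def __init__(self):
        self.raw = bytearray()
        self.runs = []  # (byte_start, bytes_per_char, encoding)

    def append(self, text, encoding, bytes_per_char):
        self.runs.append((len(self.raw), bytes_per_char, encoding))
        self.raw += text.encode(encoding)

    def __len__(self):
        total = 0
        for i, (start, width, _) in enumerate(self.runs):
            end = self.runs[i + 1][0] if i + 1 < len(self.runs) else len(self.raw)
            total += (end - start) // width
        return total

s = CompactString()
s.append("grüß", "latin-1", 1)      # 4 chars, 1 byte each (8-bit codepage)
s.append("日本", "utf-32-be", 4)     # 2 chars, 4 bytes each
print(len(s), len(s.raw))           # 6 characters stored in 12 bytes
```

Real mode events would also need the "exceptional 4-byte value" case and compact swizzling; this only shows the length-without-scanning property.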

-i.

PS. I've been away for a short while... 300 new messages! I wonder how 
Walter manages to read them AND maintain two complex compilers!



Apr 07 2003
parent reply "Achilleas Margaritis" <axilmar in.gr> writes:
Why complicate our lives? D should use 16-bit Unicode and provide implicit
conversions in any I/O, according to the environment. These conversions should
be transparent to the user. 65536 characters are enough to represent most
Earth languages...
Apr 15 2003
next sibling parent Ilya Minkov <midiclub tiscali.de> writes:
Achilleas Margaritis wrote:
 
 Why complicate our lives ? D should use 16-bit unicode and provide implicit
 conversions in any I/O, according to environment. These conversions should
 be transparent to the user. 65536 characters are enough to represent most
 Earth languages...
 

It doesn't yet make life easy. IIRC you only have less than 1/4 of this set
available. Besides, how do you treat separate accents, if they cannot be
combined into the letters? This is really rare though. 32-bit is better for
speed.

And besides, you have already let everyone down for years with that endless
"we don't care, since *our* language fits in less than seven bits. Now the
rest of the world may share the leftovers, if they like." Now, aren't we
letting anyone down with 16-bit?

Besides, the second type I proposed would usually use 1 byte for European,
Cyrillic, Arabic, Hebrew, Greek and such, unlike UTF-8 and UTF-16 which *both*
require 2, so it's better for storage!

-i.
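The separate-accents point is real: the same visible letter can be one precomposed code point or a base letter plus a combining mark. A small Python illustration:

```python
import unicodedata

decomposed = "e\u0301"     # 'e' + COMBINING ACUTE ACCENT: 2 code points
precomposed = "\u00e9"     # 'é' as a single precomposed code point

print(len(decomposed), len(precomposed))   # 2 1
# NFC normalization folds the accent into the letter where a precomposed
# form exists; some accent combinations have no precomposed form at all.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```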
Apr 16 2003
prev sibling parent "J. Daniel Smith" <J_Daniel_Smith HoTMaiL.com> writes:
As UNICODE 3.0 shows, 16 bits is not enough.  It's looking like around 21
bits are needed, a value which of course isn't very practical on today's
computers.
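The limit can be checked directly: the Unicode codespace tops out at U+10FFFF, so 21 bits cover it, while any character beyond U+FFFF already overflows a single 16-bit unit and costs UTF-16 a surrogate pair. A quick Python check:

```python
clef = "\U0001D11E"            # MUSICAL SYMBOL G CLEF, outside the BMP
print(hex(ord(clef)))          # 0x1d11e -- does not fit in 16 bits
print(ord(clef) > 0xFFFF)      # True
print(len(clef.encode("utf-16-be")))   # 4 -- a surrogate pair, two 16-bit units
print((0x10FFFF).bit_length())         # 21 -- bits for the full codespace
```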

   Dan

"Achilleas Margaritis" <axilmar in.gr> wrote in message
news:b7h8tm$t5j$1 digitaldaemon.com...
 [...]
 be transparent to the user. 65536 characters are enough to represent most
 Earth languages...

Apr 16 2003