
digitalmars.D - Evolution (Hello World)

reply Anders F Björklund <afb algonet.se> writes:
Taking a quick look at "Hello World"
shows a remarkable language evolution...


 From C:
 #include <stdio.h>
 #include <stdlib.h>
 
 int main(void)
 {
   puts("Hello, World!");
   return EXIT_SUCCESS;
 }
 
 int main(int argc, char *argv[])
 {
   int i;
   for (i = 0; i < argc; i++)
     printf("%d %s\n", i, argv[i]);
   return EXIT_SUCCESS;
 }

To "old D":
 import std.c.stdio;
 import std.c.stdlib;
 
 int main()
 {
   puts("Hello, World!");
   return EXIT_SUCCESS;
 }
 
 int main(char[][] args)
 {
   for (int i = 0; i < args.length; i++)
     printf("%d %.*s\n", i, args[i]);
   return EXIT_SUCCESS;
 }

To "new D":
 import std.stdio;
 
 void main()
 {
   writeln("Hello, World!");
 }
 
 void main(str[] args)
 {
   foreach (int i, str a; args)
     writefln("%d %s", i, a);
 }

Where I took the liberty of adding a few of my own RFEs:
1) "void main" should return 0 back to the operating system
2) new std.stdio.writeln, a formatless version of writefln
3) the "str" alias for the char[] type, like "bool" for bit

Not too bad for the first five years, if I say so myself ? Then again, I couldn't even use it here until GDC arrived... And it was not even a year ago, that David Friedman did that.

--anders
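For illustration, RFEs 2 and 3 can already be sketched with today's constructs -- a minimal sketch only, where `str` is the proposed name, not something D or Phobos actually defines:

```d
import std.stdio;

// RFE 3 sketched as a plain alias declaration -- "str" is hypothetical:
alias char[] str;

void main()
{
    str s = "Hello, World!".dup;  // .dup since string literals are immutable
    writeln(s);                   // RFE 2: a formatless writeln
}
```

The alias costs nothing at runtime; it is purely a spelling question.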
Feb 09 2005
parent reply Sebastian Beschke <s.beschke gmx.de> writes:
Anders F Björklund schrieb:
 3) the "str" alias for the char[] type, like "bool" for bit

This is an obvious question, but what would you propose for wchar[] and dchar[]? I think, considering that UTF-8 is not the optimal encoding in a huge number of cases, promoting it as "The One and Only String Type" would not be wise.

-Sebastian
Feb 10 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Sebastian Beschke wrote:

 3) the "str" alias for the char[] type, like "bool" for bit

This is an obvious question, but what would you propose for wchar[] and dchar[]?

My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized) There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...
 I think, considering that UTF-8 is not the optimal encoding in a huge 
 number of cases, promoting it as "The One and Only String Type" would 
 not be wise.

There is no "one and only one string type" in D, just as there is no "one and only one boolean type". There are three of each, and char[] is the preferred string type (what does "main" use ?) so it gets to be the str. And bit is the type of "true" and "false" so it gets to be the default bool type. If you want to speed up or optimize your code, you can change to using wchar[] or wbool... And when it is needed in a few places, you have dchar[] and dbool. The contents are exactly the same, just encoded differently - all strings are in Unicode and all booleans are in Zero-is-False.

I just find this shortform to be easier on the eyes:

void main(str[] args);
str[str] dictionary;

UTF-8 has two major advantages: 1) it's optimized for ASCII and does not require a BOM mark, making it compatible for files too 2) it is Endian agnostic, no more X86 vs PPC gruffs like the others.

If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using str is that it's a little bigger/slower)

--anders
Feb 10 2005
next sibling parent reply Anders F Björklund <afb algonet.se> writes:
I wrote:

 3) the "str" alias for the char[] type, like "bool" for bit

This is an obvious question, but what would you propose for wchar[] and dchar[]?

My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized) There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...

Another possibility is wstr for wchar[] ("wide string") and ustr for dchar[] ("Unicode string"), which might perhaps work better and be a tad more logical too... I like "str" better than "string", because it:
1) rhymes with int, char, bool and the others
2) is shorter to type, easily 50% saved
3) doesn't confuse anyone with C++ std::string

--anders
Feb 10 2005
parent reply "Matthew" <admin stlsoft.dot.dot.dot.dot.org> writes:
"Anders F Björklund" <afb algonet.se> wrote in message 
news:cugd6f$2ptf$1 digitaldaemon.com...
I wrote:

 3) the "str" alias for the char[] type, like "bool" for bit

This is an obvious question, but what would you propose for wchar[] and dchar[]?

My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized) There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...

Another possibility is wstr for wchar[] ("wide string") and ustr for dchar[] ("Unicode string"), which might perhaps work better and be a tad more logical too... I like "str" better than "string", because it: 1) rhymes with int, char, bool and the others 2) is shorter to type, easily 50% saved 3) doesn't confuse anyone with C++ std::string

I don't think char[] should have an alias. Strings in D are slices, for very good reason, and it's good for that to be foremost in peoples' minds.
Feb 10 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Matthew wrote:

 I don't think char[] should have an alias. Strings in D are slices, for 
 very good reason, and it's good for that to be foremost in peoples' 
 minds.

I thought that strings in D were sliceable codepoint arrays, but not necessarily slices always ? It's just an alias, the type is still char[] ? (and wchar[] and dchar[], but anyway)

But I rethought and found "ustr" to be silly altogether...

alias char[] str;    // ASCII-optimized
alias wchar[] wstr;  // Unicode-optimized
alias dchar[] dstr;  // codepoint-optimized

More orthogonal that way ? (with "char[]" = "str", always)

char[] by itself is actually not that bad, but this is:

int main(char[][] args);
char[][char[]] dictionary;

--anders
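The trio above works as ordinary alias declarations -- a sketch, where str/wstr/dstr are only the proposed names, not anything Phobos defines:

```d
alias char[]  str;   // ASCII-optimized (UTF-8)
alias wchar[] wstr;  // Unicode-optimized (UTF-16)
alias dchar[] dstr;  // codepoint-optimized (UTF-32)

void main()
{
    str[str] dictionary;                  // reads easier than char[][char[]]
    dictionary["hello".dup] = "hej".dup;  // associative array keyed by string
    assert(dictionary["hello".dup] == "hej");
}
```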
Feb 10 2005
parent reply "Matthew" <admin stlsoft.dot.dot.dot.dot.org> writes:
 char[] by itself is actually not that bad, but this is:
 int main(char[][] args);
 char[][char[]] dictionary;

Now you've got something of a point there. But, still, I'd prefer to leave it as char[]. The example you give is only 1-dim string / 2-dim char. What about higher dimensionality (of anything)? We could end up in the cow-dung of LPPPCSTR, etc.
Feb 10 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Matthew wrote:

 Now you've got something of a point there. But, still, I'd prefer to 
 leave it as char[]. The example you give is only 1-dim string / 2-dim 
 char. What about higher dimensionality (of anything)? We could end up in 
 the cow-dung of LPPPCSTR, etc. 

Ehrm, nooooo ? "The line must be drawn here". :-) I just wanted some easier basics, for beginners ? For the higher levels, you still need to learn about bit and char[] and other behind-the-scenes. It's just similar to the "alias foo* fooPtr;", that seems to always enter the picture after one has seen one too many stars fly by... I'll (re)post my Grand Scheme of Std Aliases. --anders
Feb 10 2005
parent "Matthew" <admin stlsoft.dot.dot.dot.dot.org> writes:
"Anders F Björklund" <afb algonet.se> wrote in message 
news:cugfi7$2sll$2 digitaldaemon.com...
 Matthew wrote:

 Now you've got something of a point there. But, still, I'd prefer to 
 leave it as char[]. The example you give is only 1-dim string / 2-dim 
 char. What about higher dimensionality (of anything)? We could end up 
 in the cow-dung of LPPPCSTR, etc.

Ehrm, nooooo ? "The line must be drawn here". :-) I just wanted some easier basics, for beginners ? For the higher levels, you still need to learn about bit and char[] and other behind-the-scenes.

I know. And I like your sentiment. It's just that I think that the string-is-a-slice concept is so important and fundamental to D that it's more likely to be a disservice in the medium/long term.
Feb 10 2005
prev sibling parent reply Derek <derek psych.ward> writes:
On Thu, 10 Feb 2005 20:01:50 +0100, Anders F Björklund wrote:

 Sebastian Beschke wrote:
 
 3) the "str" alias for the char[] type, like "bool" for bit

This is an obvious question, but what would you propose for wchar[] and dchar[]?

My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized) There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...
 I think, considering that UTF-8 is not the optimal encoding in a huge 
 number of cases, promoting it as "The One and Only String Type" would 
 not be wise.

There is no "one and only one string type" in D, just as there is no "one and only one boolean type". There are three of each, and char[] is the preferred string type (what does "main" use ?) so it gets to be the str. And bit is the type of "true" and "false" so it gets to be the default bool type. If you want to speed up or optimize your code, you can change to using wchar[] or wbool... And when it is needed in a few places, you have dchar[] and dbool. The contents are exactly the same, just encoded differently - all strings are in Unicode and all booleans are in Zero-is-False. I just find this shortform to be easier on the eyes: void main(str[] args); str[str] dictionary; UTF-8 has two major advantages: 1) it's optimized for ASCII and does not require a BOM mark, making it compatible for files too 2) it is Endian agnostic, no more X86 vs PPC gruffs like the others. If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using str is that it's a little bigger/slower)

One cannot easily address individual code points using utf8. For example...

char[] SomeText;

You cannot be sure whether SomeText[5] addresses the beginning of a code point or not. Remember that code points in utf8 are variable length, but are fixed length in utf32. So if using utf8, and one is doing some form of character manipulation, one should first convert to utf32, do the work, then convert back to utf8.

--
Derek
Melbourne, Australia
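Derek's point can be shown directly -- a small sketch using std.utf.stride, with the code-unit counts assuming the literal below:

```d
import std.utf;

void main()
{
    string s = "naïve";          // 'ï' (U+00EF) takes two UTF-8 code units
    assert(s.length == 6);       // six code units, but only five code points
    assert(stride(s, 2) == 2);   // the code point starting at unit 2 spans two units
    // s[3] is a continuation byte -- indexing there lands mid-character
}
```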
Feb 10 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Derek wrote:

UTF-8 has two major advantages: 1) it's optimized for ASCII and
does not require a BOM mark, making it compatible for files too
2) it is Endian agnostic, no more X86 vs PPC gruffs like the others

If you do a lot of Unicode, or non-Western languages, switch to
ustr instead? It's equally well supported in all D std libraries.
(the only downside of using str is that it's a little bigger/slower)

One cannot easily address individual code points using utf8. For example... char[] SomeText; You cannot be sure that SomeText[5] address the beginning of a code point or not. Remembering that code points in utf8 are variable length, but are fixed length in utf32.

This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string. See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:
 Code-point boundaries, iteration, and indexing are very fast with
 UTF-32. Code-point boundaries, accessing code points at a given offset,
 and iteration involve a few extra machine instructions for UTF-16; UTF-8
 is a bit more cumbersome. Indexing is slow for both of them, but in
 practice indexing by different code units is done very rarely, except
 when communicating with specifications that use UTF-32 code units, such
 as XSL.
 
 This point about indexing is true unless an API for strings allows
 access only by code point offsets. This is a very inefficient design:
 strings should always allow indexing with code unit offsets.

But char[] works fine for ASCII and wchar[] works fine for Unicode, *as long* as you watch out for any surrogates in the code units... Which means you can have a fast standard route, and extra code to handle the exceptional characters if and when they occur ?
 So if using utf8, and one is doing some form of character manipulation, one
 should first convert to utf32, do the work, then convert back to utf8.

Yes, and this is easily done with a foreach(dchar c; SomeText) loop, as D can transparently handle the transition between char[] and dchar... There are also readily available functions in the std.utf module: "encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers. If you do a lot of loops like that, you can use a dchar[] (dstr alias) as an intermediate storage. But char[] and wchar[] are better for long term.

--anders
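Both routes above can be sketched in a few lines -- the foreach decoding is in the language, and the round trip uses the std.utf wrappers named above:

```d
import std.utf;

void main()
{
    string s = "héllo";          // 'é' takes two UTF-8 code units

    // D decodes UTF-8 transparently when the loop variable is dchar:
    size_t n;
    foreach (dchar c; s)
        n++;
    assert(n == 5);              // five code points in six code units

    // Convert to UTF-32, manipulate by index, convert back:
    dchar[] tmp = toUTF32(s).dup;
    tmp[1] = 'e';                // safe: one dchar per code point
    assert(toUTF8(tmp) == "hello");
}
```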
Feb 10 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Thu, 10 Feb 2005 22:47:18 +0100, Anders F Björklund wrote:

 Derek wrote:
 
UTF-8 has two major advantages: 1) it's optimized for ASCII and
does not require a BOM mark, making it compatible for files too
2) it is Endian agnostic, no more X86 vs PPC gruffs like the others

If you do a lot of Unicode, or non-Western languages, switch to
ustr instead? It's equally well supported in all D std libraries.
(the only downside of using str is that it's a little bigger/slower)

One cannot easily address individual code points using utf8. For example... char[] SomeText; You cannot be sure that SomeText[5] address the beginning of a code point or not. Remembering that code points in utf8 are variable length, but are fixed length in utf32.

This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string.

I obviously do a different sort of programming to you. I often need to look at individual code points (ie. characters) in a string.
 See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:
 
 Code-point boundaries, iteration, and indexing are very fast with
 UTF-32. Code-point boundaries, accessing code points at a given offset,
 and iteration involve a few extra machine instructions for UTF-16; UTF-8
 is a bit more cumbersome. Indexing is slow for both of them, but in
 practice indexing by different code units is done very rarely, except
 when communicating with specifications that use UTF-32 code units, such
 as XSL.
 
 This point about indexing is true unless an API for strings allows
 access only by code point offsets. This is a very inefficient design:
 strings should always allow indexing with code unit offsets.


Yes, and a simple index into a char[] doesn't do this for you.
 
 But char[] works fine for ASCII and wchar[] works fine for Unicode,
 *as long* as you watch out for any surrogates in the code units...
 
 Which means you can have a fast standard route, and extra code
 to handle the exceptional characters if and when they occur ?

'exceptional' to whom? To latin-based alphabet users maybe, but not the great majority of the world's population.
 So if using utf8, and one is doing some form of character manipulation, one
 should first convert to utf32, do the work, then convert back to utf8.

Yes, and this is easily done with a foreach(dchar c; SomeText) loop, as D can transparently handle the transition between char[] and dchar...

Except for "character manipulation" as 'foreach(inout dchar c; SomeText)' is not permitted.
 There are also readily available functions in the std.utf module:
 "encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.

Exactly my point. One needs to use these if *manipulating* characters in a utf8 or utf16 string.
 If you do a lot of loops like that, you can use a dchar[] (dstr alias) as
 an intermediate storage. But char[] and wchar[] are better for long term.

'long term' meaning ??? Disk storage? RAM storage? Or until we finally get rid of all those silly 'alphabets' out there ;-) -- Derek Melbourne, Australia 11/02/2005 9:38:27 AM
Feb 10 2005
next sibling parent reply Anders F Björklund <afb algonet.se> writes:
Derek Parnell wrote:

But char[] works fine for ASCII and wchar[] works fine for Unicode,
*as long* as you watch out for any surrogates in the code units...

Which means you can have a fast standard route, and extra code
to handle the exceptional characters if and when they occur ?

'exceptional' to whom? To latin-based alphabet users maybe, but not the great majority of the world's population.

No, but it's mine ;-) (the ignorant westerner that I am)

Seriously, in my own language - Swedish - about 10% of the text is non-ASCII, which means that Walter's optimized US-ASCII parts run for 90% of the time. I assume this is the same for the rest of the previously ISO-8859-X using Western world languages...

Had I been using another alphabet, like Japanese or Chinese, then UTF-16 had been a nice bet. Surrogate characters are not occurring very often; in fact they were just now introduced in Java 1.5 since the original 16 bits of Unicode "overflowed". So I think there's a 90-10 rule here too, with non-Surrogates.

So I do think talking about "exceptions" is warranted ?
Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
as D can transparently handle the transition between char[] and dchar...

Except for "character manipulation" as 'foreach(inout dchar c; SomeText)' is not permitted.

We are talking Copy-on-Write here, yes ? As in reading from readonly and writing to readwrite ?

Otherwise you could use dchar[] instead, and do a simple indexing (or a foreach(inout dchar c; SomeText) on it). And convert from UTF-8/UTF-16 on the way in, do all the processing on the UTF-32 internal array, and convert back to UTF-8/UTF-16 on the way out.

(most routines now do include a dchar[] interface too, you can even use dchar[] in switch/case statements - if you like)
If you lot of loops like that, you can use a dchar[] (dstr alias) as a
intermediate storage. But char[] and wchar[] are better for long term.

'long term' meaning ??? Disk storage? RAM storage? Or until we finally get rid of all those silly 'alphabets' out there ;-)

Storage. Even with all "silly alphabets" utilized, there are still 11 dead bits in each UTF-32 character. UTF-16 is bound to be more efficient, unless you are doing extinct languages research or something? :-)

It's not just me... See http://www.unicode.org/faq/utf_bom.html#UTF32

--anders
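The storage argument is easy to check -- a sketch comparing the three encodings of the same ASCII text:

```d
void main()
{
    string  a = "hello";    // UTF-8:  1 byte per ASCII character
    wstring b = "hello"w;   // UTF-16: 2 bytes per BMP character
    dstring c = "hello"d;   // UTF-32: 4 bytes, 11+ of the 32 bits never used

    assert(a.length * char.sizeof  == 5);
    assert(b.length * wchar.sizeof == 10);
    assert(c.length * dchar.sizeof == 20);
}
```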
Feb 10 2005
parent James McComb <ned jamesmccomb.id.au> writes:
Anders F Björklund wrote:

 Had I been using another alphabet, like Japanese or Chinese,
 then UTF-16 had been a nice bet. Surrogate characters are not
 occurring very often, in fact they were just now introduced
 in Java 1.5 since the original 16 bits of Unicode "overflowed".
 So I think there's a 90-10 rule here too, with non-Surrogates.

Does anyone here know if Japanese and Chinese use a lot of ASCII punctuation? If they do, then maybe UTF-8 is reasonable. James McComb
Feb 10 2005
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Derek Parnell" <derek psych.ward> wrote in message
news:uk7573l4ag4s.fkp4buj0rl0e.dlg 40tude.net...
 This is not that much of a problem, since you should not address
 individual code points anyway but treat the code units as a string.

I obviously do a different sort of programming to you. I often need to look at individual code points (ie. characters) in a string.

Take a look at the functions std.utf.stride, std.utf.toUCSindex, and std.utf.toUTFindex. They provide the basic building blocks to manipulate UTF-8 strings as if they were an array of UCS characters.
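A quick sketch of those three building blocks, assuming the std.utf signatures as documented (counts hold for the literal below, where 'ö' and 'å' are two UTF-8 code units each):

```d
import std.utf;

void main()
{
    string s = "smörgås";           // 7 characters, 9 UTF-8 code units

    assert(stride(s, 2) == 2);      // width of the code point at unit 2 ('ö')
    assert(toUCSindex(s, 4) == 3);  // code unit 4 starts the 4th character ('r')
    assert(toUTFindex(s, 5) == 6);  // ...and the 6th character ('å') starts at unit 6
}
```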
Feb 11 2005