digitalmars.D - std.string will get the boot

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I plan a few improvements to Phobos that will improve string handling.

Currently arrays of characters count as random-access ranges, which is 
not true for arrays of char and wchar. I plan to make std.range aware of 
that and only characterize char[] and wchar[] (and their qualified 
versions) as bidirectional ranges. Also, std.range will define s.front 
and s.back for strings to return the correctly decoded dchar. Naturally, 
s.popFront and s.popBack will yank an entire encoded character, which is 
what you want most of the time anyway. (You're still free to do s = s[1 
.. $] if that's what you need.)
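
To make the proposal concrete, here is a minimal sketch of the behaviour described above. It assumes a Phobos in which std.range already treats strings this way (as far as I know, that is how std.range.primitives behaves in current releases); the helper names firstCodePoint and dropFirstCodePoint are made up for illustration.

```d
import std.range.primitives : front, popFront,
    isBidirectionalRange, isRandomAccessRange;

// Hypothetical helper: the first decoded code point of a UTF-8 string.
dchar firstCodePoint(string s)
{
    return s.front;
}

// Hypothetical helper: the string with one whole encoded character removed.
string dropFirstCodePoint(string s)
{
    s.popFront();
    return s;
}

void main()
{
    // Strings are bidirectional, but not random-access, ranges.
    static assert(isBidirectionalRange!string);
    static assert(!isRandomAccessRange!string);

    string s = "\u00E9a";              // 'é' takes two UTF-8 code units
    assert(s.length == 3);             // length still counts code units
    assert(firstCodePoint(s) == '\u00E9');
    assert(dropFirstCodePoint(s) == "a");
    assert(s[1 .. $].length == 2);     // slicing by code unit is still allowed
}
```

Note that s[1 .. $] here cuts the 'é' in half, which is exactly why popFront is the safer default.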

These changes will have the great effect of enabling std.algorithm to 
work with strings correctly without any further impedance adaptation. 
(At some point I'd defined byDchar to wrap a string as a bidirectional 
range; it works, but of course it's much better without an intermediary.)

Following that change, I plan to eliminate std.string entirely and roll 
all of its functionality into std.algorithm. This is because I noticed 
that I'd like many string functions to be available for other data 
types, and also because people who want to define their own non-UTF 
encodings can benefit from the support that UTF already has.

(As an example, startsWith or endsWith are very useful not only with 
strings, but general data as well.)
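
A small sketch of that point, assuming the generic startsWith and endsWith found in std.algorithm.searching in current Phobos; the helper name hasHeader is hypothetical.

```d
import std.algorithm.searching : startsWith, endsWith;

// Hypothetical helper: does a block of integers begin with a given header?
bool hasHeader(const(int)[] data, const(int)[] header)
{
    return data.startsWith(header);
}

void main()
{
    // The same algorithm serves strings...
    assert("hello world".startsWith("hello"));
    assert("hello world".endsWith("world"));

    // ...and arbitrary data.
    assert(hasHeader([1, 2, 3, 4], [1, 2]));
    assert(!hasHeader([1, 2, 3, 4], [2, 3]));
    assert([1, 2, 3, 4].endsWith([3, 4]));
}
```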

A possible idea would be to move algorithms out of std.string and roll 
std.utf and std.encoding into std.string. That way std.string becomes 
something UTF-specific, which may be sensible.

One problem I foresee is the growth of std.algorithm. It already has 
many things in it, and I fear that some user who just wants to trim a 
string may find it intimidating to browse through all that 
documentation. I wonder how we could break std.algorithm into smaller 
units (which is an issue largely independent from generalizing the 
algorithms now found in std.string).

Any ideas are welcome.


Andrei
Jan 29 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 Currently arrays of characters count as random-access ranges, which is 
 not true for arrays of char and wchar. I plan to make std.range aware of 
 that and only characterize char[] and wchar[] (and their qualified 
 versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters", they need more than one of such dchar. So dchar too may be a bidirectional range.

I can't remember the bit size of wchar and dchar. So names like char, char16 and char32 can be better...

Sometimes I have ugly 7-bit ASCII strings, I am not sure I want to be forced to use cast(ubyte[]) every time I use an algorithm on them :-)
 One problem I foresee is the growth of std.algorithm. It already has 
 many things in it, and I fear that some user who just wants to trim a 
 string may find it intimidating to browse through all that 
 documentation.

It's not just a matter of documentation: to choose among n items a human needs more time as n grows (people who design important menus in GUIs must be aware of this). So huge APIs slow down programming. A possible solution is to keep the std.string module, but make it just a list of aliases and thin wrappers around functions of std.algorithm, tuned for string processing (for example, I usually don't need tolower on generic arrays; there are some operations that are mostly useful for strings).

Bye,
bearophile
Jan 29 2010
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 Currently arrays of characters count as random-access ranges, which is 
 not true for arrays of char and wchar. I plan to make std.range aware of 
 that and only characterize char[] and wchar[] (and their qualified 
 versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters", they need more than one of such dchar. So dchar too may be a bidirectional range.

[citation needed]
 I can't remember the bit size of wchar and dchar. So names like char, char16
and char32 can be better...

I think it's a tad late for that.
 Sometimes I have ugly 7-bit ASCII strings, I am not sure I want to be forced
to use cast(ubyte[]) every time I use an algorithm on them :-)

That's exactly one of the cases in which my change would help. char is UTF-8, so that's out as an option for expressing ASCII characters. You'll be able to define your own type: struct AsciiChar { ubyte datum; ... } Then express stuff in terms of AsciiChar[] etc.
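
A rough sketch of what such a user-defined type could look like. Only the AsciiChar name and its datum field come from the text above; the toAscii helper is hypothetical, and the range traits are those of std.range.primitives in current Phobos.

```d
import std.range.primitives : isRandomAccessRange;

struct AsciiChar
{
    ubyte datum;
}

// Hypothetical helper: convert a string known to be 7-bit ASCII.
AsciiChar[] toAscii(string s)
{
    AsciiChar[] result;
    foreach (char c; s)   // iterate code units, no decoding
    {
        assert(c < 0x80, "not a 7-bit ASCII code unit");
        result ~= AsciiChar(cast(ubyte) c);
    }
    return result;
}

void main()
{
    // AsciiChar[] is an ordinary array, hence random-access;
    // char[] is deliberately not treated as one.
    static assert(isRandomAccessRange!(AsciiChar[]));
    static assert(!isRandomAccessRange!(char[]));

    auto a = toAscii("abc");
    assert(a.length == 3);
    assert(a[1].datum == 'b');
}
```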
 One problem I foresee is the growth of std.algorithm. It already has 
 many things in it, and I fear that some user who just wants to trim a 
 string may find it intimidating to browse through all that 
 documentation.

It's not just a matter of documentation: to choose among n items a human needs more time as n grows (people who design important menus in GUIs must be aware of this). So huge APIs slow down programming. A possible solution is to keep the std.string module, but make it just a list of aliases and thin wrappers around functions of std.algorithm, tuned for string processing (for example, I usually don't need tolower on generic arrays; there are some operations that are mostly useful for strings).

That's a good possibility. Andrei
Jan 29 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Simen kjaeraas wrote:
 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 
 bearophile wrote:
 I can't remember the bit size of wchar and dchar. So names like char, 
 char16 and char32 can be better...

I think it's a tad late for that.

So adding aliases to object.d is not possible this late in the process? I'm not sure I want that to happen, just out of curiosity.

That would be possible. Andrei
Jan 30 2010
prev sibling parent reply Lionello Lunesu <lio lunesu.remove.com> writes:
On 30-1-2010 1:59, Andrei Alexandrescu wrote:
 bearophile wrote:
 Andrei Alexandrescu:
 Currently arrays of characters count as random-access ranges, which
 is not true for arrays of char and wchar. I plan to make std.range
 aware of that and only characterize char[] and wchar[] (and their
 qualified versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters", they need more than one of such dchar. So dchar too may be a bidirectional range.

[citation needed]

I also doubt 32-bit is not enough. In fact, Unicode has 0x10FFFF as the highest code point.
 Sometimes I have ugly 7-bit ASCII strings, I am not sure I want to be
 forced to use cast(ubyte[]) every time I use an algorithm on them :-)

That's exactly one of the cases in which my change would help. char is UTF-8, so that's out as an option for expressing ASCII characters. You'll be able to define your own type: struct AsciiChar { ubyte datum; ... } Then express stuff in terms of AsciiChar[] etc.

I miss typedef. I think this is exactly what typedef was intended for. Perhaps we can reintroduce it as a 'short hand' for such a struct? By the way, ASCII is a subset of UTF-8 (that was the whole point), so there's no reason why 'char[]' can't still be used for ASCII strings, right? L.
Jan 30 2010
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-01-30 22:06:06 -0500, Lionello Lunesu <lio lunesu.remove.com> said:

 On 30-1-2010 1:59, Andrei Alexandrescu wrote:
 bearophile wrote:
 Andrei Alexandrescu:
 Currently arrays of characters count as random-access ranges, which
 is not true for arrays of char and wchar. I plan to make std.range
 aware of that and only characterize char[] and wchar[] (and their
 qualified versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters", they need more than one of such dchar. So dchar too may be a bidirectional range.

[citation needed]

I also doubt 32-bit is not enough. In fact, Unicode has 0x10FFFF as the highest code point.

32-bit is enough to cover all code points. But there are many combining code points in Unicode, allowing you to combine diacritics with various other characters, such as an acute accent with a 'k'. Some of these combinations exist in precombined form and are considered equivalent. So if you want to count the number of characters the user actually sees instead of counting code points, then you need to take these combining code points into account.

But if you really wanted to iterate over "characters" instead of code points, note that it can become quite hard if you take into account double diacritics, combining diacritic signs placed across two letters. So I think it's reasonable to have dchar, a code point, as the base unit for iterating over a string.

http://en.wikipedia.org/wiki/Combining_character
http://en.wikipedia.org/wiki/Unicode_normalization

Another interesting case:
http://en.wikipedia.org/wiki/Combining_grapheme_joiner

Unicode, isn't it great?

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
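
The distinction between code units, code points and user-perceived characters can be sketched as follows. This assumes the byGrapheme and normalize functions that std.uni provides in current Phobos (they did not exist when this thread was written); the helper names codePoints and graphemes are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize;

// Hypothetical helper: number of code points (dchar) in a UTF-8 string.
size_t codePoints(string s)
{
    return s.walkLength;
}

// Hypothetical helper: number of user-perceived characters.
size_t graphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string k = "k\u0301";          // 'k' followed by a combining acute accent
    assert(k.length == 3);         // UTF-8 code units
    assert(codePoints(k) == 2);    // two code points...
    assert(graphemes(k) == 1);     // ...shown to the user as one character

    // A precombined form and its decomposed equivalent differ as raw
    // strings but compare equal after normalization (NFC by default).
    assert("e\u0301" != "\u00E9");
    assert(normalize("e\u0301") == "\u00E9");
}
```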
Jan 30 2010
prev sibling parent Lionello Lunesu <lio lunesu.remove.com> writes:
On 31-1-2010 16:34, Simen kjaeraas wrote:
 Lionello Lunesu <lio lunesu.remove.com> wrote:
 
 I miss typedef. I think this is exactly what typedef was intended
 for. Perhaps we can reintroduce it as a 'short hand' for such a
 struct?

struct Typedef( T ) { T payload; alias payload this; } Usage: alias Typedef!( int ) myInt; Is this what you want?

Using alias you lose all type safety. I remember Andrei mentioned that he and Walter couldn't agree whether typedef should behave as a sub or super class. I think it should not be looked at from an inheritance perspective, but just considered as a wrapper struct with a ctor that takes the underlying type.
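
To illustrate the concern, here is a small sketch of the quoted Typedef struct. The alias name MyInt is made up; the point is only that alias this makes the wrapper convert implicitly to the underlying type, while the reverse direction still needs an explicit construction.

```d
struct Typedef(T)
{
    T payload;
    alias payload this;
}

alias MyInt = Typedef!int;

void main()
{
    auto a = MyInt(5);

    int raw = a;            // compiles: alias this converts MyInt to int
    assert(raw == 5);
    assert(a + 1 == 6);     // arithmetic silently falls back to int

    // The wrapper leaks in one direction only.
    static assert(is(MyInt : int));
    static assert(!is(int : MyInt));
}
```

So a function taking an int will happily accept a MyInt, which is the loss of type safety being discussed.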
 By the way, ASCII is a subset of UTF-8 (that was the whole
 point), so there's no reason why 'char[]' can't still be used for
 ASCII strings, right?

As far as I have understood (I am no Unicode guru), in some locales toUpper and toLower map ASCII chars to non-ASCII chars. So ASCII being a strict subset of UTF-8 is not always true.

True, but then the upper- or lowercase result would no longer be ASCII. As long as you stick to ASCII, char[] should work just fine. So, toLower and toUpper can accept ASCII char[] but always output one of those new char ranges. Problem fixed :)

L.
Feb 01 2010
prev sibling next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 bearophile wrote:
 I can't remember the bit size of wchar and dchar. So names like char,  
 char16 and char32 can be better...

I think it's a tad late for that.

So adding aliases to object.d is not possible this late in the process? I'm not sure I want that to happen, just out of curiosity. -- Simen
Jan 30 2010
prev sibling next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Lionello Lunesu <lio lunesu.remove.com> wrote:

 I miss typedef. I think this is exactly what typedef was intended
 for. Perhaps we can reintroduce it as a 'short hand' for such a
 struct?

struct Typedef( T ) {
    T payload;
    alias payload this;
}

Usage:

alias Typedef!( int ) myInt;

Is this what you want?
 By the way, ASCII is a subset of UTF-8 (that was the whole
 point), so there's no reason why 'char[]' can't still be used for
 ASCII strings, right?

As far as I have understood (I am no Unicode guru), in some locales toUpper and toLower map ASCII chars to non-ASCII chars. So ASCII being a strict subset of UTF-8 is not always true.

-- 
Simen
Jan 31 2010
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Sun, 31 Jan 2010 01:30:41 +0300, Simen kjaeraas  
<simen.kjaras gmail.com> wrote:

 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 bearophile wrote:
 I can't remember the bit size of wchar and dchar. So names like char,  
 char16 and char32 can be better...

I think it's a tad late for that.

So adding aliases to object.d is not possible this late in the process? I'm not sure I want that to happen, just out of curiosity.

Everyone can do that on their own. I see no reason to pollute the namespace.
Jan 31 2010
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Sun, 31 Jan 2010 11:34:03 +0300, Simen kjaeraas
<simen.kjaras gmail.com> wrote:

 Lionello Lunesu <lio lunesu.remove.com> wrote:

 I miss typedef. I think this is exactly what typedef was intended
 for. Perhaps we can reintroduce it as a 'short hand' for such a
 struct?

struct Typedef( T ) { T payload; alias payload this; } Usage: alias Typedef!( int ) myInt; Is this what you want?
 By the way, ASCII is a subset of UTF-8 (that was the whole
 point), so there's no reason why 'char[]' can't still be used for
 ASCII strings, right?

As far as I have understood (I am no Unicode guru), in some locales toUpper and toLower map ASCII chars to non-ASCII chars. So ASCII being a strict subset of UTF-8 is not always true.

I only know one example (in Turkish):

i <-> İ
ı <-> I

That's a big issue because toUpper/toLower needs a locale to provide a correct result.
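
A sketch of the problem, assuming std.uni.toUpper from current Phobos, which takes no locale and so always maps 'i' to 'I'. The toUpperTurkish function is a hypothetical special-case wrapper, not a library API.

```d
import std.uni : toUpper;

// Hypothetical helper: upper-casing with the Turkish dotted/dotless i rules.
string toUpperTurkish(string s)
{
    string result;
    foreach (dchar c; s)          // decode one code point at a time
    {
        dchar up;
        switch (c)
        {
            case 'i':      up = '\u0130'; break; // i -> İ (dotted capital I)
            case '\u0131': up = 'I';      break; // ı -> I
            default:       up = toUpper(c); break;
        }
        result ~= up;             // appending a dchar re-encodes it as UTF-8
    }
    return result;
}

void main()
{
    // Without locale information the ASCII mapping wins.
    assert(toUpper("ali") == "ALI");

    // With the special case, the result is no longer pure ASCII.
    assert(toUpperTurkish("ali") == "AL\u0130");
    assert(toUpperTurkish("\u0131") == "I");
}
```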
Jan 31 2010
prev sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
On Sun, 31 Jan 2010 15:09:28 +0100, Denis Koroskin <2korden gmail.com>  
wrote:

 On Sun, 31 Jan 2010 01:30:41 +0300, Simen kjaeraas  
 <simen.kjaras gmail.com> wrote:

 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 bearophile wrote:
 I can't remember the bit size of wchar and dchar. So names like char,  
 char16 and char32 can be better...

I think it's a tad late for that.

So adding aliases to object.d is not possible this late in the process? I'm not sure I want that to happen, just out of curiosity.

Everyone can do that on their own. I see no reason to pollute the namespace.

Nor do I. I was only inquiring as to its feasibility. -- Simen
Jan 31 2010
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 1/29/10 18:36, Andrei Alexandrescu wrote:
 I plan a few improvements to Phobos that will improve string handling.

 Currently arrays of characters count as random-access ranges, which is
 not true for arrays of char and wchar. I plan to make std.range aware of
 that and only characterize char[] and wchar[] (and their qualified
 versions) as bidirectional ranges. Also, std.range will define s.front
 and s.back for strings to return the correctly decoded dchar. Naturally,
 s.popFront and s.popBack will yank an entire encoded character, which is
 what you want most of the time anyway. (You're still free to do s = s[1
 .. $] if that's what you need.)

 These changes will have the great effect of enabling std.algorithm to
 work with strings correctly without any further impedance adaptation.
 (At some point I'd defined byDchar to wrap a string as a bidirectional
 range; it works, but of course it's much better without an intermediary.)

 Following that change, I plan to eliminate std.string entirely and roll
 all of its functionality into std.algorithm. This is because I noticed
 that I'd like many string functions to be available for other data
 types, and also because people who want to define their own non-UTF
 encodings can benefit from the support that UTF already has.

I would keep std.string for string-specific functions and perhaps publicly import std.algorithm. For example, functions like tolower, icmp and toStringz.
 (As an example, startsWith or endsWith are very useful not only with
 strings, but general data as well.)

 A possible idea would be to move algorithms out of std.string and roll
 std.utf and std.encoding into std.string. That way std.string becomes
 something UTF-specific, which may be sensible.

 One problem I foresee is the growth of std.algorithm. It already has
 many things in it, and I fear that some user who just wants to trim a
 string may find it intimidating to browse through all that
 documentation. I wonder how we could break std.algorithm into smaller
 units (which is an issue largely independent from generalizing the
 algorithms now found in std.string).

Perhaps it's time to start adding more packages than just the std. Make std.algorithm a package and try to split it into several modules.
 Any ideas are welcome.


 Andrei

Jan 29 2010
next sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
Jacob Carlborg wrote:

 I would keep std.string for string specific functions and perhaps
 publicly import std.algorithm. For example functions like: tolower, icmp
 and toStringz.

I've been thinking about characters lately and have realized that tolower, toupper, icmp, and friends should not be in a string library. Those functions need an "alphabet" to be useful; not language, nor locale...

In fact, the character itself must have alphabet information. Otherwise a string like "ali & jim" cannot be converted to upper-case correctly(*) as "ALİ & JIM". And the word "correctly" there depends on each character's alphabet.

Similarly, two characters that look the same cannot be compared for ordering. Comparing the 'x' of one alphabet to the 'x' of another alphabet is a meaningless operation.

Ali
Jan 29 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ali Çehreli wrote:
 Jacob Carlborg wrote:
 
  > I would keep std.string for string specific functions and perhaps
  > publicly import std.algorithm. For example functions like: tolower, icmp
  > and toStringz.
 
 I've been thinking about characters lately and have realized that 
 tolower, toupper, icmp, and friends should not be in a string library. 
 Those functions need an "alphabet" to be useful; not language, nor 
 locale...
 
 In fact, the character itself must have alphabet information. Otherwise 
 a string like "ali & jim" cannot be converted to upper-case correctly(*) 
 as "ALİ & JIM". And the word "correctly" there depends on each 
 character's alphabet.
 
 Similarly, two characters that look the same cannot be compared for 
 ordering. Comparing the 'x' of one alphabet to the 'x' of another 
 alphabet is a meaningless operation.

My thoughts exactly. In fact I'm thinking of generalizing toupper and tolower for strings to take an optional trie mapping strings to strings. That way correct capitalization can be done for any string, given a good collection of capitalization patterns. Andrei
Jan 29 2010
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 1/29/10 22:18, Ali Çehreli wrote:
 Jacob Carlborg wrote:

  > I would keep std.string for string specific functions and perhaps
  > publicly import std.algorithm. For example functions like: tolower, icmp
  > and toStringz.

 I've been thinking about characters lately and have realized that
 tolower, toupper, icmp, and friends should not be in a string library.
 Those functions need an "alphabet" to be useful; not language, nor
 locale...

 In fact, the character itself must have alphabet information. Otherwise
 a string like "ali & jim" cannot be converted to upper-case correctly(*)
 as "ALİ & JIM". And the word "correctly" there depends on each
 character's alphabet.

 Similarly, two characters that look the same cannot be compared for
 ordering. Comparing the 'x' of one alphabet to the 'x' of another
 alphabet is a meaningless operation.

 Ali

I'm not sure I really understand this, probably because I don't know much about how Unicode works. I'm thinking out loud: if "i", as you have in "ali", has the corresponding "İ" as upper case, wouldn't that be a different character than the English "i"? If so, I'm not sure I see the problem. If not, I see the problem.
Jan 29 2010
parent reply Ali Çehreli <acehreli yahoo.com> writes:
Jacob Carlborg wrote:
 On 1/29/10 22:18, Ali Çehreli wrote:
 Jacob Carlborg wrote:

  > I would keep std.string for string specific functions and perhaps
  > publicly import std.algorithm. For example functions like: tolower,
 icmp
  > and toStringz.

 I've been thinking about characters lately and have realized that
 tolower, toupper, icmp, and friends should not be in a string library.
 Those functions need an "alphabet" to be useful; not language, nor
 locale...

 In fact, the character itself must have alphabet information. Otherwise
 a string like "ali & jim" cannot be converted to upper-case correctly(*)
 as "ALİ & JIM". And the word "correctly" there depends on each
 character's alphabet.

 Similarly, two characters that look the same cannot be compared for
 ordering. Comparing the 'x' of one alphabet to the 'x' of another
 alphabet is a meaningless operation.

 Ali

I'm not sure I really understand this, probably because I don't know much about how Unicode works. I'm thinking out loud: if "i", as you have in "ali", has the corresponding "İ" as upper case, wouldn't that be a different character than the English "i"?

'i' and 'i' are the same "character", because they have the same ASCII and Unicode values in different alphabets. But they are not the same "letter" when they are part of different text.

The iİ (and ıI) issue is probably too special. A number of Turkic alphabets chose ASCII 'i', probably for historical reasons. Unicode did not define a separate code point for 'i' either, probably because those alphabets were already using the ASCII 'i'.
 If so, I'm not
 sure I see the problem. If not, I see the problem.

The letter 'i' (and I) is special but the issue is valid for any other letter: Is it valid to compare an 'i' in English text to an 'i' in German text? I think it's only valid at the lowest data representation level. And ASCII never claims to be more than a code table for "information interchange". That part is fine. The problem is with the use of certain ranges of the ASCII table as the English alphabet. It is unfortunate that it works... :)

D is great in that it supports three separate Unicode encodings in the language, but encodings are at a lower level of abstraction than "letters". I am not sure what data is used for toUniUpper and toUniLower in std.uni, but they can't work correctly without alphabet information. They favor the ASCII layout, probably for historical reasons.

I think the problems with using the ASCII table for sorting are well known. A more interesting example is the Azeri alphabet: it uses the ASCII xX characters, but sorts them after hH.

Ali
Jan 29 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ali Çehreli wrote:
 D is great that it supports three separate Unicode encodings in the 
 language, but encodings are at a lower level of abstraction than 
 "letters". I am not sure what data is used for toUniUpper and toUniLower 
 in std.uni, but they can't work correctly without alphabet information. 
 They favor the ASCII layout, probably for historical reasons.
 
 I think the problems with using the ASCII table for sorting is well 
 known. A more interesting example is with the Azeri alphabet: it uses 
 the ASCII xX characters, but sorts them after hH.

My idea of functions for upper/lowercase would help you solve exactly the issue you mention. A conversion trie as an optional parameter would allow capitalizing Straße as STRASSE and ali as ALİ. The trie will match the longest substring of the original string and will have translation strings in the nodes. The way capitalization is done will depend on the way you set up the table.

Andrei
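
A rough sketch of the longest-match idea, not an actual Phobos API. For brevity it swaps the trie for a plain associative array and probes candidate substrings from longest to shortest; the function name upperWith and its table parameter are hypothetical. Code points without a table entry fall back to std.uni.toUpper.

```d
import std.algorithm.comparison : min;
import std.array : appender;
import std.uni : toUpper;
import std.utf : decode;

// Hypothetical: upper-case `s`, preferring the longest matching table entry.
string upperWith(string s, string[string] table)
{
    // The longest key (in code units) bounds how far ahead we need to look.
    size_t maxKey = 0;
    foreach (k; table.byKey)
        if (k.length > maxKey)
            maxKey = k.length;

    auto result = appender!string();
    size_t i = 0;
    outer: while (i < s.length)
    {
        // Try the longest candidate substring first.
        for (size_t len = min(maxKey, s.length - i); len > 0; --len)
        {
            if (auto p = s[i .. i + len] in table)
            {
                result.put(*p);   // emit the translation string
                i += len;
                continue outer;
            }
        }
        // No entry: decode one code point (this advances i) and
        // apply the default mapping.
        dchar c = decode(s, i);
        result.put(toUpper(c));
    }
    return result.data;
}

void main()
{
    assert(upperWith("Straße", ["ß": "SS"]) == "STRASSE");
    assert(upperWith("ali", ["i": "İ"]) == "ALİ");
    assert(upperWith("ali", null) == "ALI");   // empty table: default rules
}
```

A real trie would avoid the repeated hashing of substrings, but the observable behaviour (longest match wins, table decides the policy) is the same.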
Jan 29 2010
prev sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Jacob Carlborg (doob me.com)'s article
 Perhaps it's time to start adding more packages than just the std. Make
 std.algorithm a package and try to split it into several modules.

Please, no. I **HATE** fine-grained imports like Tango has. I don't want to write tons of boilerplate at the top of every file just to have access to a bunch of closely related functionality. If this is done, **PLEASE** at least make a std.algorithm.all that publicly imports everything in the old std.algorithm.
Jan 29 2010
parent Jonathan M Davis <jmdavisProg gmail.com> writes:
dsimcha wrote:

 == Quote from Jacob Carlborg (doob me.com)'s article
 Perhaps it's time to start adding more packages than just the std. Make
 std.algorithm a package and try to split it into several modules.

Please, no. I **HATE** fine-grained imports like Tango has. I don't want to write tons of boilerplate at the top of every file just to have access to a bunch of closely related functionality. If this is done, **PLEASE** at least make a std.algorithm.all that publicly imports everything in the old std.algorithm.

We need a balance. Fine-grained can be great, but if it's too fine-grained, it gets hard to find things and you have to import a ton of modules. Not fine-grained enough, however, and you have a hard time finding things because there's so much to search through in each module - though importing what you need is easy.

Personally, I'm fine with std.algorithm being split into sub-modules. It's already fairly large and splitting it up would make a lot of sense. But then a solution allowing you to import large portions - if not all of it - at once would definitely be nice. It's why being able to do something like import std.*; and have it recursively grab every sub-module would be nice. But std.algorithm.all is a good idea.

- Jonathan M Davis
Jan 29 2010
prev sibling parent reply Lutger <lutger.blijdestijn gmail.com> writes:
On 01/29/2010 06:36 PM, Andrei Alexandrescu wrote:
...
 One problem I foresee is the growth of std.algorithm. It already has
 many things in it, and I fear that some user who just wants to trim a
 string may find it intimidating to browse through all that
 documentation. I wonder how we could break std.algorithm into smaller
 units (which is an issue largely independent from generalizing the
 algorithms now found in std.string).

 Any ideas are welcome.


 Andrei

I like how naturaldocs, which is similar to ddoc, helps with this: by adding a group tag. See this example of a summary of a class:
http://www.naturaldocs.org/documenting/reference.html#Example_Class

Probably it is possible to come up with categories for algorithm like:
- functional tools
- searching and sorting
- string utilities
...

Arguably a more D-like alternative is to make std.algorithm a package and each 'category' a module of that package.
Jan 29 2010
next sibling parent Lutger <lutger.blijdestijn gmail.com> writes:
On 01/29/2010 09:13 PM, Lutger wrote:
 http://www.naturaldocs.org/documenting/reference.html#Example_Class

sorry, wrong anchor: http://www.naturaldocs.org/documenting/reference.html#Summaries
Jan 29 2010
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Lutger wrote:
 On 01/29/2010 06:36 PM, Andrei Alexandrescu wrote:
 ...
 One problem I foresee is the growth of std.algorithm. It already has
 many things in it, and I fear that some user who just wants to trim a
 string may find it intimidating to browse through all that
 documentation. I wonder how we could break std.algorithm into smaller
 units (which is an issue largely independent from generalizing the
 algorithms now found in std.string).

 Any ideas are welcome.


 Andrei

I like how naturaldocs, which is similar to ddoc, helps with this: by adding a group tag. See this example of a summary of a class:
http://www.naturaldocs.org/documenting/reference.html#Example_Class

Probably it is possible to come up with categories for algorithm like:
- functional tools
- searching and sorting
- string utilities
...

Arguably a more D-like alternative is to make std.algorithm a package and each 'category' a module of that package.

I think the idea of tags is awesome, particularly because it doesn't require one to divide items in disjoint sets. I'll think some more of it. It might require changes in ddoc. At any rate, sounds like a D3 thing. Until then, I think I'll add to std.algorithm in confidence that we can scale the documentation later. Andrei
Jan 29 2010
next sibling parent Lutger <lutger.blijdestijn gmail.com> writes:
On 01/29/2010 09:18 PM, Andrei Alexandrescu wrote:
 Lutger wrote:
 On 01/29/2010 06:36 PM, Andrei Alexandrescu wrote:
 ...
 One problem I foresee is the growth of std.algorithm. It already has
 many things in it, and I fear that some user who just wants to trim a
 string may find it intimidating to browse through all that
 documentation. I wonder how we could break std.algorithm into smaller
 units (which is an issue largely independent from generalizing the
 algorithms now found in std.string).

 Any ideas are welcome.


 Andrei

I like how naturaldocs, which is similar to ddoc, helps with this: by adding a group tag. See this example of a summary of a class:
http://www.naturaldocs.org/documenting/reference.html#Example_Class

Probably it is possible to come up with categories for algorithm like:
- functional tools
- searching and sorting
- string utilities
...

Arguably a more D-like alternative is to make std.algorithm a package and each 'category' a module of that package.

I think the idea of tags is awesome, particularly because it doesn't require one to divide items in disjoint sets. I'll think some more of it. It might require changes in ddoc. At any rate, sounds like a D3 thing. Until then, I think I'll add to std.algorithm in confidence that we can scale the documentation later. Andrei

Cool, tags are even better (naturaldocs groups aren't really tags). How are you going to do so? Perhaps it is better to reserve this as a standard ddoc section, saying it is 'to be implemented'? This way everybody can benefit eventually.
Jan 29 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 I think the idea of tags is awesome, particularly because it doesn't 
 require one to divide items in disjoint sets. I'll think some more of it.

A hierarchical D/Python-like module system isn't the only way to organize blocks of code. Both future Windows file system and Google Email use tags to create groups of items in a less disjoint way. But I don't know if it's possible to design the equivalent of a module system based on tags instead of a hierarchy of modules/packages (and superpackages). It seems a cute idea.
32 bits are not enough to represent certain "characters", they need more than
one of such dchar. So dchar too may be a bidirectional range.


I am far from expert about such hairy matters, so I can be wrong. This is from Wikipedia: http://en.wikipedia.org/wiki/UTF-32
Though a fixed number of bytes per code point seems convenient, it is not used
as much as the other Unicode encodings. It makes truncation slightly easier but
not significantly so compared to UTF-8 and UTF-16. It does not make calculating
the displayed width of a string any easier except in very limited cases, since
even with a "fixed width" font there may be more than one code point per
character position (combining marks) or more than one character position per
code point (for example CJK ideographs). Combining marks also mean editors
cannot treat one code point as being the same as one unit for editing.

That paragraph of text also links to:
http://en.wikipedia.org/wiki/Combining_character
http://en.wikipedia.org/wiki/CJK

Bye,
bearophile
Jan 29 2010
parent reply Lutger <lutger.blijdestijn gmail.com> writes:
On 01/29/2010 09:43 PM, bearophile wrote:
 Andrei Alexandrescu:
 I think the idea of tags is awesome, particularly because it doesn't
 require one to divide items in disjoint sets. I'll think some more of it.

A hierarchical D/Python-like module system isn't the only way to organize blocks of code. Both future Windows file system and Google Email use tags to create groups of items in a less disjoint way. But I don't know if it's possible to design the equivalent of a module system based on tags instead of a hierarchy of modules/packages (and superpackages). It seems a cute idea.

This is about the documentation, which at the moment is based on the module system, type system and order of declarations. Such tags allow for better indexes, organization and search through the docs.
Jan 29 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Lutger wrote:
 On 01/29/2010 09:43 PM, bearophile wrote:
 Andrei Alexandrescu:
 I think the idea of tags is awesome, particularly because it doesn't
 require one to divide items in disjoint sets. I'll think some more of 
 it.

A hierarchical D/Python-like module system isn't the only way to organize blocks of code. Both the future Windows file system and Gmail use tags to group items in a less disjoint way. But I don't know whether it's possible to design the equivalent of a module system based on tags instead of a hierarchy of modules/packages (and superpackages). It seems a cute idea.

This is about the documentation, which at the moment is based on the module system, type system and order of declarations. Such tags allow for better indexes, organization and search through the docs.

I don't think it would be too far-fetched to define and use tags for selective imports a la:

    // inside std.algorithm
    tag(string, comparison) bool startsWith(...)(...) { ... }

    // in client code
    // get everything tagged with "string"
    import std.algorithm : tag(string);

Andrei
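
The tag(...) syntax above is a proposal, not something the compiler accepts. As a rough sketch of how close existing features get, the following uses an ordinary selective import plus user-defined attributes (a language feature added after this thread). The names tag, hasTag, beginsWith and sortInPlace are hypothetical and exist only for this illustration:

```d
import std.algorithm : startsWith;   // existing selective import: pulls in one name
import std.traits : getUDAs;

// Hypothetical marker type; any struct can serve as an attribute.
struct tag { string name; }

@tag("string") @tag("comparison")
bool beginsWith(string s, string prefix)
{
    return s.startsWith(prefix);
}

@tag("sorting")
void sortInPlace(int[] a)
{
    import std.algorithm : sort;
    a.sort();
}

// Returns true if `sym` carries @tag(name). Usable at compile time.
bool hasTag(alias sym)(string name)
{
    foreach (t; getUDAs!(sym, tag))
        if (t.name == name)
            return true;
    return false;
}

void main()
{
    static assert(hasTag!beginsWith("string"));
    static assert(hasTag!beginsWith("comparison"));
    static assert(!hasTag!sortInPlace("string"));

    assert(beginsWith("std.algorithm", "std."));
}
```

This only lets code query tags on symbols it already knows about; importing "everything tagged string" would still need compiler support, as proposed above.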
Jan 29 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 // in client code
 // get everything tagged with "string"
 import std.algorithm :  tag(string);

A next step is to allow importing all names with a specified tag, even if such names are spread over more than one source file (the compiler can create a JSON text file to speed up this retrieval):

    import tag(string);

To keep things tidy I think it's better to minimize the number of different tags inside each file, so they are similar to modules anyway: perfect hierarchies are sometimes too rigid to represent real-life complexities, but an approximate hierarchy is tidier and simpler to understand than an amorphous soup of tags.

Bye,
bearophile
Jan 29 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Robert Jacques wrote:
 On Fri, 29 Jan 2010 15:18:14 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Lutger wrote:
 On 01/29/2010 06:36 PM, Andrei Alexandrescu wrote:
 ...
 One problem I foresee is the growth of std.algorithm. It already has
 many things in it, and I fear that some user who just wants to trim a
 string may find it intimidating to browse through all that
 documentation. I wonder how we could break std.algorithm into smaller
 units (which is an issue largely independent from generalizing the
 algorithms now found in std.string).

 Any ideas are welcome.


 Andrei

adding a group tag. See this example of a summary of a class: http://www.naturaldocs.org/documenting/reference.html#Example_Class

Probably it is possible to come up with categories for algorithm like:
- functional tools
- searching and sorting
- string utilities
...

Arguably a more D-like alternative is to make std.algorithm a package and each 'category' a module of that package.

I think the idea of tags is awesome, particularly because it doesn't require one to divide items in disjoint sets. I'll think some more of it. It might require changes in ddoc. At any rate, sounds like a D3 thing. Until then, I think I'll add to std.algorithm in confidence that we can scale the documentation later. Andrei

By the way, in the short term you could greatly improve the usability of std.algorithm by cleaning up the index ("jump to") at the top of the file. A simple alphabetical listing would be great, and you could easily start grouping links under categories (which would eventually become tags).

That "jump to" index is automatically generated. I can have it sorted alphabetically, which makes sense for large lists. But then should I also list components in alphabetical order? Andrei
Jan 29 2010
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Fri, 29 Jan 2010 15:18:14 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Lutger wrote:
 On 01/29/2010 06:36 PM, Andrei Alexandrescu wrote:
 ...
 One problem I foresee is the growth of std.algorithm. It already has
 many things in it, and I fear that some user who just wants to trim a
 string may find it intimidating to browse through all that
 documentation. I wonder how we could break std.algorithm into smaller
 units (which is an issue largely independent from generalizing the
 algorithms now found in std.string).

 Any ideas are welcome.


 Andrei

adding a group tag. See this example of a summary of a class: http://www.naturaldocs.org/documenting/reference.html#Example_Class

Probably it is possible to come up with categories for algorithm like:
- functional tools
- searching and sorting
- string utilities
...

Arguably a more D-like alternative is to make std.algorithm a package and each 'category' a module of that package.

I think the idea of tags is awesome, particularly because it doesn't require one to divide items in disjoint sets. I'll think some more of it. It might require changes in ddoc. At any rate, sounds like a D3 thing. Until then, I think I'll add to std.algorithm in confidence that we can scale the documentation later. Andrei

By the way, in the short term you could greatly improve the usability of std.algorithm by cleaning up the index ("jump to") at the top of the file. A simple alphabetical listing would be great, and you could easily start grouping links under categories (which would eventually become tags).
Jan 29 2010