digitalmars.D - Phobos strings versus C++ Boost

"Brad Anderson" <eco gnuk.net> writes:

The recent discussion got me wondering how Phobos stacked up
against the C++ Boost String Algorithms library.

Some background on the design of the Boost library:
http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

TL;DR: It works somewhat like ranges.

Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

I wouldn't be surprised if I missed functions that would do
things easily but I did look reasonably hard for ways to
accomplish things. Do share if you spot anything I missed but
everything should be intuitive rather than clever.

A few things stand out:

1. They have case-insensitive versions of pretty much everything.
It's not hard to do a map!toLower/toUpper in D but it's also not
obvious (nor do I know if that's actually correct in languages
outside of English).

2. Replace and erase options are very slim. Doing something like a
chain() on the results of findSplit() and what you want to inject
I guess would work for replacing but that's really not very
elegant. remove() is simply way too cumbersome to use. I guess
you could use indexOf, then indexOf again on a slice starting at
the first result, then pass both as a tuple to remove. That's
terrible though.

3. Doing an action several times rather than once is tricky.  As
in, there is no findAll() that returns a range of ranges. Doing
the things mentioned in 2 several times over a whole range just
adds another level of complication.
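
For illustration, here is a minimal sketch of the two workarounds described in points 2 and 3, assuming a reasonably recent Phobos. The helper names `replaceFirstLazy` and `findAllOffsets` are hypothetical and not part of Phobos; only `find`, `findSplit`, `equal` and `chain` are library calls.

```d
import std.algorithm : equal, find, findSplit;
import std.range : chain;

/// Hypothetical helper: lazily replaces the first occurrence of needle
/// by chaining the pieces returned by findSplit around the replacement.
auto replaceFirstLazy(string haystack, string needle, string replacement)
{
    auto parts = haystack.findSplit(needle);
    // parts[1] is empty when needle was not found; then nothing is injected.
    string middle = parts[1].length ? replacement : parts[1];
    return chain(parts[0], middle, parts[2]);
}

/// Hypothetical stand-in for the missing findAll(): code unit offsets of
/// every non-overlapping occurrence of needle, found by calling find repeatedly.
size_t[] findAllOffsets(string haystack, string needle)
{
    size_t[] offsets;
    if (needle.length == 0)
        return offsets;
    auto rest = haystack;
    while (true)
    {
        rest = rest.find(needle);
        if (rest.length == 0)
            break;
        // rest is always a suffix of haystack, so the lengths give the offset.
        offsets ~= haystack.length - rest.length;
        rest = rest[needle.length .. $];
    }
    return offsets;
}

void main()
{
    assert(replaceFirstLazy("hello C++ world", "C++", "D").equal("hello D world"));
    assert(replaceFirstLazy("hello world", "C++", "D").equal("hello world"));
    assert(findAllOffsets("abcabcab", "ab").length == 3);
}
```

Both helpers work, but the amount of code needed for something this common is the point being made above.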

Jan 10 2014

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Saturday, 11 January 2014 at 07:50:56 UTC, Brad Anderson wrote:
 The recent discussion got me wondering how Phobos stacked up
 against the C++ Boost String Algorithms library.

 Some background on the design of the Boost library:
 http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

 TL;DR: It works somewhat like ranges.

 Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

Some comments:

  * `empty` is a property - do not append parentheses/call syntax
  * `!find().empty` => `canFind` or `any`
  * `ifind_first/last` can use `find!((a, b) => a.toLower() == 
b.toLower())`
  * I think the Phobos equivalent of `find_tail` needs a second 
`retro`?
  * I don't like the idea of adding a predicate to joiner, I think 
using filter is better
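
A short sketch of what these suggestions might look like in practice. The wrapper name `ifindFirst` is hypothetical; the predicate compares code point by code point, so it is not locale-aware.

```d
import std.algorithm : any, canFind, find;
import std.range;            // brings in the `empty` property for strings
import std.uni : toLower;

/// Hypothetical helper: the remainder of haystack starting at the first
/// case-insensitive match of needle, or an empty string if there is none.
string ifindFirst(string haystack, string needle)
{
    return haystack.find!((a, b) => toLower(a) == toLower(b))(needle);
}

void main()
{
    assert(ifindFirst("Hello World", "WORLD") == "World");

    // `empty` is a property, so it is written without parentheses.
    assert(ifindFirst("Hello World", "xyz").empty);

    // Instead of `!find(...).empty`:
    assert("Hello World".canFind("World"));
    assert("Hello World".any!(c => c == 'W'));
}
```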

 1. They have case-insensitive versions of pretty much 
 everything.
 It's not hard to do a map!toLower/toUpper in D but it's also not
 obvious (nor do I know if that's actually correct in languages
 outside of English).

There are two pairs of toLower/toUpper - the ones in std.ascii 
and std.uni (the std.string pair aliases to std.uni). The latter 
pair works correctly for all scripts.
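
To make the difference between the two pairs concrete, here is a small sketch. The renamed imports `asciiToLower` and `uniToLower` are local aliases chosen for this example only.

```d
import std.ascii : asciiToLower = toLower;
import std.uni : uniToLower = toLower;

void main()
{
    dchar upperA = 'A';
    dchar upperAUmlaut = '\u00C4'; // Ä

    // Both agree on ASCII letters.
    assert(asciiToLower(upperA) == 'a');
    assert(uniToLower(upperA) == 'a');

    // std.ascii leaves non-ASCII characters untouched,
    // std.uni maps Ä (U+00C4) to ä (U+00E4).
    assert(asciiToLower(upperAUmlaut) == '\u00C4');
    assert(uniToLower(upperAUmlaut) == '\u00E4');
}
```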

 2. Replace and erase options are very slim. Doing something like a
 chain() on the results of findSplit() and what you want to inject
 I guess would work for replacing but that's really not very
 elegant. remove() is simply way too cumbersome to use. I guess
 you could use indexOf, then indexOf again on a slice starting at
 the first result, then pass both as a tuple to remove. That's
 terrible though.

I think the mutation algorithms in std.algorithm can handle most 
of these when used in conjunction with other algorithms, except 
that narrow strings do not have the property of assignable 
elements, which is kind of a fatal blow.
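
A minimal sketch of the limitation, assuming Phobos with auto-decoding of narrow strings. The helper `eraseRange` is hypothetical; it only works because `dchar[]` has assignable elements, while the same call on a `char[]` does not compile.

```d
import std.algorithm : remove;
import std.range : hasAssignableElements;
import std.typecons : tuple;

// Narrow strings are decoded to dchar on the fly, so their elements
// cannot be assigned through the range interface.
static assert(!hasAssignableElements!(char[]));
static assert( hasAssignableElements!(dchar[]));

/// Hypothetical helper: erases the code points in [from, to) from a UTF-32 buffer.
dchar[] eraseRange(dchar[] text, size_t from, size_t to)
{
    return text.remove(tuple(from, to));
}

void main()
{
    auto text = "hello"d.dup;
    // Removes indices 1 and 2 ('e' and the first 'l').
    assert(eraseRange(text, 1, 3) == "hlo"d);
}
```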

Jan 11 2014

"Brad Anderson" <eco gnuk.net> writes:

On Saturday, 11 January 2014 at 08:25:39 UTC, Jakob Ovrum wrote:

 Some comments:

  * `empty` is a property - do not append parentheses/call syntax

*Nod*

  * `!find().empty` => `canFind` or `any`

The documentation needs to be improved for canFind then. It takes
an `E needle` so I assumed it was an element type only.  The
other overload of canFind takes `Ranges needles` and stops when
it finds just one of them so I assumed it'd be called in the case
assert("123".canFind("321")) and would be true (>0). Looks like
the first overload just hands off to find() which can do either
element type or a subrange but that's not clear from the
documentation.
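
For reference, a small sketch of how the canFind overloads behave. The multiple-needle form has historically returned the 1-based index of the needle that matched, which is an assumption here, so the asserts below only rely on its truthiness.

```d
import std.algorithm : canFind;

void main()
{
    // Single needle: an element or a subrange both work.
    assert("123".canFind('2'));
    assert("123".canFind("23"));
    assert(!"123".canFind("321"));

    // Several needles: succeeds as soon as any one of them is found.
    assert("123".canFind("321", "23"));
    assert(!"123".canFind("321", "45"));
}
```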

any() needs some examples. I'm not sure how it'd be used for this
purpose.

I'll try to make some pull requests to fix both of these doc
issues.

  * `ifind_first/last` can use `find!((a, b) => a.toLower() == 
 b.toLower())`

Yeah, but as Michel pointed out this isn't really a valid way to
do case-insensitive comparison anyway.

  * I think the Phobos equivalent of `find_tail` needs a second 
 `retro`?

Yeah, very ugly.

  * I don't like the idea of adding a predicate to joiner, I 
 think using filter is better

I just figured for consistency since so much of std.algorithm
accepts a predicate. I'm not opposed to sticking with filter
though.

 1. They have case-insensitive versions of pretty much 
 everything.
 It's not hard to do a map!toLower/toUpper in D but it's also 
 not
 obvious (nor do I know if that's actually correct in languages
 outside of English).

 There are two pairs of toLower/toUpper - the ones in std.ascii 
 and std.uni (the std.string pair aliases to std.uni). The 
 latter pair works correctly for all scripts.

 2. Replace and erase options are very slim. Doing something like a
 chain() on the results of findSplit() and what you want to inject
 I guess would work for replacing but that's really not very
 elegant. remove() is simply way too cumbersome to use. I guess
 you could use indexOf, then indexOf again on a slice starting at
 the first result, then pass both as a tuple to remove. That's
 terrible though.

 I think the mutation algorithms in std.algorithm can handle 
 most of these when used in conjunction with other algorithms, 
 except that narrow strings do not have the property of 
 assignable elements, which is kind of a fatal blow.

Something needs to be done about this. I'm not sure what.

Jan 11 2014

Michel Fortin <michel.fortin michelf.ca> writes:

On 2014-01-11 07:50:54 +0000, "Brad Anderson" <eco gnuk.net> said:

 1. They have case-insensitive versions of pretty much everything.
 It's not hard to do a map!toLower/toUpper in D but it's also not
 obvious (nor do I know if that's actually correct in languages
 outside of English).

Uppercase, lowercase, and case-insensitive comparison is 
locale-dependent for Unicode. In the general case you can't just 
compare the lowercase/uppercase versions. For instance, look at the 
Turkish i/İ and ı/I (dot-less i), or the German ß/SS ss/SS pairs. Also, 
if you're sorting in alphabetical order you probably want to do 
something special with diacritics.

The correct way to do this is to implement the Unicode Collation Algorithm:
http://www.unicode.org/reports/tr10/
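
A small sketch of why comparing lowercased code points falls short. The helper `naiveEqualIgnoreCase` is hypothetical. `std.uni.icmp` is used for contrast: it applies full case folding, but it is not an implementation of the Unicode Collation Algorithm and takes no locale, so the Turkish case is presumably still not covered.

```d
import std.algorithm : equal, map;
import std.uni : icmp, toLower;

/// Hypothetical helper: compares two strings after lowering each code point.
bool naiveEqualIgnoreCase(string a, string b)
{
    return equal(a.map!(c => toLower(c)), b.map!(c => toLower(c)));
}

void main()
{
    // Fine for plain ASCII text.
    assert(naiveEqualIgnoreCase("Phobos", "PHOBOS"));

    // "ß" stays a single code point, so it never matches "ss".
    assert(!naiveEqualIgnoreCase("Rußland", "Russland"));

    // Full case folding treats "ß" and "ss" as equal.
    assert(icmp("Rußland", "Russland") == 0);
}
```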

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca

Jan 11 2014

"Brad Anderson" <eco gnuk.net> writes:

On Saturday, 11 January 2014 at 12:47:12 UTC, Michel Fortin wrote:
 On 2014-01-11 07:50:54 +0000, "Brad Anderson" <eco gnuk.net> 
 said:

 1. They have case-insensitive versions of pretty much 
 everything.
 It's not hard to do a map!toLower/toUpper in D but it's also 
 not
 obvious (nor do I know if that's actually correct in languages
 outside of English).

 Uppercase, lowercase, and case-insensitive comparison is 
 locale-dependent for Unicode. In the general case you can't 
 just compare the lowercase/uppercase versions. For instance, 
 look at the Turkish i/İ and ı/I (dot-less i), or the German 
 ß/SS ss/SS pairs. Also, if you're sorting in alphabetical order 
 you probably want to do something special with diacritics.

 The correct way to do this is to implement the Unicode 
 Collation Algorithm:
 http://www.unicode.org/reports/tr10/

I thought it was probably more complicated than that.

Looks like Dmitry put it in the tracker:
http://d.puremagic.com/issues/show_bug.cgi?id=10566

Jan 11 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Saturday, 11 January 2014 at 18:14:24 UTC, Brad Anderson wrote:
 On Saturday, 11 January 2014 at 12:47:12 UTC, Michel Fortin
 The correct way to do this is to implement the Unicode 
 Collation Algorithm:
 http://www.unicode.org/reports/tr10/

 I thought it was probably more complicated than that.

You should read the report...

Jan 11 2014

"Brad Anderson" <eco gnuk.net> writes:

On Saturday, 11 January 2014 at 18:56:53 UTC, monarch_dodra wrote:
 On Saturday, 11 January 2014 at 18:14:24 UTC, Brad Anderson 
 wrote:
 On Saturday, 11 January 2014 at 12:47:12 UTC, Michel Fortin
 The correct way to do this is to implement the Unicode 
 Collation Algorithm:
 http://www.unicode.org/reports/tr10/

 I thought it was probably more complicated than that.

 You should read the report...

I meant more complicated than toLower. I'm already plenty
intimidated by Unicode publications :)

Jan 11 2014

Jacob Carlborg <doob me.com> writes:

On 2014-01-11 08:50, Brad Anderson wrote:
 The recent discussion got me wondering how Phobos stacked up
 against the C++ Boost String Algorithms library.

 Some background on the design of the Boost library:
 http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

 TL;DR: It works somewhat like ranges.

 Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

toLower/Upper doesn't really work in place.

-- 
/Jacob Carlborg

Jan 11 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Saturday, 11 January 2014 at 20:36:31 UTC, Jacob Carlborg 
wrote:
 On 2014-01-11 08:50, Brad Anderson wrote:
 The recent discussion got me wondering how Phobos stacked up
 against the C++ Boost String Algorithms library.

 Some background on the design of the Boost library:
 http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

 TL;DR: It works somewhat like ranges.

 Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

 toLower/Upper doesn't really work in place.

Yeah, "toLowerInplace" is actually more like 
"toLowerProbablyInPlace"

Jan 11 2014

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2014 01:22, monarch_dodra wrote:
 On Saturday, 11 January 2014 at 20:36:31 UTC, Jacob Carlborg wrote:
 On 2014-01-11 08:50, Brad Anderson wrote:
 The recent discussion got me wondering how Phobos stacked up
 against the C++ Boost String Algorithms library.

 Some background on the design of the Boost library:
 http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

 TL;DR: It works somewhat like ranges.

 Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

 toLower/Upper doesn't really work in place.

 Yeah, "toLowerInplace" is actually more like "toLowerProbablyInPlace"

With high probability :)

And it's indeed quite high: the number of "bad sheep" that get 
longer or shorter across the whole of Unicode is around 5-10 code points, IIRC.
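
A rough sketch of how one might observe this, assuming `std.uni.toLowerInPlace`. The helper `lowerAndCheckReuse` is hypothetical. The second call uses "İ" (U+0130), whose lowercase form is longer, so the buffer may be reallocated there; the result is printed rather than asserted.

```d
import std.stdio : writeln;
import std.uni : toLowerInPlace;

/// Hypothetical helper: lowercases text and reports whether the
/// original buffer was reused (true) or replaced by a new one (false).
bool lowerAndCheckReuse(ref char[] text)
{
    auto before = text.ptr;
    toLowerInPlace(text);
    return text.ptr is before;
}

void main()
{
    // Pure ASCII: every mapping keeps its length, so the buffer is reused.
    char[] ascii = "HELLO".dup;
    assert(lowerAndCheckReuse(ascii));
    assert(ascii == "hello");

    // One of the "bad sheep": the lowercase form needs more code units.
    char[] mixed = "\u0130stanbul".dup;
    writeln("buffer reused: ", lowerAndCheckReuse(mixed));
}
```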

-- 
Dmitry Olshansky

Jan 11 2014

Jacob Carlborg <doob me.com> writes:

On 2014-01-11 22:42, Dmitry Olshansky wrote:

 With high probability :)

 And it's indeed quite high: the number of "bad sheep" that get
 longer or shorter across the whole of Unicode is around 5-10 code points, IIRC.

The least we can do is make that very clear in the documentation.

-- 
/Jacob Carlborg

Jan 12 2014

"Tobias Pankrath" <tobias pankrath.net> writes:

On Saturday, 11 January 2014 at 21:42:46 UTC, Dmitry Olshansky 
wrote:
 12-Jan-2014 01:22, monarch_dodra wrote:
 On Saturday, 11 January 2014 at 20:36:31 UTC, Jacob Carlborg 
 wrote:
 On 2014-01-11 08:50, Brad Anderson wrote:
 The recent discussion got me wondering how Phobos stacked up
 against the C++ Boost String Algorithms library.

 Some background on the design of the Boost library:
 http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

 TL;DR: It works somewhat like ranges.

 Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

 toLower/Upper doesn't really work in place.

 Yeah, "toLowerInplace" is actually more like 
 "toLowerProbablyInPlace"

 With high probability :)

 And it's indeed quite high: the number of "bad sheep" that get 
 longer or shorter across the whole of Unicode is around 5-10 
 code points, IIRC.

More important than the absolute amount of "bad sheep" is the 
frequency of them in your input :-)

Jan 12 2014

"Dominikus Dittes Scherkl" writes:

On Sunday, 12 January 2014 at 12:48:05 UTC, Tobias Pankrath wrote:
 On Saturday, 11 January 2014 at 21:42:46 UTC, Dmitry Olshansky 
 wrote:
 12-Jan-2014 01:22, monarch_dodra wrote:
 And it's indeed quite high: the number of "bad sheep" that 
 get longer or shorter across the whole of Unicode is around 5-10 
 code points, IIRC.

 More important than the absolute amount of "bad sheep" is the 
 frequency of them in your input :-)

In German the frequency of "ß" is 0.31%, and the mess with getting 
a longer result ("SS") only occurs with toUpper().
I think Greek has a similar problem, but I don't know the frequency 
there...

Jan 13 2014

Michel Fortin <michel.fortin michelf.ca> writes:

On 2014-01-13 17:15:21 +0000, "Dominikus Dittes Scherkl" 
<Dominikus.Scherkl continental-corporation.com> said:

 On Sunday, 12 January 2014 at 12:48:05 UTC, Tobias Pankrath wrote:
 On Saturday, 11 January 2014 at 21:42:46 UTC, Dmitry Olshansky wrote:
 12-Jan-2014 01:22, monarch_dodra wrote:
 And it's indeed quite high: the number of "bad sheep" that get 
 longer or shorter across the whole of Unicode is around 5-10 code points, IIRC.

 
 More important than the absolute amount of "bad sheep" is the frequency 
 of them in your input :-)

 
 In German the frequency of "ß" is 0.31%, and the mess with getting a longer
 result ("SS") only occurs with toUpper().
 I think Greek has a similar problem, but I don't know the frequency there...

The funny thing about "ß" is that in UTF-8 it's two bytes (0xC3 0x9F) 
and you replace it with "SS" which is two bytes too (0x53 0x53). So 
with some cleverness it can be done in place for char[], but not for 
wchar[] or dchar[]. :-)
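
The byte counts can be checked directly, and the "cleverness" might look something like the sketch below. The helper `upperEszettInPlace` is hypothetical and only handles "ß"; it is not a general toUpper.

```d
import std.string : representation;

/// Hypothetical helper: overwrites every UTF-8 encoded "ß" (0xC3 0x9F)
/// with "SS" (0x53 0x53) without changing the length of the buffer.
void upperEszettInPlace(char[] text)
{
    for (size_t i = 0; i + 1 < text.length; ++i)
    {
        if (text[i] == 0xC3 && text[i + 1] == 0x9F)
        {
            text[i] = 'S';
            text[i + 1] = 'S';
            ++i; // skip the second byte that was just written
        }
    }
}

void main()
{
    immutable ubyte[] eszett = [0xC3, 0x9F];
    assert("ß".representation == eszett);

    // Same size in UTF-8, but not in UTF-16 or UTF-32.
    assert("ß".length == 2 && "SS".length == 2);
    assert("ß"w.length == 1 && "SS"w.length == 2);
    assert("ß"d.length == 1 && "SS"d.length == 2);

    char[] word = "straße".dup;
    upperEszettInPlace(word);
    assert(word == "straSSe");
}
```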

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca

Jan 13 2014

"Brad Anderson" <eco gnuk.net> writes:

On Saturday, 11 January 2014 at 20:36:31 UTC, Jacob Carlborg
wrote:
 On 2014-01-11 08:50, Brad Anderson wrote:
 The recent discussion got me wondering how Phobos stacked up
 against the C++ Boost String Algorithms library.

 Some background on the design of the Boost library:
 http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

 TL;DR: It works somewhat like ranges.

 Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

 toLower/Upper doesn't really work in place.

Yeah, it's kind of an argument for and against Phobos/D. InPlace
can't be truly in place like Boost's is because we have actual
Unicode support.

Jan 11 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/10/14 11:50 PM, Brad Anderson wrote:
 The recent discussion got me wondering how Phobos stacked up
 against the C++ Boost String Algorithms library.

 Some background on the design of the Boost library:
 http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

 TL;DR: It works somewhat like ranges.

 Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

[snip]

Awesome! Shall we create an issue and link the spreadsheet from there?

Andrei

Jan 11 2014

"Brad Anderson" <eco gnuk.net> writes:

On Saturday, 11 January 2014 at 20:46:32 UTC, Andrei Alexandrescu
wrote:
 On 1/10/14 11:50 PM, Brad Anderson wrote:
 The recent discussion got me wondering how Phobos stacked up
 against the C++ Boost String Algorithms library.

 Some background on the design of the Boost library:
 http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/design.html

 TL;DR: It works somewhat like ranges.

 Google Spreadsheet with the comparison: http://goo.gl/Wmotu4

 [snip]

 Awesome! Shall we create an issue and link the spreadsheet from 
 there?

 Andrei

I'll probably just make an issue for each group of problems after
this is done getting feedback.

The big issues appear to boil down to two things: 1) The complete
inability to do replace/erase functions easily and 2) the lack of
Unicode collation support getting in the way of case-insensitive
operations which are correct in every language.

Number 1 is pretty serious for day to day coding. Number 2 would
just fill a hole in our otherwise excellent Unicode support
(something Boost doesn't even truly have, instead using locales
and character sets). In the meantime, for English and a few other
languages what we have already can be used to perform
case-insensitive operations.

Jan 11 2014
