
digitalmars.D.learn - phobos and splitting things... but not with whitespace.

reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
http://dlang.org/phobos/std_array.html#splitter

The first thing I don't understand is why splitter is in /std.array/ and 
yet only works on /strings/.  It is defined in terms of whitespace, and 
I don't understand how whitespace is well-defined for things besides 
text.  Why wouldn't it be in std.string?

That said, I'd like to split on something that isn't whitespace.  So 
where's "auto splitter(C)(C[] s, C[] delim)"??  Is there a hole in 
functionality?

The next thing I want to do is split on whitespace, but only once, and 
recover the tail.  I want to write this function:

string snip(string text)
{
	string head, tail;
	head = getHead(text, "// -- snip --", tail);
	return tail;
}

I would expect these functions to exist:
auto getHead(C)(C[] s, C[] delim, ref C[] tail);
auto getHead(C)(C[] s, C[] delim);
auto getTail(C)(C[] s, C[] delim);

Maybe even this, though it could be a bit redundant:
auto getTail(C)(C[] s, C[] delim, ref C[] head);

Do these exist in phobos?  Otherwise, is it a hole in the functionality 
or some kind of intentional design minimalism?
Jun 23 2012
next sibling parent reply simendsjo <simendsjo gmail.com> writes:
On Sat, 23 Jun 2012 17:19:59 +0200, Chad J  
<chadjoan __spam.is.bad__gmail.com> wrote:
 http://dlang.org/phobos/std_array.html#splitter

 The first thing I don't understand is why splitter is in /std.array/ and
 yet only works on /strings/.  It is defined in terms of whitespace,
 and I don't understand how whitespace is well-defined for things besides
 text.  Why wouldn't it be in std.string?

See http://dlang.org/phobos/std_algorithm.html#splitter
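
For reference, a minimal sketch of what that page describes: the splitter in std.algorithm takes the separator as an argument, so it is not tied to whitespace or even to text. The inputs below are purely illustrative.

```d
import std.algorithm : equal, splitter;

void main()
{
    // Split on a single-character delimiter.
    assert("a,b,c".splitter(",").equal(["a", "b", "c"]));

    // Split on a multi-character delimiter.
    assert("1--2--3".splitter("--").equal(["1", "2", "3"]));

    // Works on non-text arrays too: split an int[] on the element 0.
    assert([1, 2, 0, 3].splitter(0).equal([[1, 2], [3]]));
}
```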
 I would expect these functions to exist:
 auto getHead(C)(C[] s, C[] delim, ref C[] tail);
 auto getHead(C)(C[] s, C[] delim);
 auto getTail(C)(C[] s, C[] delim);

As head is simply splitter(..)[0] and tail splitter(...)[1..$], extra
functions could be implemented much like this:

    import std.array; // provides front for arrays

    @property T head(T)(T[] arr) { return arr.front; }
    @property T[] tail(T)(T[] arr) { return arr[1..$]; }

..and UFCS takes care of the rest:

    auto fields = splitter(...);
    auto head = fields.head;
    auto tail = fields.tail;
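
One caveat worth sketching, assuming current Phobos behaviour: the splitter in std.algorithm returns a lazy range rather than an array, so indexing it with [0] or [1..$] may not compile. Using front together with dropOne from std.range gives the same head/tail split without building an array first. The input string here is only an example.

```d
import std.algorithm : equal, splitter;
import std.range : dropOne;

void main()
{
    auto fields = "a b c".splitter(" ");

    // Head: the first field of the lazy range.
    auto head = fields.front;

    // Tail: the same range with its first field skipped (still lazy).
    auto tail = fields.dropOne;

    assert(head == "a");
    assert(tail.equal(["b", "c"]));
}
```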
Jun 23 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 06/23/2012 11:31 AM, simendsjo wrote:
 On Sat, 23 Jun 2012 17:19:59 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:
 http://dlang.org/phobos/std_array.html#splitter

 The first thing I don't understand is why splitter is in /std.array/
 and yet only works on /strings/. It is defined in terms of
 whitespace, and I don't understand how whitespace is well-defined for
 things besides text. Why wouldn't it be in std.string?

See http://dlang.org/phobos/std_algorithm.html#splitter
 I would expect these functions to exist:
 auto getHead(C)(C[] s, C[] delim, ref C[] tail);
 auto getHead(C)(C[] s, C[] delim);
 auto getTail(C)(C[] s, C[] delim);

As head is simply splitter(..)[0] and tail splitter(...)[1..$], extra
functions could be implemented much like this:

    import std.array; // provides front for arrays

    @property T head(T)(T[] arr) { return arr.front; }
    @property T[] tail(T)(T[] arr) { return arr[1..$]; }

..and UFCS takes care of the rest:

    auto fields = splitter(...);
    auto head = fields.head;
    auto tail = fields.tail;

But I don't want tail as an array. Assume that arr is HUGE and scanning the rest of it is a bad idea. join(arr[1..$]) then becomes a slow operation: O(n) when I could have O(1).
Jun 23 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 06/23/2012 11:44 AM, simendsjo wrote:
 On Sat, 23 Jun 2012 17:39:55 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:

 On 06/23/2012 11:31 AM, simendsjo wrote:
 On Sat, 23 Jun 2012 17:19:59 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:
 http://dlang.org/phobos/std_array.html#splitter

 The first thing I don't understand is why splitter is in /std.array/
 and yet only works on /strings/. It is defined in terms of
 whitespace, and I don't understand how whitespace is well-defined for
 things besides text. Why wouldn't it be in std.string?

See http://dlang.org/phobos/std_algorithm.html#splitter
 I would expect these functions to exist:
 auto getHead(C)(C[] s, C[] delim, ref C[] tail);
 auto getHead(C)(C[] s, C[] delim);
 auto getTail(C)(C[] s, C[] delim);

As head is simply splitter(..)[0] and tail splitter(...)[1..$], extra
functions could be implemented much like this:

    import std.array; // provides front for arrays

    @property T head(T)(T[] arr) { return arr.front; }
    @property T[] tail(T)(T[] arr) { return arr[1..$]; }

..and UFCS takes care of the rest:

    auto fields = splitter(...);
    auto head = fields.head;
    auto tail = fields.tail;

But I don't want tail as an array. Assume that arr is HUGE and scanning the rest of it is a bad idea. join(arr[1..$]) then becomes a slow operation: O(n) when I could have O(1).

Looking for findSplit? http://dlang.org/phobos/std_algorithm.html#findSplit

Cool, that's what I want! Now if I could find the elegant way to remove exactly one line from the text without scanning the text after it...
Jun 23 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 06/23/2012 01:02 PM, simendsjo wrote:
 On Sat, 23 Jun 2012 18:56:24 +0200, simendsjo <simendsjo gmail.com> wrote:

 On Sat, 23 Jun 2012 18:50:05 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:

 Looking for findSplit?
 http://dlang.org/phobos/std_algorithm.html#findSplit
 Cool, that's what I want!
 Now if I could find the elegant way to remove exactly one line from
 the text without scanning the text after it...

Isn't that exactly what findSplit does? It doesn't have to search the rest of the string after the match, it just returns a slice of the rest of the array (I guess - haven't read the code)

import std.stdio, std.algorithm;

void main()
{
    auto text = "1\n2\n3\n4";
    auto res = text.findSplit("\n");

    auto pre = res[0];
    assert(pre.ptr == text.ptr); // no copy for pre match

    auto match = res[1];
    assert(match.ptr == &text[1]); // no copy for needle

    auto post = res[2];
    assert(post.ptr == &text[2]); // no copy for post match
    assert(post.length == 5);
}

Close... the reason findSplit doesn't work is because a new line could be "\n" or it could be "\r\n" or it could be "\r".
Jun 23 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 06/23/2012 01:24 PM, Chad J wrote:
 On 06/23/2012 01:02 PM, simendsjo wrote:
 On Sat, 23 Jun 2012 18:56:24 +0200, simendsjo <simendsjo gmail.com>
 wrote:

 On Sat, 23 Jun 2012 18:50:05 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:

 Looking for findSplit?
 http://dlang.org/phobos/std_algorithm.html#findSplit
 Cool, that's what I want!
 Now if I could find the elegant way to remove exactly one line from
 the text without scanning the text after it...

Isn't that exactly what findSplit does? It doesn't have to search the rest of the string after the match, it just returns a slice of the rest of the array (I guess - haven't read the code)

import std.stdio, std.algorithm;

void main()
{
    auto text = "1\n2\n3\n4";
    auto res = text.findSplit("\n");

    auto pre = res[0];
    assert(pre.ptr == text.ptr); // no copy for pre match

    auto match = res[1];
    assert(match.ptr == &text[1]); // no copy for needle

    auto post = res[2];
    assert(post.ptr == &text[2]); // no copy for post match
    assert(post.length == 5);
}

Close... the reason findSplit doesn't work is because a new line could be "\n" or it could be "\r\n" or it could be "\r".

As an additional note: I could probably do this easily if I had a
function like findSplit where the predicate is used /instead/ of a
delimiter. So like this:

    auto findSplit(alias pred = "a", R)(R haystack);
    ...
    auto tuple = findSplit!(`a == "\n" || a == "\r\n" || a == "\r"`)(text);
    return tuple[2];
Jun 23 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 06/23/2012 02:17 PM, simendsjo wrote:
 On Sat, 23 Jun 2012 19:52:32 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:

 As an additional note: I could probably do this easily if I had a
 function like findSplit where the predicate is used /instead/ of a
 delimiter. So like this:
 auto findSplit(alias pred = "a", R)(R haystack);
 ...
 auto tuple = findSplit!(`a == "\n" || a == "\r\n" || a == "\r"`)(text);
 return tuple[2];

I don't think it can match on ranges, but it's pretty trivial to
implement something that would work for your case:

import std.array, std.algorithm, std.typecons;

auto newlineSplit(string data)
{
    auto rest = data.findAmong("\r\n");
    if(!rest.empty) { // found
        auto pre = data[0..data.length-rest.length];
        string match;
        if(rest.front == '\r' && (rest.length > 1 && rest[1] == '\n')) {
            // \r\n
            match = rest[0..2];
            rest = rest[2..$];
        } else {
            // \r or \n
            match = rest[0..1];
            rest = rest[1..$];
        }
        return tuple(pre, match, rest);
    } else {
        return tuple(data, "", "");
    }
}

unittest
{
    auto text = "1\n2\r\n3\r4";

    auto res = text.newlineSplit();
    assert(res[0] == "1");
    assert(res[1] == "\n");
    assert(res[2] == "2\r\n3\r4");

    res = res[2].newlineSplit();
    assert(res[0] == "2");
    assert(res[1] == "\r\n");
    assert(res[2] == "3\r4");

    res = res[2].newlineSplit();
    assert(res[0] == "3");
    assert(res[1] == "\r");
    assert(res[2] == "4");

    res = res[2].newlineSplit();
    assert(res[0] == "4");
    assert(res[1] == "");
    assert(res[2] == "");
}

Hey, thanks for doing all of that. I didn't expect you to write all of
that.

Once I've established that the issue isn't just a lack of learning on
my part, my subsequent objective is filling any missing functionality
in phobos. IMO the "take away a single line" thing should be
accomplishable with a single concise expression. Then there should be
a function in std.string that contains that single expression and
wraps it in easy-to-find documentation. This kind of thing is a fairly
common operation. Otherwise, I find it odd that there is a function to
split up an arbitrary number of lines but no function to split off
only one!

Also, any function that works with whitespace should have
versions/variants that work with arbitrary delimiters, unless it is
impossible to generalize it that way for some reason. If the variants
are found in a separate module, then the documentation should
reference them.
Jun 23 2012
next sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 06/23/2012 02:53 PM, simendsjo wrote:
 On Sat, 23 Jun 2012 20:41:29 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:

 Hey, thanks for doing all of that. I didn't expect you to write all of
 that.

 Once I've established that the issue isn't just a lack of learning on
 my part, my subsequent objective is filling any missing functionality
 in phobos. IMO the "take away a single line" thing should be
 accomplishable with a single concise expression. Then there should be
 a function in std.string that contains that single expression and
 wraps it in easy-to-find documentation. This kind of thing is a fairly
 common operation. Otherwise, I find it odd that there is a function to
 split up an arbitrary number of lines but no function to split off
 only one!
 Also, any function that works with whitespace should have
 versions/variants that work with arbitrary delimiters, unless it
 is impossible to generalize it that way for some reason. If the
 variants are found in a separate module, then the documentation should
 reference them.

The problem here is that there isn't a version of findSplit taking only
a predicate and not a needle. If it had an overload just taking a
function, you could have solved it by writing:

    auto res = myText.findSplit!(a => a.startsWith("\r\n", "\n", "\r"));

True, although I'm a bigger fan of the compile-time alias predicate because of its superior inline-ability. ;)
Jun 23 2012
prev sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 06/23/2012 03:41 PM, simendsjo wrote:
 On Sat, 23 Jun 2012 20:41:29 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:

 IMO the "take away a single line" thing should be accomplishable with
 a single concise expression

This takes a range to match against, so much like startsWith:

import std.algorithm, std.array, std.typecons;

auto findSplitAny(Range, Ranges...)(Range data, Ranges matches)
{
    auto rest = data;
    for(; !rest.empty; rest.popFront()) {
        foreach(match; matches) {
            if(rest.startsWith(match)) {
                auto restStart = data.length-rest.length;
                auto pre = data[0..restStart];
                // we'll fetch it from the data instead of using the supplied
                // match to be consistent with findSplit
                auto dataMatch = data[restStart..restStart+match.length];
                auto post = rest[match.length..$];
                return tuple(pre, dataMatch, post);
            }
        }
    }
    return tuple(data, Range.init, Range.init);
}

unittest
{
    auto text = "1\n2\r\n3\r4";

    auto res = text.findSplitAny("\r\n", "\n", "\r");
    assert(res[0] == "1");
    assert(res[1] == "\n");
    assert(res[2] == "2\r\n3\r4");

    res = res[2].findSplitAny("\r\n", "\n", "\r");
    assert(res[0] == "2");
    assert(res[1] == "\r\n");
    assert(res[2] == "3\r4");

    res = res[2].findSplitAny("\r\n", "\n", "\r");
    assert(res[0] == "3");
    assert(res[1] == "\r");
    assert(res[2] == "4");

    res = res[2].findSplitAny("\r\n", "\n", "\r");
    assert(res[0] == "4");
    assert(res[1] == "");
    assert(res[2] == "");
}

I, for one, would like to see that in phobos... Although it should probably be called findSplitAmong to be consistent with findAmong ;)
Jun 23 2012
prev sibling next sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
I'm realizing that if I want to remove exactly one line from a string of 
text and make no assumptions about the type of newline ("\n" or "\r\n" 
or "\r") and without scanning the rest of the text then I'm not sure how 
to do this with a single call to phobos functions.  I'd have to use 
indexOf and do a bunch of twiddling and maybe look ahead a character. 
It seems unusually complicated for such a simple operation.
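
For illustration, a minimal hand-rolled sketch of the "twiddling" described above. The name dropFirstLine is hypothetical and not a Phobos function; it scans only up to the first line terminator, looks ahead one character to treat "\r\n" as a single terminator, and returns the remainder as a slice without touching the rest of the text.

```d
// Hypothetical helper, not part of Phobos.
string dropFirstLine(string text)
{
    foreach (i, char c; text)
    {
        if (c == '\n')
            return text[i + 1 .. $];
        if (c == '\r')
        {
            // Look ahead one character so "\r\n" counts as one terminator.
            immutable skip =
                (i + 1 < text.length && text[i + 1] == '\n') ? 2 : 1;
            return text[i + skip .. $];
        }
    }
    // No terminator found: the whole text was a single line.
    return "";
}

void main()
{
    assert(dropFirstLine("1\n2\r\n3\r4") == "2\r\n3\r4");
    assert(dropFirstLine("2\r\n3\r4") == "3\r4");
    assert(dropFirstLine("3\r4") == "4");
    assert(dropFirstLine("4") == "");
}
```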
Jun 23 2012
prev sibling next sibling parent simendsjo <simendsjo gmail.com> writes:
On Sat, 23 Jun 2012 17:39:55 +0200, Chad J  
<chadjoan __spam.is.bad__gmail.com> wrote:

 On 06/23/2012 11:31 AM, simendsjo wrote:
 On Sat, 23 Jun 2012 17:19:59 +0200, Chad J
 <chadjoan __spam.is.bad__gmail.com> wrote:
 http://dlang.org/phobos/std_array.html#splitter

 The first thing I don't understand is why splitter is in /std.array/
 and yet only works on /strings/. It is defined in terms of
 whitespace, and I don't understand how whitespace is well-defined for
 things besides text. Why wouldn't it be in std.string?

See http://dlang.org/phobos/std_algorithm.html#splitter
 I would expect these functions to exist:
 auto getHead(C)(C[] s, C[] delim, ref C[] tail);
 auto getHead(C)(C[] s, C[] delim);
 auto getTail(C)(C[] s, C[] delim);

As head is simply splitter(..)[0] and tail splitter(...)[1..$], extra
functions could be implemented much like this:

    import std.array; // provides front for arrays

    @property T head(T)(T[] arr) { return arr.front; }
    @property T[] tail(T)(T[] arr) { return arr[1..$]; }

..and UFCS takes care of the rest:

    auto fields = splitter(...);
    auto head = fields.head;
    auto tail = fields.tail;

But I don't want tail as an array. Assume that arr is HUGE and scanning the rest of it is a bad idea. join(arr[1..$]) then becomes a slow operation: O(n) when I could have O(1).

Looking for findSplit? http://dlang.org/phobos/std_algorithm.html#findSplit
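
As a rough sketch of how the snip function from the original question could look with findSplit (the marker string is taken from that post; the test inputs are made up):

```d
import std.algorithm : findSplit;

string snip(string text)
{
    // findSplit yields (before, match, after); each part is a slice of
    // text, so nothing after the marker is scanned or copied.
    auto parts = text.findSplit("// -- snip --");
    return parts[2];
}

void main()
{
    assert(snip("head// -- snip --tail") == "tail");
    // When the marker is absent, the "after" part is empty.
    assert(snip("no marker here") == "");
}
```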
Jun 23 2012
prev sibling next sibling parent simendsjo <simendsjo gmail.com> writes:
On Sat, 23 Jun 2012 18:50:05 +0200, Chad J  
<chadjoan __spam.is.bad__gmail.com> wrote:

 Looking for findSplit?  
 http://dlang.org/phobos/std_algorithm.html#findSplit
  Cool, that's what I want!
  Now if I could find the elegant way to remove exactly one line from the  
 text without scanning the text after it...

Isn't that exactly what findSplit does? It doesn't have to search the rest of the string after the match, it just returns a slice of the rest of the array (I guess - haven't read the code)
Jun 23 2012
prev sibling next sibling parent simendsjo <simendsjo gmail.com> writes:
On Sat, 23 Jun 2012 18:56:24 +0200, simendsjo <simendsjo gmail.com> wrote:

 On Sat, 23 Jun 2012 18:50:05 +0200, Chad J  
 <chadjoan __spam.is.bad__gmail.com> wrote:

 Looking for findSplit?  
 http://dlang.org/phobos/std_algorithm.html#findSplit
  Cool, that's what I want!
  Now if I could find the elegant way to remove exactly one line from  
 the text without scanning the text after it...

Isn't that exactly what findSplit does? It doesn't have to search the rest of the string after the match, it just returns a slice of the rest of the array (I guess - haven't read the code)

import std.stdio, std.algorithm;

void main()
{
    auto text = "1\n2\n3\n4";
    auto res = text.findSplit("\n");

    auto pre = res[0];
    assert(pre.ptr == text.ptr); // no copy for pre match

    auto match = res[1];
    assert(match.ptr == &text[1]); // no copy for needle

    auto post = res[2];
    assert(post.ptr == &text[2]); // no copy for post match
    assert(post.length == 5);
}
Jun 23 2012
prev sibling next sibling parent simendsjo <simendsjo gmail.com> writes:
On Sat, 23 Jun 2012 19:52:32 +0200, Chad J  
<chadjoan __spam.is.bad__gmail.com> wrote:

 As an additional note: I could probably do this easily if I had a  
 function like findSplit where the predicate is used /instead/ of a  
 delimiter.  So like this:
  auto findSplit(alias pred = "a", R)(R haystack);
 ...
 auto tuple = findSplit!(`a == "\n" || a == "\r\n" || a == "\r"`)(text);
 return tuple[2];

I don't think it can match on ranges, but it's pretty trivial to
implement something that would work for your case:

import std.array, std.algorithm, std.typecons;

auto newlineSplit(string data)
{
    auto rest = data.findAmong("\r\n");
    if(!rest.empty) { // found
        auto pre = data[0..data.length-rest.length];
        string match;
        if(rest.front == '\r' && (rest.length > 1 && rest[1] == '\n')) {
            // \r\n
            match = rest[0..2];
            rest = rest[2..$];
        } else {
            // \r or \n
            match = rest[0..1];
            rest = rest[1..$];
        }
        return tuple(pre, match, rest);
    } else {
        return tuple(data, "", "");
    }
}

unittest
{
    auto text = "1\n2\r\n3\r4";

    auto res = text.newlineSplit();
    assert(res[0] == "1");
    assert(res[1] == "\n");
    assert(res[2] == "2\r\n3\r4");

    res = res[2].newlineSplit();
    assert(res[0] == "2");
    assert(res[1] == "\r\n");
    assert(res[2] == "3\r4");

    res = res[2].newlineSplit();
    assert(res[0] == "3");
    assert(res[1] == "\r");
    assert(res[2] == "4");

    res = res[2].newlineSplit();
    assert(res[0] == "4");
    assert(res[1] == "");
    assert(res[2] == "");
}
Jun 23 2012
prev sibling next sibling parent simendsjo <simendsjo gmail.com> writes:
On Sat, 23 Jun 2012 20:41:29 +0200, Chad J  
<chadjoan __spam.is.bad__gmail.com> wrote:

 Hey, thanks for doing all of that.  I didn't expect you to write all of  
 that.

  Once I've established that the issue isn't just a lack of learning on  
 my part, my subsequent objective is filling any missing functionality in  
 phobos.  IMO the "take away a single line" thing should be  
 accomplishable with a single concise expression.  Then there should be a  
 function in std.string that contains that single expression and wraps it  
 in easy-to-find documentation.  This kind of thing is a fairly common  
 operation.  Otherwise, I find it odd that there is a function to split  
 up an arbitrary number of lines but no function to split off only one!
  Also, any function that works with whitespace should have
 versions/variants that work with arbitrary delimiters, unless it is
 impossible to generalize it that way for some reason.  If the variants
 are found in a separate module, then the documentation should reference
 them.

The problem here is that there isn't a version of findSplit taking only
a predicate and not a needle. If it had an overload just taking a
function, you could have solved it by writing:

    auto res = myText.findSplit!(a => a.startsWith("\r\n", "\n", "\r"));
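
To make the idea concrete, here is a sketch of what such a predicate-only variant might look like. Both findSplitWhen and newlineLen are hypothetical names, not Phobos functions, and the convention that the alias predicate returns the length of a match at the start of its argument (0 for no match) is an assumption made for this example.

```d
import std.typecons : tuple;

// Hypothetical helper, not part of Phobos. matchLen is a compile-time
// alias predicate: it receives the remaining text and returns the length
// of a match at its start, or 0 if there is none.
auto findSplitWhen(alias matchLen)(string haystack)
{
    foreach (i; 0 .. haystack.length)
    {
        immutable n = matchLen(haystack[i .. $]);
        if (n > 0)
            return tuple(haystack[0 .. i],
                         haystack[i .. i + n],
                         haystack[i + n .. $]);
    }
    // Nothing matched: mirror findSplit and return empty match/tail.
    return tuple(haystack, haystack[$ .. $], haystack[$ .. $]);
}

// Example predicate: length of a line terminator at the start of s.
size_t newlineLen(string s)
{
    if (s.length >= 2 && s[0 .. 2] == "\r\n") return 2;
    if (s.length >= 1 && (s[0] == '\n' || s[0] == '\r')) return 1;
    return 0;
}

void main()
{
    auto res = findSplitWhen!newlineLen("1\r\n2\n3");
    assert(res[0] == "1");
    assert(res[1] == "\r\n");
    assert(res[2] == "2\n3");

    auto none = findSplitWhen!newlineLen("no newline");
    assert(none[0] == "no newline");
    assert(none[1] == "" && none[2] == "");
}
```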
Jun 23 2012
prev sibling next sibling parent simendsjo <simendsjo gmail.com> writes:
On Sat, 23 Jun 2012 20:41:29 +0200, Chad J  
<chadjoan __spam.is.bad__gmail.com> wrote:

 IMO the "take away a single line" thing should be accomplishable with a  
 single concise expression

This takes a range to match against, so much like startsWith:

import std.algorithm, std.array, std.typecons;

auto findSplitAny(Range, Ranges...)(Range data, Ranges matches)
{
    auto rest = data;
    for(; !rest.empty; rest.popFront()) {
        foreach(match; matches) {
            if(rest.startsWith(match)) {
                auto restStart = data.length-rest.length;
                auto pre = data[0..restStart];
                // we'll fetch it from the data instead of using the supplied
                // match to be consistent with findSplit
                auto dataMatch = data[restStart..restStart+match.length];
                auto post = rest[match.length..$];
                return tuple(pre, dataMatch, post);
            }
        }
    }
    return tuple(data, Range.init, Range.init);
}

unittest
{
    auto text = "1\n2\r\n3\r4";

    auto res = text.findSplitAny("\r\n", "\n", "\r");
    assert(res[0] == "1");
    assert(res[1] == "\n");
    assert(res[2] == "2\r\n3\r4");

    res = res[2].findSplitAny("\r\n", "\n", "\r");
    assert(res[0] == "2");
    assert(res[1] == "\r\n");
    assert(res[2] == "3\r4");

    res = res[2].findSplitAny("\r\n", "\n", "\r");
    assert(res[0] == "3");
    assert(res[1] == "\r");
    assert(res[2] == "4");

    res = res[2].findSplitAny("\r\n", "\n", "\r");
    assert(res[0] == "4");
    assert(res[1] == "");
    assert(res[2] == "");
}
Jun 23 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
Just found a follow-up post: 
http://dblog.aldacron.net/2012/06/24/my-only-gripes-about-d/
Jun 24 2012
prev sibling parent simendsjo <simendsjo gmail.com> writes:
On Sun, 24 Jun 2012 10:02:07 +0200, Roman D. Boiko <rb d-coding.com> wrote:

 Just found a follow-up post:  
 http://dblog.aldacron.net/2012/06/24/my-only-gripes-about-d/

Just found it myself. RSS for the win :) I can't say I disagree. You have to read through several modules to find what you need: std.string, std.range, std.array, std.algorithm (and, hopefully soon, std.collection)
Jun 24 2012