digitalmars.D.learn - A better way to write this function? (style question)
- Thomas Gann (23/23) Dec 30 2013 I've written a Markov bot in D, and I have function whose job it
- John Colvin (18/41) Dec 30 2013 A few points:
- John Colvin (2/48) Dec 30 2013 sorry, ignore that attempt, it's woefully broken...
- Brad Anderson (5/60) Dec 30 2013 Re: weird literal syntax, you didn't happen to be using dpaste to
- John Colvin (25/88) Dec 30 2013 oohhh so that was what that was. Anyway, here's what I have so
- Jacob Carlborg (5/8) Dec 31 2013 There's a "byGrapheme" and a "byCodePoint" function in std.uni. It was
- Brad Anderson (17/40) Dec 30 2013 Not a huge improvement and it could be a lot faster, no doubt:
- bearophile (5/10) Dec 30 2013 Take a look at the ways I have used here:
- Thomas Gann (19/19) Dec 30 2013 Thanks for all your replies, guys! I have done some further
I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question: string[] split_sentence(string input) { string line; foreach(c; input) { if(c == '\n' || c == '\r') line ~= ' '; else line ~= c.toLower(); } return line.splitter(' ').filter!(a => a.length).array; } Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?
Dec 30 2013
On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question: string[] split_sentence(string input) { string line; foreach(c; input) { if(c == '\n' || c == '\r') line ~= ' '; else line ~= c.toLower(); } return line.splitter(' ').filter!(a => a.length).array; } Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?A few points: by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning. using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line' In my opinion the whole API would be better as range-based: auto splitSentence(R)(R input) if(isInputRange!R) { return input .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower) .splitter!(' ') .filter!(a => !(a.empty)); }
Dec 30 2013
On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:sorry, ignore that attempt, it's woefully broken...I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question: string[] split_sentence(string input) { string line; foreach(c; input) { if(c == '\n' || c == '\r') line ~= ' '; else line ~= c.toLower(); } return line.splitter(' ').filter!(a => a.length).array; } Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?A few points: by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning. using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line' In my opinion the whole API would be better as range-based: auto splitSentence(R)(R input) if(isInputRange!R) { return input .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower) .splitter!(' ') .filter!(a => !(a.empty)); }
Dec 30 2013
On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:Re: weird literal syntax, you didn't happen to be using dpaste to test and have trouble with character literals, did you? Because I did and thought I was going insane until realized DPaste was broken.On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:sorry, ignore that attempt, it's woefully broken...I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question: string[] split_sentence(string input) { string line; foreach(c; input) { if(c == '\n' || c == '\r') line ~= ' '; else line ~= c.toLower(); } return line.splitter(' ').filter!(a => a.length).array; } Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?A few points: by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning. using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line' In my opinion the whole API would be better as range-based: auto splitSentence(R)(R input) if(isInputRange!R) { return input .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower) .splitter!(' ') .filter!(a => !(a.empty)); }
Dec 30 2013
On Monday, 30 December 2013 at 22:38:43 UTC, Brad Anderson wrote:On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:oohhh so that was what that was. Anyway, here's what I have so far: import std.range : isInputRange; auto splitSentence(R)(R input) if(isInputRange!R) { import std.algorithm : map, splitter, filter; import std.uni : toLower; import std.range : ElementType; dchar preProc(ElementType!R c) { return (c == '\n' || c == '\r') ? ' ' : c.toLower; } return input .map!preProc .splitter!(c => c == ' ') .filter!(a => !(a.empty)); } I have to have the function instead of a lamda in order to get the return type correct. I guess I could cast or use std.conv.to instead. Anyway, the big problem I've hit is that AFAICT std.algorithm makes a complete mess of unicode and i can't find a byCodeUnit range anywhere in order to make it correct.On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:Re: weird literal syntax, you didn't happen to be using dpaste to test and have trouble with character literals, did you? Because I did and thought I was going insane until realized DPaste was broken.On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:sorry, ignore that attempt, it's woefully broken...I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question: string[] split_sentence(string input) { string line; foreach(c; input) { if(c == '\n' || c == '\r') line ~= ' '; else line ~= c.toLower(); } return line.splitter(' ').filter!(a => a.length).array; } Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?A few points: by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning. using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line' In my opinion the whole API would be better as range-based: auto splitSentence(R)(R input) if(isInputRange!R) { return input .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower) .splitter!(' ') .filter!(a => !(a.empty)); }
Dec 30 2013
On 2013-12-30 23:51, John Colvin wrote:Anyway, the big problem I've hit is that AFAICT std.algorithm makes a complete mess of unicode and i can't find a byCodeUnit range anywhere in order to make it correct.There's a "byGrapheme" and a "byCodePoint" function in std.uni. It was recently added. -- /Jacob Carlborg
Dec 31 2013
On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace. Here is the function is question: string[] split_sentence(string input) { string line; foreach(c; input) { if(c == '\n' || c == '\r') line ~= ' '; else line ~= c.toLower(); } return line.splitter(' ').filter!(a => a.length).array; } Obviously, one issue is that because the string is immutable, I can't modify it directly, and so I actually build an entirely new string in place. I would have just made a mutable duplicate of the input and modify that, but then I would get errors returning, because it expects string[] and not char[][]. Is there a more elegant way to do what I'm doing?Not a huge improvement and it could be a lot faster, no doubt: string[] split_sentence(string input) { import std.regex; auto split_re = ctRegex!(r"[\n\r ]"); return input.split(split_re) .map!toLower .filter!(a => !a.empty) .array(); } You could return the result of filter instead of an array to make it lazy and avoid an allocation if the caller is able to use it in that form. toLower had some really slow and incorrect behavior but I think that was fixed in 2.064. It still might be ASCII only though. I believe std.uni has a toLower that is correct for all of unicode.
Dec 30 2013
Thomas Gann:I've written a Markov bot in D, and I have function whose job it is to take an input string, convert all newline characters to spaces and all uppercase letters to lowercase, and then return an array of words that are generated by splitting the string up by whitespace.Take a look at the ways I have used here: http://forum.dlang.org/thread/zdhfpftodxnvbpwvklcv forum.dlang.org Bye, bearophile
Dec 30 2013
Thanks for all your replies, guys! I have done some further research in the meantime and I have found out that I am, in fact, an idiot. There is actually a standard library function that does exactly what I am trying to do! As it turns out, std.string.split(): 1) It automatically discards empty tokens, so there is no need to filter it manually. 2) Its definition of whitespace includes spaces, tabs, and both types of newlines, so there is no need to convert newlines to spaces. An equivalent, but much less verbose, form of the function becomes the following: input.toLower.split; So I'm noticing a common theme here, though, which is that lots of operations on arrays is bad? Should I be using ranges pretty much everywhere, then? Or just in performance-critical areas of code where I want to avoid a lot of copying? Would that just leave plain vanilla dynamic arrays to the task of (relatively) permanent storage?
Dec 30 2013