www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - A better way to write this function? (style question)

reply "Thomas Gann" <thomasleegann gmail.com> writes:
I've written a Markov bot in D, and I have function whose job it 
is to take an input string, convert all newline characters to 
spaces and all uppercase letters to lowercase, and then return an 
array of words that are generated by splitting the string up by 
whitespace. Here is the function is question:

string[] split_sentence(string input)
{
     string line;

     foreach(c; input)
     {
         if(c == '\n' || c == '\r')
             line ~= ' ';

         else
             line ~= c.toLower();
     }

     return line.splitter(' ').filter!(a => a.length).array;
}

Obviously, one issue is that because the string is immutable, I 
can't modify it directly, and so I actually build an entirely new 
string in place. I would have just made a mutable duplicate of 
the input and modify that, but then I would get errors returning, 
because it expects string[] and not char[][]. Is there a more 
elegant way to do what I'm doing?
Dec 30 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
 I've written a Markov bot in D, and I have function whose job 
 it is to take an input string, convert all newline characters 
 to spaces and all uppercase letters to lowercase, and then 
 return an array of words that are generated by splitting the 
 string up by whitespace. Here is the function is question:

 string[] split_sentence(string input)
 {
     string line;

     foreach(c; input)
     {
         if(c == '\n' || c == '\r')
             line ~= ' ';

         else
             line ~= c.toLower();
     }

     return line.splitter(' ').filter!(a => a.length).array;
 }

 Obviously, one issue is that because the string is immutable, I 
 can't modify it directly, and so I actually build an entirely 
 new string in place. I would have just made a mutable duplicate 
 of the input and modify that, but then I would get errors 
 returning, because it expects string[] and not char[][]. Is 
 there a more elegant way to do what I'm doing?
A few points: by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning. using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line' In my opinion the whole API would be better as range-based: auto splitSentence(R)(R input) if(isInputRange!R) { return input .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower) .splitter!(' ') .filter!(a => !(a.empty)); }
Dec 30 2013
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
 On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
 I've written a Markov bot in D, and I have function whose job 
 it is to take an input string, convert all newline characters 
 to spaces and all uppercase letters to lowercase, and then 
 return an array of words that are generated by splitting the 
 string up by whitespace. Here is the function is question:

 string[] split_sentence(string input)
 {
    string line;

    foreach(c; input)
    {
        if(c == '\n' || c == '\r')
            line ~= ' ';

        else
            line ~= c.toLower();
    }

    return line.splitter(' ').filter!(a => a.length).array;
 }

 Obviously, one issue is that because the string is immutable, 
 I can't modify it directly, and so I actually build an 
 entirely new string in place. I would have just made a mutable 
 duplicate of the input and modify that, but then I would get 
 errors returning, because it expects string[] and not 
 char[][]. Is there a more elegant way to do what I'm doing?
A few points: by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning. using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line' In my opinion the whole API would be better as range-based: auto splitSentence(R)(R input) if(isInputRange!R) { return input .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower) .splitter!(' ') .filter!(a => !(a.empty)); }
sorry, ignore that attempt, it's woefully broken...
Dec 30 2013
parent reply "Brad Anderson" <eco gnuk.net> writes:
On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:
 On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
 On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
 I've written a Markov bot in D, and I have function whose job 
 it is to take an input string, convert all newline characters 
 to spaces and all uppercase letters to lowercase, and then 
 return an array of words that are generated by splitting the 
 string up by whitespace. Here is the function is question:

 string[] split_sentence(string input)
 {
   string line;

   foreach(c; input)
   {
       if(c == '\n' || c == '\r')
           line ~= ' ';

       else
           line ~= c.toLower();
   }

   return line.splitter(' ').filter!(a => a.length).array;
 }

 Obviously, one issue is that because the string is immutable, 
 I can't modify it directly, and so I actually build an 
 entirely new string in place. I would have just made a 
 mutable duplicate of the input and modify that, but then I 
 would get errors returning, because it expects string[] and 
 not char[][]. Is there a more elegant way to do what I'm 
 doing?
A few points: by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning. using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line' In my opinion the whole API would be better as range-based: auto splitSentence(R)(R input) if(isInputRange!R) { return input .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower) .splitter!(' ') .filter!(a => !(a.empty)); }
sorry, ignore that attempt, it's woefully broken...
Re: weird literal syntax, you didn't happen to be using dpaste to test and have trouble with character literals, did you? Because I did and thought I was going insane until realized DPaste was broken.
Dec 30 2013
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 30 December 2013 at 22:38:43 UTC, Brad Anderson wrote:
 On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:
 On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
 On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann 
 wrote:
 I've written a Markov bot in D, and I have function whose 
 job it is to take an input string, convert all newline 
 characters to spaces and all uppercase letters to lowercase, 
 and then return an array of words that are generated by 
 splitting the string up by whitespace. Here is the function 
 is question:

 string[] split_sentence(string input)
 {
  string line;

  foreach(c; input)
  {
      if(c == '\n' || c == '\r')
          line ~= ' ';

      else
          line ~= c.toLower();
  }

  return line.splitter(' ').filter!(a => a.length).array;
 }

 Obviously, one issue is that because the string is 
 immutable, I can't modify it directly, and so I actually 
 build an entirely new string in place. I would have just 
 made a mutable duplicate of the input and modify that, but 
 then I would get errors returning, because it expects 
 string[] and not char[][]. Is there a more elegant way to do 
 what I'm doing?
A few points: by declaring a new string and appending to it you are risking a lot of allocations. Either use std.array.appender or allocate the array with the correct size at the beginning. using .array on the end of the ufcs chain is yet another allocation. It can be avoided using std.algorithm.copy to copy the result back in to 'line' In my opinion the whole API would be better as range-based: auto splitSentence(R)(R input) if(isInputRange!R) { return input .map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' : c.toLower) .splitter!(' ') .filter!(a => !(a.empty)); }
sorry, ignore that attempt, it's woefully broken...
Re: weird literal syntax, you didn't happen to be using dpaste to test and have trouble with character literals, did you? Because I did and thought I was going insane until realized DPaste was broken.
oohhh so that was what that was. Anyway, here's what I have so far: import std.range : isInputRange; auto splitSentence(R)(R input) if(isInputRange!R) { import std.algorithm : map, splitter, filter; import std.uni : toLower; import std.range : ElementType; dchar preProc(ElementType!R c) { return (c == '\n' || c == '\r') ? ' ' : c.toLower; } return input .map!preProc .splitter!(c => c == ' ') .filter!(a => !(a.empty)); } I have to have the function instead of a lamda in order to get the return type correct. I guess I could cast or use std.conv.to instead. Anyway, the big problem I've hit is that AFAICT std.algorithm makes a complete mess of unicode and i can't find a byCodeUnit range anywhere in order to make it correct.
Dec 30 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-12-30 23:51, John Colvin wrote:

 Anyway, the big problem I've hit is that AFAICT std.algorithm makes a
 complete mess of unicode and i can't find a byCodeUnit range anywhere in
 order to make it correct.
There's a "byGrapheme" and a "byCodePoint" function in std.uni. It was recently added. -- /Jacob Carlborg
Dec 31 2013
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann wrote:
 I've written a Markov bot in D, and I have function whose job 
 it is to take an input string, convert all newline characters 
 to spaces and all uppercase letters to lowercase, and then 
 return an array of words that are generated by splitting the 
 string up by whitespace. Here is the function is question:

 string[] split_sentence(string input)
 {
     string line;

     foreach(c; input)
     {
         if(c == '\n' || c == '\r')
             line ~= ' ';

         else
             line ~= c.toLower();
     }

     return line.splitter(' ').filter!(a => a.length).array;
 }

 Obviously, one issue is that because the string is immutable, I 
 can't modify it directly, and so I actually build an entirely 
 new string in place. I would have just made a mutable duplicate 
 of the input and modify that, but then I would get errors 
 returning, because it expects string[] and not char[][]. Is 
 there a more elegant way to do what I'm doing?
Not a huge improvement and it could be a lot faster, no doubt: string[] split_sentence(string input) { import std.regex; auto split_re = ctRegex!(r"[\n\r ]"); return input.split(split_re) .map!toLower .filter!(a => !a.empty) .array(); } You could return the result of filter instead of an array to make it lazy and avoid an allocation if the caller is able to use it in that form. toLower had some really slow and incorrect behavior but I think that was fixed in 2.064. It still might be ASCII only though. I believe std.uni has a toLower that is correct for all of unicode.
Dec 30 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Thomas Gann:

 I've written a Markov bot in D, and I have function whose job 
 it is to take an input string, convert all newline characters 
 to spaces and all uppercase letters to lowercase, and then 
 return an array of words that are generated by splitting the 
 string up by whitespace.
Take a look at the ways I have used here: http://forum.dlang.org/thread/zdhfpftodxnvbpwvklcv forum.dlang.org Bye, bearophile
Dec 30 2013
prev sibling parent "Thomas Gann" <thomasleegann gmail.com> writes:
Thanks for all your replies, guys! I have done some further 
research in the meantime and I have found out that I am, in fact, 
an idiot. There is actually a standard library function that does 
exactly what I am trying to do! As it turns out, 
std.string.split():

1) It automatically discards empty tokens, so there is no need to 
filter it manually.
2) Its definition of whitespace includes spaces, tabs, and both 
types of newlines, so there is no need to convert newlines to 
spaces.

An equivalent, but much less verbose, form of the function 
becomes the following:

     input.toLower.split;

So I'm noticing a common theme here, though, which is that lots 
of operations on arrays is bad? Should I be using ranges pretty 
much everywhere, then? Or just in performance-critical areas of 
code where I want to avoid a lot of copying? Would that just 
leave plain vanilla dynamic arrays to the task of (relatively) 
permanent storage?
Dec 30 2013