digitalmars.D.learn - how to count number of letters with std.algorithm.count /

"bioinfornatics" <bioinfornatics fedoraproject.org> writes:

Hi,

I would like to count the occurrences of one or more letters in a string
or a list of strings (string[]) without using a for loop, using
std.algorithm instead so that it is computed efficiently.

If you have:
  string   seq1 = "ACGATCGATCGATCGCGCTAGCTAGCTAG";
  string[] seq2 = ["ACGATCGATCGATCGCGCTAGCTAGCTAG",
                   "ACGATGACGATCGATGCTAGCTAG"];

I tried:

reduce!( (seq) => seq.count("G"), seq.count("C"))(tuple(0LU,0LU),seq1)

and got:
Error: undefined identifier seq, did you mean import std?

Moreover, count seems to require a range, so doing several counts
over one string is not easy.

Thanks for showing me how to do this.

Regards

Nov 16 2012

"bearophile" <bearophileHUGS lycos.com> writes:

bioinfornatics:

> I would like to count the occurrences of one or more letters in a
> string or a list of strings (string[]) without using a for loop,
> using std.algorithm instead so that it is computed efficiently.

If it's speed you're looking for, std.algorithm is usually not
the solution. Basic loops are usually faster. Iterating over two
chars at a time with a 2^16 table is probably fast enough for
most purposes.
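
A minimal sketch of the two-chars-at-a-time idea, assuming ASCII input. The name countByPairs is hypothetical, and the pair table is allocated on the heap only to keep the example short; nothing here has been benchmarked.

```d
import std.stdio : writefln;

/// Hypothetical helper: per-byte counts computed by reading two chars at a time.
size_t[256] countByPairs(const(char)[] seq)
{
    // One counter per ordered pair of bytes: 2^16 entries.
    // Assumption: uint is wide enough; it can overflow on very large inputs.
    auto pairs = new uint[](1 << 16);

    size_t i = 0;
    for (; i + 1 < seq.length; i += 2)
    {
        // High byte is the first char of the pair, low byte the second.
        immutable size_t idx = (cast(size_t) seq[i] << 8) | seq[i + 1];
        ++pairs[idx];
    }

    // Fold the pair counts back into single-letter counts.
    size_t[256] counts;
    foreach (idx, n; pairs)
    {
        counts[idx >> 8] += n;
        counts[idx & 0xFF] += n;
    }

    // An odd-length input leaves one trailing char that belongs to no pair.
    if (i < seq.length)
        ++counts[seq[i]];

    return counts;
}

void main()
{
    auto counts = countByPairs("ACGATCGATCGATCGCGCTAGCTAGCTAG");
    foreach (char letter; "ACGT")
        writefln("%s: %s", letter, counts[letter]);
}
```

The fold over 65536 entries has a fixed cost, so this layout can only pay off on long inputs.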


> If you have:
>   string   seq1 = "ACGATCGATCGATCGCGCTAGCTAGCTAG";
>   string[] seq2 = ["ACGATCGATCGATCGCGCTAGCTAGCTAG",
>                    "ACGATGACGATCGATGCTAGCTAG"];
>
> I tried:
>
> reduce!( (seq) => seq.count("G"), seq.count("C"))(tuple(0LU,0LU),seq1)

This code is a mess. You are not creating a tuple, you are
scanning the string more than once, and such counting is not a
valid reduction function. So you have to start over and think
more carefully about what you want to do, why, and how.

What do you mean by counting those in a list of strings? Do you
want the total or an array/range with the partials?
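
The two readings of the question can be sketched as follows, assuming a single letter is counted. partialCounts and totalCount are hypothetical names; count, map, sum, and array are the Phobos functions.

```d
import std.algorithm : count, map, sum;
import std.array : array;
import std.stdio : writeln;

/// Hypothetical helper: one count per input string (the "partials").
size_t[] partialCounts(string[] seqs, char letter)
{
    return seqs.map!(s => s.count(letter)).array;
}

/// Hypothetical helper: a single count over all strings (the "total").
size_t totalCount(string[] seqs, char letter)
{
    return seqs.map!(s => s.count(letter)).sum;
}

void main()
{
    string[] seq2 = ["ACGATCGATCGATCGCGCTAGCTAGCTAG",
                     "ACGATGACGATCGATGCTAGCTAG"];
    writeln(partialCounts(seq2, 'G'));
    writeln(totalCount(seq2, 'G'));
}
```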

Bye,
bearophile

Nov 16 2012

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On 2012-11-16, 16:49, bioinfornatics wrote:

> Hi,
>
> I would like to count the occurrences of one or more letters in a string
> or a list of strings (string[]) without using a for loop, using
> std.algorithm instead so that it is computed efficiently.
>
> If you have:
>   string   seq1 = "ACGATCGATCGATCGCGCTAGCTAGCTAG";
>   string[] seq2 = ["ACGATCGATCGATCGCGCTAGCTAGCTAG",
>                    "ACGATGACGATCGATGCTAGCTAG"];
>
> I tried:
>
> reduce!( (seq) => seq.count("G"), seq.count("C"))(tuple(0LU,0LU),seq1)
>
> and got:
> Error: undefined identifier seq, did you mean import std?
>
> Moreover, count seems to require a range, so doing several counts
> over one string is not easy.
>
> Thanks for showing me how to do this.

There are several problems here. First, as the compiler is trying to
tell you, the part after the comma is not a valid delegate. It should
look like this:

     reduce!( (seq) => seq.count("G"), (seq) => seq.count("C"))(tuple(0LU,0LU),seq1)

Now, that's not really all that much closer to the goal. The parameters to
these delegates (seq) are characters, not arrays of characters. Thus,
seq.count does not do what you want.

Next iteration would be:

     reduce!( (seq) => seq == "G", (seq) => seq == "C" )(tuple(0LU, 0LU), seq1)

Not there yet. seq is an element, "G" is a string - an array. One more:

     reduce!( (seq) => seq == 'G', (seq) => seq == 'C' )(tuple(0LU, 0LU), seq1)

Lastly, reduce expects delegates that take two parameters - the current
value of the accumulator, and the value to be considered:

     reduce!( (acc, seq) => acc + (seq == 'G'), (acc, seq) => acc + (seq == 'C') )(tuple(0LU, 0LU), seq1)

There. Now it works, and returns a Tuple!(ulong,ulong)(8, 8).

One thing I think is ugly in my implementation is acc + (seq == 'G'). This
adds a bool and a ulong together. For more points, replace that with
acc + (seq == 'G' ? 1 : 0).
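
Put together as a complete program, the final iteration might look like the sketch below. It uses the ternary form; countGC is a hypothetical wrapper name, and the imports are the ones the expression needs.

```d
import std.algorithm : reduce;
import std.stdio : writeln;
import std.typecons : tuple;

/// Hypothetical helper: returns (number of 'G', number of 'C') in one pass.
auto countGC(string seq)
{
    // Each lambda receives its own accumulator and the current character.
    return reduce!(
        (acc, c) => acc + (c == 'G' ? 1 : 0),
        (acc, c) => acc + (c == 'C' ? 1 : 0)
    )(tuple(0LU, 0LU), seq);
}

void main()
{
    auto gc = countGC("ACGATCGATCGATCGCGCTAGCTAGCTAG");
    writeln("G: ", gc[0], ", C: ", gc[1]);
}
```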

-- 
Simen

Nov 16 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Simen Kjaeraas:

> There. Now it works, and returns a Tuple!(ulong,ulong)(8, 8).
>
> One thing I think is ugly in my implementation is acc + (seq == 'G').
> This adds a bool and a ulong together. For more points, replace that
> with acc + (seq == 'G' ? 1 : 0).

I use a pragmatic approach: I use such higher order functions
when they give me some advantage, like more compact code, or
less-bug-prone code, etc (because they usually don't give me
faster code). In this case a normal loop that increments two
counters is faster and far easier to read and understand :-)
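
A minimal sketch of such a loop, assuming ASCII input. The function name countGC and the size_t[2] return type are illustrative choices only.

```d
import std.stdio : writefln;

/// Hypothetical helper: counts 'G' and 'C' in a single pass.
/// Returns [number of 'G', number of 'C'].
size_t[2] countGC(const(char)[] seq)
{
    size_t g, c;
    foreach (char ch; seq)
    {
        if (ch == 'G')
            ++g;
        else if (ch == 'C')
            ++c;
    }
    return [g, c];
}

void main()
{
    auto gc = countGC("ACGATCGATCGATCGCGCTAGCTAGCTAG");
    writefln("G: %s, C: %s", gc[0], gc[1]);
}
```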

Bye,
bearophile

Nov 16 2012

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On 2012-11-16, 17:37, bearophile wrote:

> Simen Kjaeraas:
>
>> There. Now it works, and returns a Tuple!(ulong,ulong)(8, 8).
>>
>> One thing I think is ugly in my implementation is acc + (seq == 'G').
>> This adds a bool and a ulong together. For more points, replace that
>> with acc + (seq == 'G' ? 1 : 0).
>
> I use a pragmatic approach: I use such higher order functions when they
> give me some advantage, like more compact code, or less-bug-prone code,
> etc (because they usually don't give me faster code). In this case a
> normal loop that increments two counters is faster and far easier to
> read and understand :-)

Absolutely. But after trying to make sense of the mess presented in the
OP, I wanted to document the voyage.

-- 
Simen

Nov 16 2012

"bioinfornatics" <bioinfornatics gmail.com> writes:

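Thanks to all. So, something like this:
-----------------------------------
import std.parallelism : parallel;
import std.typecons : Tuple;

// Returns (count per letter, total length) over all sequences.
Tuple!(size_t[char], size_t) baseCounter( const ref string[] sequences ){
    alias Tuple!(size_t[char], size_t) Result;
    Result result;
    size_t         length      = 0;
    // One table per sequence, so parallel iterations never share a table.
    size_t[char][] baseNumbers = new size_t[char][](sequences.length);
    foreach(index, ref seq; parallel(sequences)){
        foreach(letter; seq){
            if( letter in baseNumbers[index] ) baseNumbers[index][letter]++;
            else baseNumbers[index][letter] = 1;
        }
    }
    // Sum the lengths serially: updating `length` inside the parallel
    // loop would be a data race.
    foreach(seq; sequences) length += seq.length;
    // Merge the per-sequence tables into the result.
    foreach(baseNumber; baseNumbers){
        foreach( key, value; baseNumber ){
            if( key in result[0] ) result[0][key] += value;
            else result[0][key] = value;
        }
    }
    result[1] = length;
    return result;
}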
-----------------------------------

The same code with syntax highlighting: http://pastebin.com/pVEuWwgr
This should be faster, no?
The code is generic and should be able to count the frequency of
any letter.

Nov 17 2012

"bearophile" <bearophileHUGS lycos.com> writes:

bioinfornatics:

> should be faster, no?

If your purpose is to write a fast function to do this, then I
suggest you avoid Phobos in this function, avoid associative
arrays in it, and possibly avoid the heap.
This is not guaranteed to give you faster code, but it's a
start. Then I suggest you start benchmarking, so you will be
able to answer your own question :-)
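
One way to start benchmarking is sketched below. It assumes a recent compiler where benchmark lives in std.datetime.stopwatch (older releases had it in std.datetime). All function names, the sample string, and the iteration count are hypothetical, and no timings are claimed here.

```d
import std.algorithm : count;
import std.datetime.stopwatch : benchmark;
import std.stdio : writefln;

immutable string sample = "ACGATCGATCGATCGCGCTAGCTAGCTAG";
size_t sink; // accumulates results so the calls are not trivially dead code

/// Plain loop over the code units.
size_t loopCount(string seq, char letter)
{
    size_t n;
    foreach (char c; seq)
    {
        if (c == letter)
            ++n;
    }
    return n;
}

/// Same count via std.algorithm.count.
size_t algoCount(string seq, char letter)
{
    return seq.count(letter);
}

void runLoop() { sink += loopCount(sample, 'G'); }
void runAlgo() { sink += algoCount(sample, 'G'); }

void main()
{
    // Runs each function 100_000 times and returns one Duration per function.
    auto times = benchmark!(runLoop, runAlgo)(100_000);
    writefln("loop:  %s", times[0]);
    writefln("count: %s", times[1]);
}
```

A 29-character sample is far too small to say much; realistic input sizes and optimized builds matter more than the harness.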

Bye,
bearophile

Nov 17 2012

Sean Kelly <sean invisibleduck.org> writes:

On Nov 16, 2012, at 7:49 AM, bioinfornatics
<bioinfornatics fedoraproject.org> wrote:

> Hi,
>
> I would like to count the occurrences of one or more letters in a string
> or a list of strings (string[]) without using a for loop, using
> std.algorithm instead so that it is computed efficiently.
>
> If you have:
>   string   seq1 = "ACGATCGATCGATCGCGCTAGCTAGCTAG";
>   string[] seq2 = ["ACGATCGATCGATCGCGCTAGCTAGCTAG",
>                    "ACGATGACGATCGATGCTAGCTAG"];
>
> I tried:
>
> reduce!( (seq) => seq.count("G"), seq.count("C"))(tuple(0LU,0LU),seq1)

D has map and reduce but not MapReduce, so this approach feels a bit
unnatural.  Assuming ASCII characters and a reasonably sized sequence,
here's the simplest approach:

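        import std.algorithm : group, sort;
        import std.stdio : writefln;

        // Sort a mutable copy so equal letters become adjacent,
        // then count the length of each run.
        auto seq1 = cast(byte[])("ACGATCGATCGATCGCGCTAGCTAGCTAG".dup);

        foreach(e; group(sort(seq1))) {
                writefln("%s occurs %s times", cast(char) e[0], e[1]);
        }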

For real code, the correct approach really depends on the number of
discrete values, how dense the set of values is, and the total number of
elements to evaluate.  For English letters the fastest result is likely
an int[26].  For a more diverse set of input, a hash table.  For a huge
input size, something like MapReduce is appropriate.
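
The int[26] variant might be sketched like this, assuming ASCII input. Folding both cases into one table and skipping non-letters are choices made for this example, and letterCounts is a hypothetical name.

```d
import std.stdio : writefln;

/// Hypothetical helper: counts English letters, ignoring case and skipping
/// everything else. Index 0 is 'A'/'a', index 25 is 'Z'/'z'.
int[26] letterCounts(const(char)[] text)
{
    int[26] counts;
    foreach (char c; text)
    {
        if (c >= 'A' && c <= 'Z')
            ++counts[c - 'A'];
        else if (c >= 'a' && c <= 'z')
            ++counts[c - 'a'];
    }
    return counts;
}

void main()
{
    auto counts = letterCounts("ACGATCGATCGATCGCGCTAGCTAGCTAG");
    foreach (i, n; counts)
    {
        if (n > 0)
            writefln("%s occurs %s times", cast(char)('A' + i), n);
    }
}
```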

Nov 23 2012
