
digitalmars.D.announce - Statistics library

reply dsimcha <dsimcha yahoo.com> writes:
Since there's really no good comprehensive statistics library for D (Tango has
a little bit, the beginnings of a few are on dsource, but nothing much), I've
been rolling my own statistics functions as necessary.  Almost by accident, it
seems like I've built up the beginnings of a decent statistics library.  I'm
debating whether it might be interesting enough to people to be worth
releasing, and whether enough community help would be available to really make
it production quality, or to merge it with other people's efforts in this
area.  The following functionality is currently available:

Correlation (Pearson, Spearman rho, Kendall tau).  Note that the Kendall
tau correlation is a very efficient O(N log N) version.

Mean, standard deviation, variance, kurtosis, percent variance for arrays of
numeric values.

Shannon entropy, mutual information.

Kolmogorov-Smirnov tests.

Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs; hypergeometric,
Poisson, binomial PDFs.

Inverse normal distribution, and normally distributed random number generation.

A struct to generate all possible permutations of a sequence.
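
As a rough illustration of how an O(N log N) Kendall tau can work (this is a sketch in Python, not the D code from the library, and the names `count_inversions` and `kendall_tau` are made up for the example): sort the pairs by x, then count inversions in the resulting y sequence with a merge sort. Each inversion is one discordant pair. The sketch assumes there are no ties in either sample; handling ties needs extra bookkeeping that is not shown here.

```python
def count_inversions(values):
    """Return (sorted copy, inversion count) using a merge sort."""
    n = len(values)
    if n <= 1:
        return list(values), 0
    mid = n // 2
    left, inv_left = count_inversions(values[:mid])
    right, inv_right = count_inversions(values[mid:])
    merged = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every remaining element of left,
            # so it forms one inversion with each of them.
            inversions += len(left) - i
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def kendall_tau(x, y):
    """Kendall tau-a for paired samples, assuming no ties in x or y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    # Order the y values by their paired x value.
    y_by_x = [yi for _, yi in sorted(zip(x, y))]
    _, discordant = count_inversions(y_by_x)
    total_pairs = n * (n - 1) // 2
    # With no ties, concordant + discordant == total_pairs.
    return 1.0 - 2.0 * discordant / total_pairs
```

The naive approach compares every pair and is O(N^2); here the sort and the merge-based inversion count are each O(N log N).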


On the other hand, I'm a scientist, not a full-time programmer, and although I
can write working code, I have no clue what it takes to get code up to the
gold standard of "production."  Also, this library is very D2-dependent, and I
have no interest in back-porting it.  Of course if by some chance someone else
wanted to back-port it, they'd be more than welcome.

Most of the code is covered somehow or another by unit tests, although I
cheated a lot by having some unit tests depend on multiple functions.

Is there any interest in this from others in the D community?  Do other people
think that D would benefit from having a decent statistics library?  Other
comments?
Oct 23 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha, I think the struct to generate permutations is out of place there, and
would fit better in a module like my comb (combinatorics) one.

Besides that detail, I like the idea of having a standard module with basic
statistics, so I am interested :-)

Bye,
bearophile
Oct 23 2008
prev sibling next sibling parent reply BCS <ao pathlink.com> writes:
Reply to dsimcha,

 Since there's really no good comprehensive statistics library for D
 (Tango has a little bit, the beginnings of a few are on dsource, but
 nothing much), I've been rolling my own statistics functions as
 necessary.  Almost by accident, it seems like I've built up the
 beginnings of a decent statistics library.  I'm debating whether it
 might be interesting enough to people to be worth releasing, and
 whether enough community help would be available to really make it
 production quality, or to merge it with other people's efforts in this
 area.

Well, for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble, BTW).
Oct 23 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from BCS (ao pathlink.com)'s article
 Reply to dsimcha,
 Since there's really no good comprehensive statistics library for D
 (Tango has a little bit, the beginnings of a few are on dsource, but
 nothing much), I've been rolling my own statistics functions as
 necessary.  Almost by accident, it seems like I've built up the
 beginnings of a decent statistics library.  I'm debating whether it
 might be interesting enough to people to be worth releasing, and
 whether enough community help would be available to really make it
 production quality, or to merge it with other people's efforts in this
 area.

 Well for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble BTW)

Sounds good at least for now. It's only about 1500 lines of code including unittests, comments, etc. I'll put it up on scrapple with a permissive license, and people can make suggestions, and integrate it into Phobos and Tango as they see fit.
Oct 24 2008
parent reply BCS <ao pathlink.com> writes:
Reply to dsimcha,

 == Quote from BCS (ao pathlink.com)'s article
 
 Reply to dsimcha,
 
 Since there's really no good comprehensive statistics library for D
 (Tango has a little bit, the beginnings of a few are on dsource, but
 nothing much), I've been rolling my own statistics functions as
 necessary.  Almost by accident, it seems like I've built up the
 beginnings of a decent statistics library.  I'm debating whether it
 might be interesting enough to people to be worth releasing, and
 whether enough community help would be available to really make it
 production quality, or to merge it with other people's efforts in
 this area.
 

 Well for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble BTW)

 Sounds good at least for now. It's only about 1500 lines of code including unittests, comments, etc. I'll put it up on scrapple with a permissive license, and people can make suggestions, and integrate it into Phobos and Tango as they see fit.

If you don't already have access send me your username and I'll add you.
Oct 24 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from BCS (ao pathlink.com)'s article
 Reply to dsimcha,
 == Quote from BCS (ao pathlink.com)'s article

 Reply to dsimcha,

 Since there's really no good comprehensive statistics library for D
 (Tango has a little bit, the beginnings of a few are on dsource, but
 nothing much), I've been rolling my own statistics functions as
 necessary.  Almost by accident, it seems like I've built up the
 beginnings of a decent statistics library.  I'm debating whether it
 might be interesting enough to people to be worth releasing, and
 whether enough community help would be available to really make it
 production quality, or to merge it with other people's efforts in
 this area.

 Well for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble BTW)

 Sounds good at least for now. It's only about 1500 lines of code including unittests, comments, etc. I'll put it up on scrapple with a permissive license, and people can make suggestions, and integrate it into Phobos and Tango as they see fit.


Username: dsimcha

Yes, I realize that it's best to do things like this off the newsgroup, but your email address doesn't seem to work.
Oct 25 2008
parent BCS <ao pathlink.com> writes:
Reply to dsimcha,

 Yes, I realize that it's best to do things like this off the
 newsgroup, but your email address doesn't seem to work.
 

Sorry. I figure I get enough SPAM as it is. Besides, there are about 2 dozen other ways to get it if you are persistent enough.

Oh, you're in. Have fun. (I really ought to make up some boilerplate like this: http://en.wikipedia.org/wiki/Sudo#Design ;)
Oct 26 2008
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
 Since there's really no good comprehensive statistics library for D (Tango has
 a little bit, the beginnings of a few are on dsource, but nothing much), I've
 been rolling my own statistics functions as necessary.  Almost by accident, it
 seems like I've built up the beginnings of a decent statistics library.  I'm
 debating whether it might be interesting enough to people to be worth
 releasing, and whether enough community help would be available to really make
 it production quality, or to merge it with other people's efforts in this
 area.  The following functionality is currently available:
 
 Correlation (Pearson, Spearman rho, Kendall tau).   Note that the     Kendall
 tau correlation is a very efficient O(N log N) version.
 
 Mean, standard deviation, variance, kurtosis, percent variance for arrays of
 numeric values.
 
 Shannon entropy, mutual information.
 
 Kolmogorov-Smirnov tests
 
 Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric,
 Poisson, binomial PDFs.
 
 Inverse normal distribution, and normally distributed random number generation.
 
 A struct to generate all possible permutations of a sequence.
 
 
 On the other hand, I'm a scientist, not a full-time programmer, and although I
 can write working code, I have no clue what it takes to get code up to the
 gold standard of "production."  Also, this library is very D2-dependent, and I
 have no interest in back-porting it.  Of course if by some chance someone else
 wanted to back-port it, they'd be more than welcome.
 
 Most of the code is covered somehow or another by unit tests, although I
 cheated a lot by having some unit tests depend on multiple functions.
 
 Is there any interest in this from others in the D community?  Do other people
 think that D would benefit from having a decent statistics library?  Other
 comments?

If the community is interested, I'd be glad to take over your code and put it in Phobos.

Andrei
Oct 23 2008
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Andrei,

 dsimcha wrote:
 
 Since there's really no good comprehensive statistics library for D
 (Tango has a little bit, the beginnings of a few are on dsource, but
 nothing much), I've been rolling my own statistics functions as
 necessary.  Almost by accident, it seems like I've built up the
 beginnings of a decent statistics library.  I'm debating whether it
 might be interesting enough to people to be worth releasing, and
 whether enough community help would be available to really make it
 production quality, or to merge it with other people's efforts in
 this area.  The following functionality is currently available:
 
 Correlation (Pearson, Spearman rho, Kendall tau).   Note that the
 Kendall tau correlation is a very efficient O(N log N) version.
 
 Mean, standard deviation, variance, kurtosis, percent variance for
 arrays of numeric values.
 
 Shannon entropy, mutual information.
 
 Kolmogorov-Smirnov tests
 
 Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs,
 hypergeometric, Poisson, binomial PDFs.
 
 Inverse normal distribution, and normally distributed random number
 generation.
 
 A struct to generate all possible permutations of a sequence.
 
 On the other hand, I'm a scientist, not a full-time programmer, and
 although I can write working code, I have no clue what it takes to
 get code up to the gold standard of "production."  Also, this library
 is very D2-dependent, and I have no interest in back-porting it.  Of
 course if by some chance someone else wanted to back-port it, they'd
 be more than welcome.
 
 Most of the code is covered somehow or another by unit tests,
 although I cheated a lot by having some unit tests depend on multiple
 functions.
 
 Is there any interest in this from others in the D community?  Do
 other people think that D would benefit from having a decent
 statistics library?  Other comments?
 

 If the community is interested, I'd be glad to take over your code and put it in Phobos.
 Andrei

Even better would be getting it into both Phobos and Tango. That shouldn't be hard, as I don't think it depends on much.
Oct 23 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from BCS (ao pathlink.com)'s article
 Even better would be getting it in both Phobos and Tango. Shouldn't be hard
 as I can't think it should depend on much.

First, Tango needs to be ported to D2 (I realize that this is happening) or my code needs to be ported to D1. Anyhow, here are the dependencies:

Non-trivial, i.e. in several places:
std.math, std.traits, std.functional, some custom sorting functions I wrote, which could just be included

Trivial, i.e. in only one or two small places, pretty sure Tango has a drop-in replacement:
std.bigint (for factorial, although all functions that actually use a factorial are calculated in log space, and therefore don't depend on this), std.algorithm (for swap, isSorted), std.random
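
To make the "calculated in log space" remark concrete, here is a minimal sketch in Python (not the library's D code; `log_factorial`, `log_binomial_coefficient` and `binomial_pmf` are hypothetical names chosen for the example). The factorials are never formed directly. Only their logarithms are added and subtracted, which is why a big-integer factorial is not needed for the distribution functions.

```python
import math


def log_factorial(n):
    """log(n!) via the log-gamma function, since n! == gamma(n + 1)."""
    return math.lgamma(n + 1)


def log_binomial_coefficient(n, k):
    """log of n-choose-k, computed without forming any factorial."""
    return log_factorial(n) - log_factorial(k) - log_factorial(n - k)


def binomial_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p), evaluated in log space."""
    if not 0 <= k <= n:
        return 0.0
    # Handle the degenerate probabilities separately to avoid log(0).
    if p == 0.0:
        return 1.0 if k == 0 else 0.0
    if p == 1.0:
        return 1.0 if k == n else 0.0
    log_prob = (log_binomial_coefficient(n, k)
                + k * math.log(p)
                + (n - k) * math.log1p(-p))
    return math.exp(log_prob)
```

For example, `binomial_pmf(5000, 10000, 0.5)` stays finite even though 10000! is far outside the range of a double.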
Oct 23 2008
parent Don <nospam nospam.com> writes:
dsimcha wrote:
 == Quote from BCS (ao pathlink.com)'s article
 Even better would be getting it in both Phobos and Tango. Shouldn't be hard
 as I can't think it should depend on much.

 First, Tango needs to be ported to D2 (I realize that this is happening) or my
 code needs to be ported to D1.  Anyhow, here are the dependencies:

 Non-trivial, i.e. in several places:
 std.math, std.traits, std.functional, some custom sorting functions I wrote,
 which could just be included

 Trivial, i.e. in only one or two small places, pretty sure Tango has a drop-in
 replacement
 std.bigint (for factorial, although all functions that actually use a factorial
 are calculated in log space, and therefore don't depend on this), std.algorithm
 (for swap, isSorted), std.random

This is another instance where we need a common namespace. Practically nothing in tango.math.* has dependencies on other parts of Tango, other than on the stuff which is now in Core. All the modules you mention would ideally be in the common namespace.
Oct 30 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 If the community is interested, I'd be glad to take over your code and 
 put it in Phobos.

I'm interested.
Oct 23 2008
prev sibling parent reply Dejan Lekic <dejan.lekic tiscali.co.uk> writes:
If my vote counts - I am all for it. :)
Oct 27 2008
parent dsimcha <dsimcha yahoo.com> writes:
Now that I've figured out how the heck to use SVN, it's up on scrapple.
Everything basically has unittests and has been dogfooded by me, but if there
are any bugs I missed, please file.  I'm actually kind of surprised at the
level of interest in this project.

http://dsource.org/projects/scrapple/browser/trunk/statistics

Just a reminder:  The current version makes pretty heavy use of D2 features, so
it is completely incompatible with D1.  D1 compatibility was not even a
consideration in the design of anything.  I have no intention of backporting to
D1, since D2 is getting pretty close to ready anyhow.

Also, I'm not a lawyer, but the intent of the license I put in the header of the
files is to be very permissive to allow integration into Phobos and Tango of
whatever functionality Andrei and Don see fit.  If you see a problem with the
way I have worded the license, let me know and I'll change it.
Oct 27 2008
prev sibling next sibling parent reply "Bill Baxter" <wbaxter gmail.com> writes:
On Fri, Oct 24, 2008 at 7:43 AM, dsimcha <dsimcha yahoo.com> wrote:
 Since there's really no good comprehensive statistics library for D (Tango has
 a little bit, the beginnings of a few are on dsource, but nothing much), I've
 been rolling my own statistics functions as necessary.  Almost by accident, it
 seems like I've built up the beginnings of a decent statistics library.  I'm
 debating whether it might be interesting enough to people to be worth
 releasing, and whether enough community help would be available to really make
 it production quality, or to merge it with other people's efforts in this
 area.  The following functionality is currently available:

 Correlation (Pearson, Spearman rho, Kendall tau).   Note that the     Kendall
 tau correlation is a very efficient O(N log N) version.

 Mean, standard deviation, variance, kurtosis, percent variance for arrays of
 numeric values.

 Shannon entropy, mutual information.

 Kolmogorov-Smirnov tests

 Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric,
 Poisson, binomial PDFs.

 Inverse normal distribution, and normally distributed random number generation.

 A struct to generate all possible permutations of a sequence.

I don't know what a lot of those things are, but statistics to me means you will probably have (or eventually want) things like covariance which are best represented as matrices. Does your package also have a matrix library?

--bb
Oct 23 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Bill Baxter (wbaxter gmail.com)'s article
 On Fri, Oct 24, 2008 at 7:43 AM, dsimcha <dsimcha yahoo.com> wrote:
 Since there's really no good comprehensive statistics library for D (Tango has
 a little bit, the beginnings of a few are on dsource, but nothing much), I've
 been rolling my own statistics functions as necessary.  Almost by accident, it
 seems like I've built up the beginnings of a decent statistics library.  I'm
 debating whether it might be interesting enough to people to be worth
 releasing, and whether enough community help would be available to really make
 it production quality, or to merge it with other people's efforts in this
 area.  The following functionality is currently available:

 Correlation (Pearson, Spearman rho, Kendall tau).   Note that the     Kendall
 tau correlation is a very efficient O(N log N) version.

 Mean, standard deviation, variance, kurtosis, percent variance for arrays of
 numeric values.

 Shannon entropy, mutual information.

 Kolmogorov-Smirnov tests

 Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric,
 Poisson, binomial PDFs.

 Inverse normal distribution, and normally distributed random number generation.

 A struct to generate all possible permutations of a sequence.

 I don't know what a lot of those things are, but statistics to me means you will probably have (or eventually want) things like covariance which are best represented as matrices. Does your package also have a matrix library?
 --bb

No, it doesn't have a matrix library right now. I make no claim that it is in any way complete right now, but I do think it has some pretty useful stuff that's not likely to be anywhere else for D.
Oct 23 2008
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Bill Baxter (wbaxter gmail.com)'s article
 Ok, so it's mainly for 1d statistics then?
 --bb

Right.
Oct 23 2008
prev sibling next sibling parent "Bill Baxter" <wbaxter gmail.com> writes:
On Fri, Oct 24, 2008 at 9:39 AM, dsimcha <dsimcha yahoo.com> wrote:
 == Quote from Bill Baxter (wbaxter gmail.com)'s article
 On Fri, Oct 24, 2008 at 7:43 AM, dsimcha <dsimcha yahoo.com> wrote:
 Since there's really no good comprehensive statistics library for D (Tango has
 a little bit, the beginnings of a few are on dsource, but nothing much), I've
 been rolling my own statistics functions as necessary.  Almost by accident, it
 seems like I've built up the beginnings of a decent statistics library.  I'm
 debating whether it might be interesting enough to people to be worth
 releasing, and whether enough community help would be available to really make
 it production quality, or to merge it with other people's efforts in this
 area.  The following functionality is currently available:

 Correlation (Pearson, Spearman rho, Kendall tau).   Note that the     Kendall
 tau correlation is a very efficient O(N log N) version.

 Mean, standard deviation, variance, kurtosis, percent variance for arrays of
 numeric values.

 Shannon entropy, mutual information.

 Kolmogorov-Smirnov tests

 Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric,
 Poisson, binomial PDFs.

 Inverse normal distribution, and normally distributed random number generation.

 A struct to generate all possible permutations of a sequence.

 I don't know what a lot of those things are, but statistics to me means you will probably have (or eventually want) things like covariance which are best represented as matrices. Does your package also have a matrix library?
 --bb

No, it doesn't have a matrix library right now. I make no claim that it is in any way complete right now, but I do think it has some pretty useful stuff that's not likely to be anywhere else for D.

Ok, so it's mainly for 1d statistics then?

--bb
Oct 23 2008
prev sibling parent reply Don <nospam nospam.com.au> writes:
dsimcha wrote:
 Since there's really no good comprehensive statistics library for D (Tango has
 a little bit, the beginnings of a few are on dsource, but nothing much), I've
 been rolling my own statistics functions as necessary.  Almost by accident, it
 seems like I've built up the beginnings of a decent statistics library.  I'm
 debating whether it might be interesting enough to people to be worth
 releasing, and whether enough community help would be available to really make
 it production quality, or to merge it with other people's efforts in this
 area.  The following functionality is currently available:

 Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric,
 Poisson, binomial PDFs.  Inverse normal distribution,

Most of these are in Tango (not Kolmogorov). Are yours different in some way?
 A struct to generate all possible permutations of a sequence.

 Correlation (Pearson, Spearman rho, Kendall tau).  Note that the Kendall
 tau correlation is a very efficient O(N log N) version.

 Mean, standard deviation, variance, kurtosis, percent variance for arrays of
 numeric values.

 Shannon entropy, mutual information.

 Kolmogorov-Smirnov tests

Sounds good.
 
 
 On the other hand, I'm a scientist, not a full-time programmer, 

Me too!
 Is there any interest in this from others in the D community?  Do other people
 think that D would benefit from having a decent statistics library?  

Yes. Which is why I put the existing stuff into Tango.
Oct 24 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Don (nospam nospam.com.au)'s article
 Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric,
 Poisson, binomial PDFs.  Inverse normal distribution,

 Most of these are in Tango (not Kolmogorov). Are yours different in some way?

They calculate the exact log factorial using a caching scheme. Not sure how much accuracy this actually buys, though it costs some memory. I should probably change the logFactorial function to a gamma approximation at least for large N. Also, Tango doesn't have hypergeometric.
 A struct to generate all possible permutations of a sequence.

 Correlation (Pearson, Spearman rho, Kendall tau).  Note that the Kendall
 tau correlation is a very efficient O(N log N) version.

 Mean, standard deviation, variance, kurtosis, percent variance for arrays of
 numeric values.

 Shannon entropy, mutual information.

 Kolmogorov-Smirnov tests

 Sounds good.

This is more the part that I thought might be useful.
 On the other hand, I'm a scientist, not a full-time programmer,

 Is there any interest in this from others in the D community?  Do other people
 think that D would benefit from having a decent statistics library?


BCS has offered me Scrapple access, so I'll post the code there under a permissive license. From there, Tango and Phobos devs can look at it and do as they see fit.
Oct 24 2008
parent Don <nospam nospam.com.au> writes:
dsimcha wrote:
 == Quote from Don (nospam nospam.com.au)'s article
 Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric,
 Poisson, binomial PDFs.  Inverse normal distribution,

 Most of these are in Tango (not Kolmogorov). Are yours different in some way?

They calculate the exact log factorial using a caching scheme. Not sure how much accuracy this actually buys, though it costs some memory. I should probably change the logFactorial function to a gamma approximation at least for large N.

That shouldn't be necessary. If logGamma() isn't giving an accurate factorial (within a couple of bits of precision), that's a problem with logGamma. Please generate a bug report.
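
One way to check this kind of claim is to compare a cached, summed log factorial against a log-gamma call directly. The sketch below does that in Python, using `math.lgamma` as a stand-in for the `logGamma()` discussed here; it is not the D code from either library, and `LogFactorialCache` and `max_relative_gap` are names invented for the example. Note that the "exact" cached value is itself a sum of rounded logarithms, so it carries its own accumulated rounding error.

```python
import math


class LogFactorialCache:
    """Caches log(n!) as a running sum of log(k), growing on demand."""

    def __init__(self):
        self._table = [0.0]  # log(0!) == 0

    def __call__(self, n):
        if n < 0:
            raise ValueError("n must be non-negative")
        # Extend the table one entry at a time: log(k!) = log((k-1)!) + log(k).
        while len(self._table) <= n:
            k = len(self._table)
            self._table.append(self._table[-1] + math.log(k))
        return self._table[n]


def max_relative_gap(limit):
    """Largest relative difference between the cache and lgamma for 2..limit."""
    cached = LogFactorialCache()
    worst = 0.0
    for n in range(2, limit + 1):
        summed = cached(n)
        approx = math.lgamma(n + 1)
        worst = max(worst, abs(summed - approx) / summed)
    return worst
```

If `max_relative_gap` came out much larger than a few units of machine precision, that would point to a problem in one of the two methods. The cache also costs memory proportional to the largest n requested, which is the trade-off mentioned above.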
 Also, Tango doesn't have hypergeometric.

You're right. It's still on my hard disk; I wasn't quite happy with it.
 
 A struct to generate all possible permutations of a sequence.

 Correlation (Pearson, Spearman rho, Kendall tau).  Note that the Kendall
 tau correlation is a very efficient O(N log N) version.

 Mean, standard deviation, variance, kurtosis, percent variance for arrays of
 numeric values.

 Shannon entropy, mutual information.

 Kolmogorov-Smirnov tests

 Sounds good.

This is more the part that I thought might be useful.
 On the other hand, I'm a scientist, not a full-time programmer,

 Is there any interest in this from others in the D community?  Do other people
 think that D would benefit from having a decent statistics library?


BCS has offered me Scrapple access, I'll post the code there under a permissive license. From there, Tango and Phobos devs can look at it and do as they see fit.

Oct 25 2008