## digitalmars.D.announce - Statistics library

- dsimcha (28/28) Oct 23 2008 Since there's really no good c...
- bearophile (4/4) Oct 23 2008 dsimcha, I think the struct to...
- BCS (4/13) Oct 23 2008 Well for starters, just ask an...
- dsimcha (5/18) Oct 24 2008 Sounds good at least for now. ...
- Andrei Alexandrescu (4/43) Oct 23 2008 If the community is interested...
- BCS (3/53) Oct 23 2008 Even better would be getting i...
- dsimcha (11/13) Oct 23 2008 First, Tango needs to be porte...
- Don (5/20) Oct 30 2008 This is another instance where...
- Walter Bright (2/4) Oct 23 2008 I'm interested....
- Dejan Lekic (1/1) Oct 27 2008 If my vote counts - I am all f...
- dsimcha (13/13) Oct 27 2008 Now that I've figured out how ...
- Bill Baxter (6/24) Oct 23 2008 I don't know what a lot of tho...
- dsimcha (4/35) Oct 23 2008 No, it doesn't have a matrix l...
- Bill Baxter (3/38) Oct 23 2008 Ok, so it's mainly for 1d stat...
- dsimcha (2/4) Oct 23 2008 Right....
- Don (8/30) Oct 24 2008 Most of these are in Tango (no...
- dsimcha (8/31) Oct 24 2008 They calculate the exact log f...
- Don (6/43) Oct 25 2008 That shouldn't be necessary. I...
- dsimcha (4/7) Jan 02 2009 Yes, I'm replying to a post fr...

Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.

The following functionality is currently available:

- Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
- Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
- Shannon entropy, mutual information.
- Kolmogorov-Smirnov tests.
- Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs; hypergeometric, Poisson, binomial PDFs.
- Inverse normal distribution, and normally distributed random number generation.
- A struct to generate all possible permutations of a sequence.

On the other hand, I'm a scientist, not a full-time programmer, and although I can write working code, I have no clue what it takes to get code up to the gold standard of "production." Also, this library is very D2-dependent, and I have no interest in back-porting it. Of course if by some chance someone else wanted to back-port it, they'd be more than welcome. Most of the code is covered somehow or another by unit tests, although I cheated a lot by having some unit tests depend on multiple functions.

Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library? Other comments?

Oct 23 2008
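For readers wondering how a Kendall tau can run in O(N log N) rather than the naive O(N²) pair comparison, here is a minimal Python sketch of one common approach: sort the pairs by the first variable, then count inversions in the second variable with a merge sort. This is not dsimcha's D code (the thread does not show it). The function names `count_inversions` and `kendall_tau_no_ties` are made up for illustration, and the sketch assumes there are no tied values in either input.

```python
def count_inversions(values):
    """Return (sorted copy, number of inversions) using merge sort, O(N log N)."""
    n = len(values)
    if n <= 1:
        return list(values), 0
    mid = n // 2
    left, inv_left = count_inversions(values[:mid])
    right, inv_right = count_inversions(values[mid:])
    merged = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every remaining element of left,
            # so it forms one inversion with each of them.
            inversions += len(left) - i
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def kendall_tau_no_ties(x, y):
    """Kendall tau for paired samples, assuming no ties in x or in y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    # Reorder y by ascending x; each inversion in the result is a discordant pair.
    y_by_x = [y_value for _, y_value in sorted(zip(x, y))]
    _, discordant = count_inversions(y_by_x)
    total_pairs = n * (n - 1) // 2
    # tau = (concordant - discordant) / total = 1 - 2 * discordant / total
    return 1.0 - 2.0 * discordant / total_pairs


if __name__ == "__main__":
    print(kendall_tau_no_ties([1, 2, 3, 4, 5], [3, 1, 2, 5, 4]))
```

Handling ties needs extra bookkeeping that this sketch leaves out.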

dsimcha, I think the struct to generate permutations is out of place there, and would fit better in a module like my comb (combinatorics) one. Besides that detail, I like the idea of having a standard module with basic statistics, so I am interested :-) Bye, bearophile

Oct 23 2008

Well for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble BTW)

Oct 23 2008

Sounds good at least for now. It's only about 1500 lines of code including unittests, comments, etc. I'll put it up on scrapple with a permissive license, and people can make suggestions, and integrate it into Phobos and Tango as they see fit.

Oct 24 2008

If you don't already have access send me your username and I'll add you.

Oct 24 2008

Username: dsimcha

Yes, I realize that it's best to do things like this off the newsgroup, but your email address doesn't seem to work.

Oct 25 2008

Reply to dsimcha,

> Yes, I realize that it's best to do things like this off the newsgroup, but your email address doesn't seem to work.

Sorry. I figure I get enough SPAM as it is. Besides, there are about 2 dozen other ways to get it if you are persistent enough.

Oh. You're in, have fun. (I really ought to make up some boilerplate like this: http://en.wikipedia.org/wiki/Sudo#Design ;)

Oct 26 2008

If the community is interested, I'd be glad to take over your code and put it in Phobos. Andrei

Oct 23 2008

Even better would be getting it in both Phobos and Tango. Shouldn't be hard as I can't think it should depend on much.

Oct 23 2008

First, Tango needs to be ported to D2 (I realize that this is happening) or my code needs to be ported to D1. Anyhow, here are the dependencies:

Non-trivial, i.e. in several places:

- std.math
- std.traits
- std.functional
- some custom sorting functions I wrote, which could just be included

Trivial, i.e. in only one or two small places, pretty sure Tango has a drop-in replacement:

- std.bigint (for factorial, although all functions that actually use a factorial are calculated in log space, and therefore don't depend on this)
- std.algorithm (for swap, isSorted)
- std.random

Oct 23 2008
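To illustrate what "calculated in log space" buys, here is a minimal Python sketch of a hypergeometric PMF that never forms a factorial directly, so no big-integer type is needed. It is only an illustration of the general technique, not the library's D code: the function names are hypothetical, and it uses Python's `math.lgamma` in place of whatever log-factorial routine the library actually uses.

```python
import math


def log_factorial(n):
    """log(n!) via the log-gamma function: n! = Gamma(n + 1)."""
    return math.lgamma(n + 1)


def log_choose(n, k):
    """log of the binomial coefficient C(n, k)."""
    return log_factorial(n) - log_factorial(k) - log_factorial(n - k)


def hypergeometric_pmf(k, population, successes, draws):
    """P(exactly k successes in `draws` draws without replacement)."""
    lowest = max(0, draws - (population - successes))
    highest = min(draws, successes)
    if k < lowest or k > highest:
        return 0.0  # outside the support of the distribution
    log_p = (log_choose(successes, k)
             + log_choose(population - successes, draws - k)
             - log_choose(population, draws))
    # Only the final, modest-sized log-probability is exponentiated.
    return math.exp(log_p)


if __name__ == "__main__":
    # Factorials of numbers this large would overflow a double if computed directly.
    print(hypergeometric_pmf(500, 10**6, 10**5, 5000))
```

The intermediate factorials here are far too large for a floating-point value, but their logarithms are ordinary numbers, which is why the log-space functions do not need std.bigint.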

This is another instance where we need a common namespace. Practically nothing in tango.math.* has dependencies on other parts of Tango, other than on the stuff which is now in Core. All the modules you mention would ideally be in the common namespace.

Oct 30 2008

Andrei Alexandrescu wrote:

> If the community is interested, I'd be glad to take over your code and put it in Phobos.

I'm interested.

Oct 23 2008

Now that I've figured out how the heck to use SVN, it's up on scrapple. Everything basically has unittests and has been dogfooded by me, but if there are any bugs I missed, please file them. I'm actually kind of surprised at the level of interest in this project.

http://dsource.org/projects/scrapple/browser/trunk/statistics

Just a reminder: The current version makes pretty heavy use of D2 features, so it is completely incompatible with D1. D1 compatibility was not even a consideration in the design of anything. I have no intention of backporting to D1, since D2 is getting pretty close to ready anyhow.

Also, I'm not a lawyer, but the intent of the license I put in the header of the files is to be very permissive to allow integration into Phobos and Tango of whatever functionality Andrei and Don see fit. If you see a problem with the way I have worded the license, let me know and I'll change it.

Oct 27 2008

I don't know what a lot of those things are, but statistics to me means you will probably have (or eventually want) things like covariance which are best represented as matrices. Does your package also have a matrix library? --bb

Oct 23 2008

No, it doesn't have a matrix library right now. I make no claim that it is in any way complete right now, but I do think it has some pretty useful stuff that's not likely to be anywhere else for D.

Oct 23 2008

Ok, so it's mainly for 1d statistics then? --bb

Oct 23 2008

Right.

Oct 23 2008

dsimcha wrote:

> Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric, Poisson, binomial PDFs. Inverse normal distribution,

Most of these are in Tango (not Kolmogorov). Are yours different in some way?

> A struct to generate all possible permutations of a sequence.
>
> Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
>
> Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
>
> Shannon entropy, mutual information.
>
> Kolmogorov-Smirnov tests

Sounds good.

> On the other hand, I'm a scientist, not a full-time programmer,

Me too!

> Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library?

Yes. Which is why I put the existing stuff into Tango.

Oct 24 2008

== Quote from Don (nospam nospam.com.au)'s article

> > Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric, Poisson, binomial PDFs. Inverse normal distribution,
>
> Most of these are in Tango (not Kolmogorov). Are yours different in some way?

They calculate the exact log factorial using a caching scheme. Not sure how much accuracy this actually buys, though it costs some memory. I should probably change the logFactorial function to a gamma approximation at least for large N. Also, Tango doesn't have hypergeometric.

> > A struct to generate all possible permutations of a sequence.
> >
> > Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
> >
> > Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
> >
> > Shannon entropy, mutual information.
> >
> > Kolmogorov-Smirnov tests
>
> Sounds good.

This is more the part that I thought might be useful.

> > On the other hand, I'm a scientist, not a full-time programmer,
>
> Me too!
>
> > Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library?
>
> Yes. Which is why I put the existing stuff into Tango.

BCS has offered me Scrapple access, so I'll post the code there under a permissive license. From there, Tango and Phobos devs can look at it and do as they see fit.

Oct 24 2008
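The trade-off described in the post above (a cached log factorial versus a gamma-function call) can be sketched in a few lines of Python. This is a guess at the general shape of such a caching scheme, not the actual `logFactorial` from the library: the class name `LogFactorialCache` is hypothetical, and the cache here is a simple running sum of logarithms.

```python
import math


class LogFactorialCache:
    """Memoized log(n!) built as a running sum: log(n!) = log((n-1)!) + log(n)."""

    def __init__(self):
        self._table = [0.0]  # log(0!) = 0

    def __call__(self, n):
        if n < 0:
            raise ValueError("n must be non-negative")
        # Extend the table only as far as needed; memory grows linearly with n.
        while len(self._table) <= n:
            k = len(self._table)
            self._table.append(self._table[-1] + math.log(k))
        return self._table[n]


if __name__ == "__main__":
    cached = LogFactorialCache()
    for n in (10, 100, 1000):
        # Compare the cached sum with the log-gamma route, log(n!) = lgamma(n + 1).
        print(n, cached(n), math.lgamma(n + 1))
```

A running sum like this accumulates floating-point rounding error as the table grows, so it is not automatically more accurate than a good log-gamma implementation. The log-gamma call also needs no table at all.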

dsimcha wrote:

> They calculate the exact log factorial using a caching scheme. Not sure how much accuracy this actually buys, though it costs some memory. I should probably change the logFactorial function to a gamma approximation at least for large N.

That shouldn't be necessary. If logGamma() isn't giving an accurate factorial (within a couple of bits of precision), that's a problem with logGamma. Please generate a bug report.

> Also, Tango doesn't have hypergeometric.

You're right. It's still on my hard disk, I wasn't quite happy with it.

> BCS has offered me Scrapple access, I'll post the code there under a permissive license. From there, Tango and Phobos devs can look at it and do as they see fit.

Cool!

Oct 25 2008

Don wrote:

> > Also, Tango doesn't have hypergeometric.
>
> You're right. It's still on my hard disk, I wasn't quite happy with it.

Yes, I'm replying to a post from 3+ months ago, but I couldn't find any other way to contact you, Don. I'm trying to improve a few aspects of this library by borrowing code from MathExtra/Tango in some cases (mainly in the distributions module) where the MathExtra/Tango implementation was better than mine. For example, until recently I was unaware that binomial distributions could simply be calculated in O(1) in terms of the incomplete beta function. I'm now also trying to make my hypergeometric CDF code scale this well. Could you please send me, or post somewhere, your hypergeometric CDF code, even if it sucks in its current form, so I can compare it to a few implementations I wrote, try to improve it, etc.? Thanks.

Jan 02 2009
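For reference, the relationship mentioned in the last post is the standard identity between the binomial CDF and the regularized incomplete beta function $I_x(a, b)$. The notation below is the usual textbook one, not taken from the library's code:

$$
P(X \le k) \;=\; \sum_{i=0}^{k} \binom{n}{i} p^{i} (1-p)^{n-i} \;=\; I_{1-p}(n-k,\; k+1), \qquad 0 \le k < n .
$$

Evaluating the right-hand side takes one incomplete-beta call, whereas the sum on the left has $k + 1$ terms. A quick sanity check: for $n = 1$, $k = 0$ the identity gives $I_{1-p}(1, 1) = 1 - p$, which is the probability of zero successes in one trial.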