## digitalmars.D.announce - Statistics library

- dsimcha <dsimcha yahoo.com> Oct 23 2008
- bearophile <bearophileHUGS lycos.com> Oct 23 2008
- BCS <ao pathlink.com> Oct 23 2008
- dsimcha <dsimcha yahoo.com> Oct 24 2008
- BCS <ao pathlink.com> Oct 24 2008
- dsimcha <dsimcha yahoo.com> Oct 25 2008
- BCS <ao pathlink.com> Oct 26 2008
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Oct 23 2008
- BCS <ao pathlink.com> Oct 23 2008
- dsimcha <dsimcha yahoo.com> Oct 23 2008
- Don <nospam nospam.com> Oct 30 2008
- Walter Bright <newshound1 digitalmars.com> Oct 23 2008
- Dejan Lekic <dejan.lekic tiscali.co.uk> Oct 27 2008
- dsimcha <dsimcha yahoo.com> Oct 27 2008
- "Bill Baxter" <wbaxter gmail.com> Oct 23 2008
- dsimcha <dsimcha yahoo.com> Oct 23 2008
- dsimcha <dsimcha yahoo.com> Oct 23 2008
- "Bill Baxter" <wbaxter gmail.com> Oct 23 2008
- Don <nospam nospam.com.au> Oct 24 2008
- dsimcha <dsimcha yahoo.com> Oct 24 2008
- Don <nospam nospam.com.au> Oct 25 2008

Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.

The following functionality is currently available:

Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.

Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.

Shannon entropy, mutual information.

Kolmogorov-Smirnov tests.

Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs; hypergeometric, Poisson, binomial PDFs. Inverse normal distribution, and normally distributed random number generation.

A struct to generate all possible permutations of a sequence.

On the other hand, I'm a scientist, not a full-time programmer, and although I can write working code, I have no clue what it takes to get code up to the gold standard of "production." Also, this library is very D2-dependent, and I have no interest in back-porting it. Of course if by some chance someone else wanted to back-port it, they'd be more than welcome. Most of the code is covered somehow or another by unit tests, although I cheated a lot by having some unit tests depend on multiple functions.

Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library? Other comments?
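For readers wondering how a Kendall tau can run in O(N log N): one common approach is to sort the pairs by x and then count inversions in the resulting y sequence with a merge sort. The sketch below is not dsimcha's code. It is a minimal illustration in present-day D, it assumes there are no ties in either input (handling ties needs extra correction terms), and the names `countInversions` and `kendallTau` are made up for this example.

```d
import std.algorithm.sorting : sort;
import std.array : array;
import std.range : iota;
import std.stdio : writeln;

/// Sorts `a` in place (using scratch space `buf`, same length as `a`)
/// and returns the number of inversions, i.e. pairs i < j with a[i] > a[j].
ulong countInversions(double[] a, double[] buf)
{
    if (a.length < 2)
        return 0;
    immutable mid = a.length / 2;
    ulong inversions = countInversions(a[0 .. mid], buf[0 .. mid])
                     + countInversions(a[mid .. $], buf[mid .. $]);
    size_t i = 0, j = mid, k = 0;
    while (i < mid && j < a.length)
    {
        if (a[j] < a[i])
        {
            // a[j] is smaller than every remaining item of the left half.
            inversions += mid - i;
            buf[k++] = a[j++];
        }
        else
            buf[k++] = a[i++];
    }
    while (i < mid)
        buf[k++] = a[i++];
    while (j < a.length)
        buf[k++] = a[j++];
    a[] = buf[];
    return inversions;
}

/// Kendall tau for inputs WITHOUT ties (assumption of this sketch).
double kendallTau(const(double)[] x, const(double)[] y)
{
    assert(x.length == y.length && x.length >= 2);
    immutable n = x.length;

    // Indices ordered by increasing x.
    auto order = iota(x.length).array;
    sort!((a, b) => x[a] < x[b])(order);

    // y values rearranged into that order; each inversion is a discordant pair.
    auto ys = new double[n];
    foreach (rank, idx; order)
        ys[rank] = y[idx];
    auto buf = new double[n];
    immutable discordant = countInversions(ys, buf);

    immutable pairs = n * (n - 1) / 2.0;
    return 1.0 - 2.0 * discordant / pairs;
}

void main()
{
    writeln(kendallTau([1.0, 2, 3, 4], [1.0, 3, 2, 4]));
}
```

The sort and the merge-based inversion count are both O(N log N), which is where the overall bound comes from; the naive approach of comparing every pair is O(N^2).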

Oct 23 2008

dsimcha, I think the struct to generate permutations is out of place there, and would fit better in a module like my comb (combinatorics) one. Besides that detail, I like the idea of having a standard module with basic statistics, so I am interested :-) Bye, bearophile

Oct 23 2008

Reply to dsimcha,

> Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.

Well for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble BTW)

Oct 23 2008

== Quote from BCS (ao pathlink.com)'s article

> Reply to dsimcha,
>
> > Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.
>
> Well for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble BTW)

Sounds good at least for now. It's only about 1500 lines of code including unittests, comments, etc. I'll put it up on scrapple with a permissive license, and people can make suggestions, and integrate it into Phobos and Tango as they see fit.

Oct 24 2008

Reply to dsimcha,

> == Quote from BCS (ao pathlink.com)'s article
>
> > Reply to dsimcha,
> >
> > > Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.
> >
> > Well for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble BTW)
>
> Sounds good at least for now. It's only about 1500 lines of code including unittests, comments, etc. I'll put it up on scrapple with a permissive license, and people can make suggestions, and integrate it into Phobos and Tango as they see fit.

If you don't already have access send me your username and I'll add you.

Oct 24 2008

== Quote from BCS (ao pathlink.com)'s article

> Reply to dsimcha,
>
> > == Quote from BCS (ao pathlink.com)'s article
> >
> > > Reply to dsimcha,
> > >
> > > Well for starters, just ask and I'll get you access to put it on scrapple. That's if you don't want to go to the trouble of having your own project (it's not much trouble BTW)
> >
> > Sounds good at least for now. It's only about 1500 lines of code including unittests, comments, etc. I'll put it up on scrapple with a permissive license, and people can make suggestions, and integrate it into Phobos and Tango as they see fit.

Username: dsimcha

Yes, I realize that it's best to do things like this off the newsgroup, but your email address doesn't seem to work.

Oct 25 2008

Reply to dsimcha,

> Yes, I realize that it's best to do things like this off the newsgroup, but your email address doesn't seem to work.

Sorry. I figure I get enough SPAM as it is. Besides, there are about 2 dozen other ways to get it if you are persistent enough.

Oh. You're in, have fun (I really ought to make up some boilerplate like this: http://en.wikipedia.org/wiki/Sudo#Design ;)

Oct 26 2008

dsimcha wrote:

> Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.
>
> The following functionality is currently available:
>
> Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
>
> Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
>
> Shannon entropy, mutual information.
>
> Kolmogorov-Smirnov tests.
>
> Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs; hypergeometric, Poisson, binomial PDFs. Inverse normal distribution, and normally distributed random number generation.
>
> A struct to generate all possible permutations of a sequence.
>
> On the other hand, I'm a scientist, not a full-time programmer, and although I can write working code, I have no clue what it takes to get code up to the gold standard of "production." Also, this library is very D2-dependent, and I have no interest in back-porting it. Of course if by some chance someone else wanted to back-port it, they'd be more than welcome. Most of the code is covered somehow or another by unit tests, although I cheated a lot by having some unit tests depend on multiple functions.
>
> Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library? Other comments?

If the community is interested, I'd be glad to take over your code and put it in Phobos. Andrei

Oct 23 2008

Reply to Andrei,

> dsimcha wrote:
>
> > Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.
> >
> > The following functionality is currently available:
> >
> > Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
> >
> > Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
> >
> > Shannon entropy, mutual information.
> >
> > Kolmogorov-Smirnov tests.
> >
> > Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs; hypergeometric, Poisson, binomial PDFs. Inverse normal distribution, and normally distributed random number generation.
> >
> > A struct to generate all possible permutations of a sequence.
> >
> > On the other hand, I'm a scientist, not a full-time programmer, and although I can write working code, I have no clue what it takes to get code up to the gold standard of "production." Also, this library is very D2-dependent, and I have no interest in back-porting it. Of course if by some chance someone else wanted to back-port it, they'd be more than welcome. Most of the code is covered somehow or another by unit tests, although I cheated a lot by having some unit tests depend on multiple functions.
> >
> > Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library? Other comments?
>
> If the community is interested, I'd be glad to take over your code and put it in Phobos. Andrei

Even better would be getting it in both Phobos and Tango. Shouldn't be hard as I can't think it should depend on much.

Oct 23 2008

== Quote from BCS (ao pathlink.com)'s article

> Even better would be getting it in both Phobos and Tango. Shouldn't be hard as I can't think it should depend on much.

First, Tango needs to be ported to D2 (I realize that this is happening) or my code needs to be ported to D1. Anyhow, here are the dependencies:

Non-trivial, i.e. in several places: std.math, std.traits, std.functional, some custom sorting functions I wrote, which could just be included.

Trivial, i.e. in only one or two small places, pretty sure Tango has a drop-in replacement: std.bigint (for factorial, although all functions that actually use a factorial are calculated in log space, and therefore don't depend on this), std.algorithm (for swap, isSorted), std.random.
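To illustrate what "calculated in log space" means for a function that would otherwise need a factorial, here is a minimal sketch of a binomial probability mass function that never forms n! directly. It is not taken from the library. It uses `logGamma` from present-day Phobos (`std.mathspecial`), and the function names `logChoose` and `binomialPMF` are hypothetical.

```d
import std.math : exp, log, log1p;
import std.mathspecial : logGamma;
import std.stdio : writefln;

/// log(n choose k), using log(m!) == logGamma(m + 1) so nothing overflows.
real logChoose(ulong n, ulong k)
{
    return logGamma(n + 1.0L) - logGamma(k + 1.0L) - logGamma(n - k + 1.0L);
}

/// P(X == k) for X ~ Binomial(n, p).
/// Assumption of this sketch: 0 < p < 1, so both logarithms are finite.
real binomialPMF(ulong k, ulong n, real p)
{
    assert(k <= n && p > 0 && p < 1);
    // Sum the logarithms first and exponentiate once at the end.
    return exp(logChoose(n, k) + k * log(p) + (n - k) * log1p(-p));
}

void main()
{
    // 1000! is far outside the range of any built-in floating point type,
    // but its logarithm is an ordinary number.
    writefln("%.6g", binomialPMF(300, 1000, 0.3L));
}
```

Because only logarithms are added and subtracted, the intermediate values stay small even when n is in the thousands, which is why such functions have no real need for a big-integer factorial.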

Oct 23 2008

dsimcha wrote:

> == Quote from BCS (ao pathlink.com)'s article
>
> > Even better would be getting it in both Phobos and Tango. Shouldn't be hard as I can't think it should depend on much.
>
> First, Tango needs to be ported to D2 (I realize that this is happening) or my code needs to be ported to D1. Anyhow, here are the dependencies:
>
> Non-trivial, i.e. in several places: std.math, std.traits, std.functional, some custom sorting functions I wrote, which could just be included.
>
> Trivial, i.e. in only one or two small places, pretty sure Tango has a drop-in replacement: std.bigint (for factorial, although all functions that actually use a factorial are calculated in log space, and therefore don't depend on this), std.algorithm (for swap, isSorted), std.random.

This is another instance where we need a common namespace. Practically nothing in tango.math.* has dependencies on other parts of Tango, other than on the stuff which is now in Core. All the modules you mention would ideally be in the common namespace.

Oct 30 2008

Andrei Alexandrescu wrote:

> If the community is interested, I'd be glad to take over your code and put it in Phobos.

I'm interested.

Oct 23 2008

Now that I've figured out how the heck to use SVN, it's up on scrapple. Everything basically has unittests and has been dogfooded by me, but if there are any bugs I missed, please file them. I'm actually kind of surprised at the level of interest in this project.

http://dsource.org/projects/scrapple/browser/trunk/statistics

Just a reminder: The current version makes pretty heavy use of D2 features, so it is completely incompatible with D1. D1 compatibility was not even a consideration in the design of anything. I have no intention of backporting to D1, since D2 is getting pretty close to ready anyhow.

Also, I'm not a lawyer, but the intent of the license I put in the header of the files is to be very permissive to allow integration into Phobos and Tango of whatever functionality Andrei and Don see fit. If you see a problem with the way I have worded the license, let me know and I'll change it.

Oct 27 2008

On Fri, Oct 24, 2008 at 7:43 AM, dsimcha <dsimcha yahoo.com> wrote:

> Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.
>
> The following functionality is currently available:
>
> Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
>
> Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
>
> Shannon entropy, mutual information.
>
> Kolmogorov-Smirnov tests.
>
> Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs; hypergeometric, Poisson, binomial PDFs. Inverse normal distribution, and normally distributed random number generation.
>
> A struct to generate all possible permutations of a sequence.

I don't know what a lot of those things are, but statistics to me means you will probably have (or eventually want) things like covariance which are best represented as matrices. Does your package also have a matrix library? --bb

Oct 23 2008

== Quote from Bill Baxter (wbaxter gmail.com)'s article

> On Fri, Oct 24, 2008 at 7:43 AM, dsimcha <dsimcha yahoo.com> wrote:
>
> > Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.
> >
> > The following functionality is currently available:
> >
> > Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
> >
> > Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
> >
> > Shannon entropy, mutual information.
> >
> > Kolmogorov-Smirnov tests.
> >
> > Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs; hypergeometric, Poisson, binomial PDFs. Inverse normal distribution, and normally distributed random number generation.
> >
> > A struct to generate all possible permutations of a sequence.
>
> I don't know what a lot of those things are, but statistics to me means you will probably have (or eventually want) things like covariance which are best represented as matrices. Does your package also have a matrix library? --bb

No, it doesn't have a matrix library right now. I make no claim that it is in any way complete right now, but I do think it has some pretty useful stuff that's not likely to be anywhere else for D.

Oct 23 2008

== Quote from Bill Baxter (wbaxter gmail.com)'s article

> Ok, so it's mainly for 1d statistics then? --bb

Right.

Oct 23 2008

On Fri, Oct 24, 2008 at 9:39 AM, dsimcha <dsimcha yahoo.com> wrote:

> == Quote from Bill Baxter (wbaxter gmail.com)'s article
>
> > On Fri, Oct 24, 2008 at 7:43 AM, dsimcha <dsimcha yahoo.com> wrote:
> >
> > > Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area.
> > >
> > > The following functionality is currently available:
> > >
> > > Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
> > >
> > > Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
> > >
> > > Shannon entropy, mutual information.
> > >
> > > Kolmogorov-Smirnov tests.
> > >
> > > Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs; hypergeometric, Poisson, binomial PDFs. Inverse normal distribution, and normally distributed random number generation.
> > >
> > > A struct to generate all possible permutations of a sequence.
> >
> > I don't know what a lot of those things are, but statistics to me means you will probably have (or eventually want) things like covariance which are best represented as matrices. Does your package also have a matrix library? --bb
>
> No, it doesn't have a matrix library right now. I make no claim that it is in any way complete right now, but I do think it has some pretty useful stuff that's not likely to be anywhere else for D.

Ok, so it's mainly for 1d statistics then? --bb

Oct 23 2008

dsimcha wrote:

> Since there's really no good comprehensive statistics library for D (Tango has a little bit, the beginnings of a few are on dsource, but nothing much), I've been rolling my own statistics functions as necessary. Almost by accident, it seems like I've built up the beginnings of a decent statistics library. I'm debating whether it might be interesting enough to people to be worth releasing, and whether enough community help would be available to really make it production quality, or to merge it with other people's efforts in this area. The following functionality is currently available:
>
> Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric, Poisson, binomial PDFs. Inverse normal distribution,

Most of these are in Tango (not Kolmogorov). Are yours different in some way?

> A struct to generate all possible permutations of a sequence.
>
> Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
>
> Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
>
> Shannon entropy, mutual information.
>
> Kolmogorov-Smirnov tests

Sounds good.

> On the other hand, I'm a scientist, not a full-time programmer,

Me too!

> Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library?

Yes. Which is why I put the existing stuff into Tango.

Oct 24 2008

== Quote from Don (nospam nospam.com.au)'s article

> > Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric, Poisson, binomial PDFs. Inverse normal distribution,
>
> Most of these are in Tango (not Kolmogorov). Are yours different in some way?

They calculate the exact log factorial using a caching scheme. Not sure how much accuracy this actually buys, though it costs some memory. I should probably change the logFactorial function to a gamma approximation at least for large N. Also, Tango doesn't have hypergeometric.

> > A struct to generate all possible permutations of a sequence.
> >
> > Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
> >
> > Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
> >
> > Shannon entropy, mutual information.
> >
> > Kolmogorov-Smirnov tests
>
> Sounds good.

This is more the part that I thought might be useful.

> > On the other hand, I'm a scientist, not a full-time programmer,
>
> > Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library?

BCS has offered me Scrapple access, I'll post the code there under a permissive license. From there, Tango and Phobos devs can look at it and do as they see fit.
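The caching scheme for the log factorial mentioned in this post is not shown in the thread, so the following is only a guess at its general shape: a table of log(n!) values that grows on demand, with a fall-back to `logGamma` above a cut-off to bound the memory cost. The struct name `LogFactorialCache`, the method name and the cut-off value are all hypothetical.

```d
import std.math : log;
import std.mathspecial : logGamma;
import std.stdio : writefln;

/// Grows a table of log(n!) on demand; not the library's actual code.
struct LogFactorialCache
{
    private real[] table;            // table[i] == log(i!)
    enum size_t maxCached = 10_000;  // hypothetical cut-off to cap memory use

    real logFactorial(size_t n)
    {
        // Above the cut-off, use the gamma function instead of the table.
        if (n > maxCached)
            return logGamma(n + 1.0L);
        if (table.length == 0)
            table ~= 0.0L;           // log(0!) == 0
        // Extend the table: log(m!) == log((m - 1)!) + log(m).
        while (table.length <= n)
            table ~= table[$ - 1] + log(cast(real) table.length);
        return table[n];
    }
}

void main()
{
    LogFactorialCache cache;
    writefln("%.12g", cache.logFactorial(10));
    writefln("%.12g", cache.logFactorial(20_000));
}
```

The trade-off is the one described in the post: repeated lookups for small n become a single array read, at the price of one `real` per cached value.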

Oct 24 2008

dsimcha wrote:

> == Quote from Don (nospam nospam.com.au)'s article
>
> > > Binomial, hypergeometric, normal, Poisson, Kolmogorov CDFs, hypergeometric, Poisson, binomial PDFs. Inverse normal distribution,
> >
> > Most of these are in Tango (not Kolmogorov). Are yours different in some way?
>
> They calculate the exact log factorial using a caching scheme. Not sure how much accuracy this actually buys, though it costs some memory. I should probably change the logFactorial function to a gamma approximation at least for large N.

That shouldn't be necessary. If logGamma() isn't giving an accurate factorial (within a couple of bits of precision), that's a problem with logGamma. Please generate a bug report.

> Also, Tango doesn't have hypergeometric.

You're right. It's still on my hard disk, I wasn't quite happy with it.

> > > A struct to generate all possible permutations of a sequence.
> > >
> > > Correlation (Pearson, Spearman rho, Kendall tau). Note that the Kendall tau correlation is a very efficient O(N log N) version.
> > >
> > > Mean, standard deviation, variance, kurtosis, percent variance for arrays of numeric values.
> > >
> > > Shannon entropy, mutual information.
> > >
> > > Kolmogorov-Smirnov tests
> >
> > Sounds good.
>
> This is more the part that I thought might be useful.
>
> > > On the other hand, I'm a scientist, not a full-time programmer,
>
> > > Is there any interest in this from others in the D community? Do other people think that D would benefit from having a decent statistics library?
>
> BCS has offered me Scrapple access, I'll post the code there under a permissive license. From there, Tango and Phobos devs can look at it and do as they see fit.
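One way to check the accuracy point raised in this reply is to compare `logGamma(n + 1)` against log(n!) summed term by term. The sketch below uses `logGamma` from present-day Phobos (`std.mathspecial`), not the Tango function under discussion, and the helper name `logFactorialRelDiff` is made up. Note that the term-by-term sum accumulates its own rounding error, so it is a reference value rather than a truly exact one.

```d
import std.math : fabs, log;
import std.mathspecial : logGamma;
import std.stdio : writefln;

/// Relative difference between log(n!) summed term by term
/// and logGamma(n + 1).
real logFactorialRelDiff(uint n)
{
    real summed = 0;
    foreach (i; 2 .. n + 1)
        summed += log(cast(real) i);
    immutable viaGamma = logGamma(n + 1.0L);
    // For n < 2 the reference value is 0, so report the absolute difference.
    return summed == 0 ? fabs(viaGamma) : fabs(viaGamma - summed) / summed;
}

void main()
{
    foreach (n; [10u, 100u, 1000u])
        writefln("n = %s: relative difference %.3g", n, logFactorialRelDiff(n));
}
```

If the differences printed are on the order of the floating point epsilon, a separate cached table buys little accuracy over calling the gamma function directly.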

Oct 25 2008