## digitalmars.D - Matrix/Linear Algebra Library?

- Benji Smith <dlanguage benjismith.net> Oct 02 2008
- "Bill Baxter" <wbaxter gmail.com> Oct 02 2008
- "Bill Baxter" <wbaxter gmail.com> Oct 02 2008
- Don Clugston <nospam nospam.com> Oct 03 2008
- Fawzi Mohamed <fmohamed mac.com> Oct 03 2008
- Fawzi Mohamed <fmohamed mac.com> Oct 03 2008
- TomD <t_demmer nospam.web.de> Oct 03 2008
- Benji Smith <dlanguage benjismith.net> Oct 03 2008
- Oliver Dathe <o.dathe gmx.de> Oct 02 2008
- Benji Smith <dlanguage benjismith.net> Oct 02 2008
- "Bill Baxter" <wbaxter gmail.com> Oct 03 2008

Does anyone have a recommendation for a linear algebra library in D? I need to do a singular value decomposition on large sparse matrices (approx 100,000 x 100,000). I've glanced at the Blade library, on DSource, but my linear algebra is a bit rusty, and I can't tell whether the library simply lacks the ability to do an SVD or whether it can be done, but only by composing other operations. Thanks! --benji

Oct 02 2008

You need something like ARPACK to do SVD on large sparse matrices. I have implemented a wrapper for Python before. And I could use one for D right now too, but I haven't done it yet. I have a bunch of other linear algebra wrappers in my MultiArray project on dsource, including LAPACK which can do SVD on dense matrices. There's also GSL wrappers someone contributed in the bindings project. I don't know much about it, but supposedly it at least contains BLAS wrappers. I don't know if it has anything much beyond that though. If I needed something like this today, I'd put an Arpack wrapper into multiarray, following the pattern used by the Blas, Lapack, SuperLU, Umfpack, and Taucs wrappers I've already written. Arpack isn't exactly easy to call directly, so then I'd follow up by looking into what SciPy has got for Arpack wrappers these days and try porting that to D. --bb On Fri, Oct 3, 2008 at 6:50 AM, Benji Smith <dlanguage benjismith.net> wrote:Does anyone have a recommendation for a linear algebra library in D? I need to do a singular value decomposition on large sparse matrices (approx 100,000 x 100,000). I've glanced at the Blade library, on DSource, but my linear algebra is a bit rusty, and I can't tell whether the library simply lacks the ability to do an SVD or whether it can be done, but only by composing other operations. Thanks! --benji

Oct 02 2008

Oh, and Blade is a very low-level library aimed at optimizing the very simplest linear algebra expressions like vector+vector. Last I checked it didn't contain any matrix ops at all. And it seems like Don is frying different fish these days (like Tango BigInt). --bb On Fri, Oct 3, 2008 at 6:50 AM, Benji Smith <dlanguage benjismith.net> wrote:Does anyone have a recommendation for a linear algebra library in D? I need to do a singular value decomposition on large sparse matrices (approx 100,000 x 100,000). I've glanced at the Blade library, on DSource, but my linear algebra is a bit rusty, and I can't tell whether the library simply lacks the ability to do an SVD or whether it can be done, but only by composing other operations. Thanks! --benji

Oct 02 2008

Bill Baxter wrote:Oh, and Blade is a very low-level library aimed at optimizing the very simplest linear algebra expressions like vector+vector. Last I checked it didn't contain any matrix ops at all. And it seems like Don is frying different fish these days (like Tango BigInt).

It's pretty much obselete now that array ops are in the language. That always happens to my most interesting, sophisticated code -- Walter turns it into a one-liner. BTW BigInt will eventually get to Phobos, too.--bb On Fri, Oct 3, 2008 at 6:50 AM, Benji Smith <dlanguage benjismith.net> wrote:

Oct 03 2008

On 2008-10-03 10:27:27 +0200, Don Clugston <nospam nospam.com> said:Bill Baxter wrote:Oh, and Blade is a very low-level library aimed at optimizing the very simplest linear algebra expressions like vector+vector. Last I checked it didn't contain any matrix ops at all. And it seems like Don is frying different fish these days (like Tango BigInt).

It's pretty much obselete now that array ops are in the language. That always happens to my most interesting, sophisticated code -- Walter turns it into a one-liner. BTW BigInt will eventually get to Phobos, too.--bb On Fri, Oct 3, 2008 at 6:50 AM, Benji Smith <dlanguage benjismith.net> wrote:

I have written dense multidimensional arrays in D. They provide some wrapper to various Lapack functions (through Bill's wrappers). The library is available at http://github.com/fawzi SVD is also there, but just for *dense* matrixes. Bill's wrappers and his dflat also work with sparse matrixes. Actually I think that you should think more about your problem If you need full SVD U and V matrixes are *full* even if A is sparse. 100'000x100'000 double matrix ~74GB, floats is half of it, but you have at least two matrixes. Either you have really big resources, or you switch to a cluster and scalapack, or (better) you reformulate your problem so that you either need just a partial decomposition, or you can solve it iteratively. Fawzi

Oct 03 2008

On 2008-10-03 11:46:11 +0200, Fawzi Mohamed <fmohamed mac.com> said:On 2008-10-03 10:27:27 +0200, Don Clugston <nospam nospam.com> said:Bill Baxter wrote:Oh, and Blade is a very low-level library aimed at optimizing the very simplest linear algebra expressions like vector+vector. Last I checked it didn't contain any matrix ops at all. And it seems like Don is frying different fish these days (like Tango BigInt).

It's pretty much obselete now that array ops are in the language. That always happens to my most interesting, sophisticated code -- Walter turns it into a one-liner. BTW BigInt will eventually get to Phobos, too.--bb On Fri, Oct 3, 2008 at 6:50 AM, Benji Smith <dlanguage benjismith.net> wrote:

I have written dense multidimensional arrays in D. They provide some wrapper to various Lapack functions (through Bill's wrappers). The library is available at http://github.com/fawzi SVD is also there, but just for *dense* matrixes. Bill's wrappers and his dflat also work with sparse matrixes. Actually I think that you should think more about your problem If you need full SVD U and V matrixes are *full* even if A is sparse. 100'000x100'000 double matrix ~74GB, floats is half of it, but you have at least two matrixes. Either you have really big resources, or you switch to a cluster and scalapack, or (better) you reformulate your problem so that you either need just a partial decomposition, or you can solve it iteratively. Fawzi

Probably if you want least square approximation (that is what you meant with LSA?) some kind of direct minimizer is the way to go (but be careful the problem might ill conditioned). Fawzi

Oct 03 2008

Fawzi Mohamed Wrote: [...]Probably if you want least square approximation (that is what you meant with LSA?) some kind of direct minimizer is the way to go (but be careful the problem might ill conditioned). Fawzi

for a conjugate gradient method (if the matrix is symmetric), or do some sort of pre-conditioning. But then again, the last time I did CFD was some 10 years ago. Ciao Tom

Oct 03 2008

Fawzi Mohamed wrote:Probably if you want least square approximation (that is what you meant with LSA?) some kind of direct minimizer is the way to go (but be careful the problem might ill conditioned). Fawzi

Sorry. I meant "Latent Semantic Analysis". I'm building a table of synonyms from a large document repository. There are about 100,000 documents in the repository, and the total corpus has about 100,000 unique words (after stemming). The LSA algorithm uses a matrix, where each row represents a single term, and each column represents a single document. Each cell contains the TF-IDF (Term Frequency / Inverse Document Frequency). Performing an SVD on the matrix yields a collelation matrix, representing the synonymy of all terms, in all documents. --benji

Oct 03 2008

I used LinAlg of the Shark-Project for SVD but is has no sparse matrix support and is C++. http://shark-project.sourceforge.net/ GNU Scientific Library http://www.gnu.org/software/gsl/ http://www.gnu.org/software/gsl/manual/html_node/Singular-Value-Decomposition.html It seem to have no sparse matrix support but there are bindings in http://www.dsource.org/projects/bindings/browser/trunk/gsl SVDLIBC http://tedlab.mit.edu/~dr/SVDLIBC/ It has sparse matrix support but you have to make bindings yourself or try bcd for that http://www.dsource.org/projects/bcd NumPy http://www.scipy.org/doc/numpy_api_docs/numpy.linalg.linalg.html#svd I've never used Python but there is Kirks well known pyd for interfacing between Python and D. Sure you need SVD? Often it is used for least squares fitting or pseudoinverse. So depending on what you are doing there may be weaker alternatives like Moore-Penrose pseudoinverse.

Oct 02 2008

Oliver Dathe wrote:Sure you need SVD? Often it is used for least squares fitting or pseudoinverse. So depending on what you are doing there may be weaker alternatives like Moore-Penrose pseudoinverse.

Yeah, actually, I definitely need SVD. I'm putting together an LSA algorithm for a machine-learning project. My first instinct was to use Colt & JAMA (both Java libs) since the rest of the project is in Java. But I thought it would be interesting to compare the Java implementation with one in D, both performance-wise and API-wise. Thanks, everybody, for your quick responses! --benji

Oct 02 2008

On Sat, Oct 4, 2008 at 1:10 AM, Benji Smith <dlanguage benjismith.net> wrote:Fawzi Mohamed wrote:Probably if you want least square approximation (that is what you meant with LSA?) some kind of direct minimizer is the way to go (but be careful the problem might ill conditioned). Fawzi

Sorry. I meant "Latent Semantic Analysis". I'm building a table of synonyms from a large document repository. There are about 100,000 documents in the repository, and the total corpus has about 100,000 unique words (after stemming). The LSA algorithm uses a matrix, where each row represents a single term, and each column represents a single document. Each cell contains the TF-IDF (Term Frequency / Inverse Document Frequency). Performing an SVD on the matrix yields a collelation matrix, representing the synonymy of all terms, in all documents.

How many nonzeros? Assuming each document has maybe 10K terms that's still a billion nonzero entries, and the SVD's U and V matrices will be dense. That is a lot of data. Fawzi's right, you probably need some bigger hammers for that, like a distributed or out of core SVD routine. But given the application I'm also guessing you only care about the vectors associated with the largest few singular values, so you should be looking for a routine that can find a few singular values at a time. The Arnoldi methods in ARPACK can do that kind of thing. I can't remember if that handles sparse matrices too, but I think it does. I seem to recall that you provide your own Mat*Vector calculation (and maybe dot or vector addition ops too?) and it does the rest. --bb

Oct 03 2008