## digitalmars.D - 32-bit Memory Limitations

- John Demme <jdd cs.columbia.edu> Sep 03 2010
- bearophile <bearophileHUGS lycos.com> Sep 03 2010
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Sep 03 2010
- John Demme <jdd cs.columbia.edu> Sep 03 2010
- Tomek Sowiński <just ask.me> Sep 03 2010
- BCS <none anon.com> Sep 03 2010
- John Demme <jdd cs.columbia.edu> Sep 03 2010
- John Demme <jdd cs.columbia.edu> Sep 08 2010
- Tomek Sowiński <just ask.me> Sep 07 2010
- BCS <none anon.com> Sep 08 2010
- Tomek Sowiński <just ask.me> Sep 08 2010

Hello all- I apologize since I'm sure this question has been asked numerous times previously, but I could not find it in the last 10k messages. Is there a rough timeline for 64-bit DMD 2 support? (For Linux--I don't care about Windows.) I understand that Walter is working on it and certainly don't expect a firm date, but I have no sense for the amount of work involved... Is this a one man-week feature or several man-months? I know GDC has 64-bit support, but it has not been synced up in some time. Does anyone know if this will be updated in the near future?

I ask because I have written a scientific application I would like to operate on a very large matrix of doubles--in the range of 200^4 elements--requiring about 12GB of memory, far larger than the current ~4GB limit. Ideally, I'd like to ramp this up to even 20GB matrices. I'm currently running on machines with 24GB of RAM (and we may upgrade a few next year) so this is not a performance issue, merely a software issue.

Additionally, is anyone aware of any extreme cleverness to transparently work around this issue? I would imagine not, but I'm constantly amazed by some of the hacks I've seen. 4GB limits me to about 150^4 elements, which is acceptable for the time being. As such, I'm not terribly interested in any extreme hacks to get around this. I could obviously multi-process the computation (which would help in distributing it) but I don't need to do this yet.

(Yes, all of those exponents are 4, not 2. This is actually a 4-dimensional matrix, but for the purpose of most parts of the computation, I can treat it like a typical 2-dim matrix. Not relevant, I suppose, but perhaps interesting.)

~John

Sep 03 2010

John Demme: Is this a one man-week feature or several man-months?

I think a few Walter-months. Each Walter-month is about three human-months ;-) Bye, bearophile

Sep 03 2010

On 9/3/10 17:10 CDT, bearophile wrote: John Demme: Is this a one man-week feature or several man-months?

I think few Walter-months. Each Walter-month is about three human-months ;-)

Should be quicker than that. I think we'll have something usable before the end of this month. Walter wanted to get to alpha in August, but was detoured by work on optlink. Andrei

Sep 03 2010

Excellent. Thanks very much. Andrei Alexandrescu wrote: On 9/3/10 17:10 CDT, bearophile wrote: John Demme: Is this a one man-week feature or several man-months?

I think few Walter-months. Each Walter-month is about three human-months ;-)

Should be quicker than that. I think we'll have something usable before the end of this month. Walter wanted to get to alpha in August, but was detoured by work on optlink. Andrei

Sep 03 2010

On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote: (Yes, all of those exponents are 4, not 2. This is actually a 4-dimensional matrix, but for the purpose of most parts of the computation, I can treat it like a typical 2-dim matrix. Not relevant, I suppose, but perhaps interesting.)

Very. What do you do with 4-D matrices? Tell us! Tomek

Sep 03 2010

Hello Tomek, On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote: (Yes, all of those exponents are 4, not 2. This is actually a 4-dimensional matrix, but for the purpose of most parts of the computation, I can treat it like a typical 2-dim matrix. Not relevant, I suppose, but perhaps interesting.)

http://en.wikipedia.org/wiki/Tensor#Continuum_mechanics FEA on elastic solids? Tomek

... <IXOYE><

Sep 03 2010

Tomek Sowiński wrote: On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote: (Yes, all of those exponents are 4, not 2. This is actually a 4-dimensional matrix, but for the purpose of most parts of the computation, I can treat it like a typical 2-dim matrix. Not relevant, I suppose, but perhaps interesting.)

Very. What do you do with 4-D matrices? Tell us! Tomek

It's a graph similarity comparison roughly based upon Similarity Flooding[1]. To summarize briefly, this large adjacency matrix assists in mapping nodes from one graph to their most similar counterparts in another graph. It does so inexactly in approximately n^5 time and n^4 space rather than exponential, and allows me to tweak my definition of "similar" as much as I like.

As for the graphs, I essentially take two input graphs, represented in adjacency matrix form (two 2-d matrices of size n^2 each, assuming equal sized graphs). Then, I compute the Kronecker Tensor Graph Product[2], which creates a matrix of size n^4. Depending on how you think about it, this matrix is a simple (although large) 2-D adjacency matrix of the product graph, and it can be treated as such for many operations. It can also be inspected in four dimensional space to examine the relationships between possible node pairs, but I don't do this. Frankly, it hurts my brain a little bit.

Another interesting note: I originally implemented the algorithm in Haskell, however the space overhead was far too high, and I wasn't able to parallelize the bits which should have been easy to parallelize. Elegant as Haskell may be, it turns out writing efficient Haskell programs is a helluva black art.

A further aside: I run this comparison for m graphs, so we're talking about m^2 computations of about n^5 time, n^4 space. I'm hoping D will help me parallelize many of the matrix computations, and should also make it easier to distribute the m^2 over many machines. (Up to 72 machines with 16 logical cores and 24GB each.) I was doing this with the Haskell version, but very inefficiently, via a number of hack-y, poorly automated scripts. I need to increase the efficiency and level of automation (via D) so that I can run this analysis many, many times -- I'm planning on tuning the similarity flooding parameters using machine learning. Anyway, it'll end up being a huge number of CPU hours no matter how I slice it.

So far it looks like D will help me cut a bunch of hours off (versus the Haskell version) without adding too many hours to my work week. I downloaded DMD a few days ago, and have already ported most of my code and cleaned it up a bit. It's nice to be back on D :) (BTW, I only decided to come back after reading Andrei's new book.)

~John

[1] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=994702
[2] http://en.wikipedia.org/wiki/Tensor_product_of_graphs

Sep 03 2010
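For readers following along, the dense Kronecker graph product John describes can be sketched in a few lines of D. This is an illustrative version only -- the function name and the tiny example graphs are mine, not from the thread:

```d
import std.stdio;

/// Dense Kronecker (tensor) product of two square adjacency matrices.
/// For n x n and m x m inputs, the result is the (n*m) x (n*m)
/// adjacency matrix of the product graph, stored row-major.
double[][] kronecker(double[][] a, double[][] b)
{
    auto n = a.length, m = b.length;
    auto r = new double[][](n * m, n * m);
    foreach (i; 0 .. n)
        foreach (j; 0 .. n)
            foreach (k; 0 .. m)
                foreach (l; 0 .. m)
                    r[i * m + k][j * m + l] = a[i][j] * b[k][l];
    return r;
}

void main()
{
    // Two tiny directed graphs as adjacency matrices.
    double[][] g1 = [[0.0, 1.0], [1.0, 0.0]];
    double[][] g2 = [[1.0, 1.0], [0.0, 0.0]];
    auto p = kronecker(g1, g2);
    writeln(p);  // the 4x4 product adjacency matrix
}
```

For equal-sized n-node graphs this is exactly the n^2 x n^2 (i.e. size n^4) matrix the thread is discussing; at n = 200 the quadruple loop and the allocation are what blow past the 32-bit address space.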

Tomek Sowiński wrote: On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote: As for the graphs, I essentially take two input graphs, represented in adjacency matrix form (two 2-d matrices of size n^2 each, assuming equal sized graphs). Then, I compute the Kronecker Tensor Graph Product[2], which creates a matrix of size n^4. Depending on how you think about it, this matrix is a simple (although large) 2-D adjacency matrix of the product graph, and it can be treated as such for many operations. It can also be inspected in four dimensional space to examine the relationships between possible node pairs, but I don't do this. Frankly, it hurts my brain a little bit.

Can't you compute the Kronecker product lazily? E.g. a proxy object that computes a value in an overloaded opIndex. Even if your algorithms inspect (compute) the same value several times, you may still win -- the bottleneck these days is memory access, not CPU cycles.

That's an interesting thought. It'd be a bit trickier than that since I'm doing some post-processing on the product, but not enough to make this impossible. I'll probably give this a shot if I really need to scale up -- it'd be difficult to fight n^4 space with more RAM... though I'd sure like to give it a shot :)

Fascinating stuff you're dealing with... Good luck.

Thanks ~John

Sep 08 2010
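Tomek's lazy-opIndex suggestion might look roughly like this in D -- a sketch under the thread's assumptions, with `LazyKronecker` and its members being illustrative names rather than anything posted in the thread:

```d
import std.stdio;

/// Lazy stand-in for the Kronecker product: entries of the virtual
/// (n*m) x (n*m) matrix are computed on demand in opIndex instead of
/// being materialized, so only the two factor matrices live in memory.
struct LazyKronecker
{
    double[][] a, b;  // the two square factor matrices

    double opIndex(size_t row, size_t col) const
    {
        auto m = b.length;
        return a[row / m][col / m] * b[row % m][col % m];
    }

    size_t dim() const { return a.length * b.length; }
}

void main()
{
    double[][] g1 = [[0.0, 1.0], [1.0, 0.0]];
    double[][] g2 = [[1.0, 1.0], [0.0, 0.0]];
    auto k = LazyKronecker(g1, g2);
    writeln(k[0, 2]);  // same value the dense 4x4 product holds at (0, 2)
}
```

Each lookup is one multiply, trading CPU for the n^4 storage -- which is exactly the memory-vs-cycles trade-off Tomek raises; John's post-processing step would need its own treatment on top of this.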

On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote: As for the graphs, I essentially take two input graphs, represented in adjacency matrix form (two 2-d matrices of size n^2 each, assuming equal sized graphs). Then, I compute the Kronecker Tensor Graph Product[2], which creates a matrix of size n^4. Depending on how you think about it, this matrix is a simple (although large) 2-D adjacency matrix of the product graph, and it can be treated as such for many operations. It can also be inspected in four dimensional space to examine the relationships between possible node pairs, but I don't do this. Frankly, it hurts my brain a little bit.

Can't you compute the Kronecker product lazily? E.g. a proxy object that computes a value in an overloaded opIndex. Even if your algorithms inspect (compute) the same value several times, you may still win -- the bottleneck these days is memory access, not CPU cycles.

Fascinating stuff you're dealing with... Good luck. Tomek

Sep 07 2010

Hello Tomek, On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote: As for the graphs, I essentially take two input graphs, represented in adjacency matrix form (two 2-d matrices of size n^2 each, assuming equal sized graphs). Then, I compute the Kronecker Tensor Graph Product[2], which creates a matrix of size n^4. Depending on how you think about it, this matrix is a simple (although large) 2-D adjacency matrix of the product graph, and it can be treated as such for many operations. It can also be inspected in four dimensional space to examine the relationships between possible node pairs, but I don't do this. Frankly, it hurts my brain a little bit.

Can't you compute the Kronecker product lazily? E.g. a proxy object that computes a value in an overloaded opIndex. Even if your algorithms inspect (compute) the same value several times, you may still win -- the bottleneck these days is memory access, not CPU cycles.

If enough elements from the 4d matrix are accessed in the wrong order, then the cache effects of doing it lazily might kill it. I'd guess that highly optimized code for doing the pre-compute version exists already.

Fascinating stuff you're dealing with... Good luck. Tomek

... <IXOYE><

Sep 08 2010

On 08-09-2010 at 15:58:31 BCS <none anon.com> wrote: Can't you compute the Kronecker product lazily? E.g. a proxy object that computes a value in an overloaded opIndex. Even if your algorithms inspect (compute) the same value several times, you may still win -- the bottleneck these days is memory access, not CPU cycles.

If enough elements from the 4d matrix are accessed in the wrong order, then the cache effects of doing it lazily might kill it. I'd guess that highly optimized code for doing the pre-compute version exists already.

Hm.. not sure what you mean by 'cache effects'. He was talking about working with a 200^4 matrix of doubles, which is the result of a Kronecker product of two 200^2 matrices. Now, if my maths are right, the lazy version needs (2*200^2) * 8 = 640000 bytes of memory. So the whole thing fits comfortably into the on-die cache, and large chunks can be loaded to the faster per-core caches. I'd say if the cache effects can kill anything, it'd be accessing elements of the precomputed result, which is 200^4 * 8 = 12,800,000,000 bytes big. Tomek

Sep 08 2010
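Tomek's figures can be double-checked with a few lines of D (an illustrative sanity check, not code from the thread):

```d
void main()
{
    enum ulong n = 200;

    // Lazy version: only the two n^2 factor matrices of doubles in memory.
    ulong lazyBytes = 2 * n * n * double.sizeof;        // 640,000 B
    // Precomputed version: the full n^4 product matrix.
    ulong denseBytes = n * n * n * n * double.sizeof;   // 12.8 GB

    assert(lazyBytes == 640_000);
    assert(denseBytes == 12_800_000_000);
}
```

So the lazy factors fit in a modern on-die cache with room to spare, while the dense product is three times larger than the 32-bit address space can even name, let alone map.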