
digitalmars.D - 32-bit Memory Limitations

reply John Demme <jdd cs.columbia.edu> writes:
Hello all-

I apologize since I'm sure this question has been asked numerous times 
previously, but I could not find it in the last 10k messages.

Is there a rough timeline for 64-bit DMD 2 support? (For Linux--I don't 
care about Windows.)  I understand that Walter is working on it and 
certainly don't expect a firm date, but I have no sense of the amount of 
work involved... Is this a one man-week feature or several man-months?

I know GDC has 64-bit support, but it has not been synced up in some time.  
Does anyone know if it will be updated in the near future?

I ask because I have written a scientific application that I would like to 
operate on a very large matrix of doubles--in the range of 200^4 elements-- 
requiring about 12GB of memory, far larger than the current ~4GB limit.  
Ideally, I'd like to ramp this up to even 20GB matrices.  I'm currently 
running on machines with 24GB of RAM (and we may upgrade a few next year), so 
this is not a performance issue, merely a software issue.

Additionally, is anyone aware of any extreme cleverness to transparently 
work around this issue?  I would imagine not, but I'm constantly amazed by 
some of the hacks I've seen.

4GB limits me to about 150^4 elements, which is acceptable for the time 
being.  As such, I'm not terribly interested in any extreme hacks to get 
around this.  I could obviously multi-process the computation (which would 
help in distributing it) but I don't need to do this yet.
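
For a quick sanity check on those figures (illustrative arithmetic only, not 
output from the application), a few lines of D spell out the sizes involved:

import std.stdio;

void main()
{
    // 200^4 doubles: the full matrix described above.
    enum ulong full = 200UL * 200 * 200 * 200 * double.sizeof;
    // 150^4 doubles: roughly the most that squeezes under 32-bit addressing.
    enum ulong reduced = 150UL * 150 * 150 * 150 * double.sizeof;
    writefln("200^4 doubles: %s bytes (~12.8 GB)", full);
    writefln("150^4 doubles: %s bytes (~4.05 GB)", reduced);
    writefln("32-bit limit:  %s bytes (~4.29 GB)", 1UL << 32);
}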

(Yes, all of those exponents are 4, not 2.  This is actually a 4 dimensional 
matrix, but for the purpose of most parts of the computation, I can treat it 
like a typical 2-dim matrix.  Not relevant, I suppose, but perhaps 
interesting.)

~John
Sep 03 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
John Demme:
 Is this a one man-week feature or several man-months?
I think a few Walter-months. Each Walter-month is about three human-months ;-)

Bye,
bearophile
Sep 03 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/3/10 17:10 CDT, bearophile wrote:
 John Demme:
 Is this a one man-week feature or several man-months?
 I think a few Walter-months. Each Walter-month is about three human-months ;-)

Should be quicker than that. I think we'll have something usable before the 
end of this month. Walter wanted to get to alpha in August, but was detoured 
by work on optlink.

Andrei
Sep 03 2010
parent John Demme <jdd cs.columbia.edu> writes:
Excellent.  Thanks very much.

Andrei Alexandrescu wrote:

 On 9/3/10 17:10 CDT, bearophile wrote:
 John Demme:
 Is this a one man-week feature or several man-months?
 I think a few Walter-months. Each Walter-month is about three human-months ;-)

 Should be quicker than that. I think we'll have something usable before the
 end of this month. Walter wanted to get to alpha in August, but was detoured
 by work on optlink.

 Andrei
Sep 03 2010
prev sibling parent reply Tomek Sowiński <just ask.me> writes:
On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote:

 (Yes, all of those exponents are 4, not 2.  This is actually a 4  =
 dimensional
 matrix, but for the purpose of most parts of the computation, I can  =
 treat it
 like a typical 2-dim matrix.  Not relevant, I suppose, but perhaps
 interesting.)
Very. What do you do with 4-D matrices? Tell us!

Tomek
Sep 03 2010
next sibling parent BCS <none anon.com> writes:
Hello Tomek,

 On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote:
 
 (Yes, all of those exponents are 4, not 2.  This is actually a 4
 dimensional
 matrix, but for the purpose of most parts of the computation, I can
 treat it
 like a typical 2-dim matrix.  Not relevant, I suppose, but perhaps
 interesting.)
 Very. What do you do with 4-D matrices? Tell us!

http://en.wikipedia.org/wiki/Tensor#Continuum_mechanics

FEA on elastic solids?
 Tomek
 
-- ... <IXOYE><
Sep 03 2010
prev sibling parent reply John Demme <jdd cs.columbia.edu> writes:
Tomek Sowiński wrote:

 On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote:
 
 (Yes, all of those exponents are 4, not 2.  This is actually a 4
 dimensional
 matrix, but for the purpose of most parts of the computation, I can
 treat it
 like a typical 2-dim matrix.  Not relevant, I suppose, but perhaps
 interesting.)
 Very. What do you do with 4-D matrices? Tell us!

 Tomek

It's a graph similarity comparison roughly based upon Similarity Flooding [1].

To summarize briefly, this large adjacency matrix assists in mapping nodes 
from one graph to their most similar counterparts in another graph. It does so 
inexactly, in approximately n^5 time and n^4 space rather than exponential 
time, and it allows me to tweak my definition of "similar" as much as I like.

As for the graphs, I essentially take two input graphs, represented in 
adjacency matrix form (two 2-D matrices of size n^2 each, assuming equal-sized 
graphs). Then I compute the Kronecker Tensor Graph Product [2], which creates 
a matrix of size n^4. Depending on how you think about it, this matrix is a 
simple (although large) 2-D adjacency matrix of the product graph, and it can 
be treated as such for many operations. It can also be inspected in 
four-dimensional space to examine the relationships between possible node 
pairs, but I don't do this. Frankly, it hurts my brain a little bit.

Another interesting note: I originally implemented the algorithm in Haskell, 
but the space overhead was far too high, and I wasn't able to parallelize the 
bits which should have been easy to parallelize. Elegant as Haskell may be, it 
turns out writing efficient Haskell programs is a helluva black art.

A further aside: I run this comparison for m graphs, so we're talking about 
m^2 computations of about n^5 time and n^4 space each. I'm hoping D will help 
me parallelize many of the matrix computations, and it should also make it 
easier to distribute the m^2 comparisons over many machines. (Up to 72 
machines with 16 logical cores and 24GB each.) I was doing this with the 
Haskell version, but very inefficiently and via a number of very hacky, poorly 
automated workarounds. I need to increase the efficiency and level of 
automation (via D) so that I can run this analysis many, many times -- I'm 
planning on tuning the similarity flooding parameters using machine learning.

Anyway, it'll end up being a huge number of CPU hours no matter how I slice 
it. So far it looks like D will help me cut a bunch of hours off (versus the 
Haskell version) without adding too many hours to my work week. I downloaded 
DMD a few days ago and have already ported most of my code and cleaned it up a 
bit. It's nice to be back on D :)

(BTW, I only decided to come back after reading Andrei's new book.)

~John

[1] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=994702
[2] http://en.wikipedia.org/wiki/Tensor_product_of_graphs
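
For concreteness, here is a minimal sketch (illustrative code, not part of the 
original post) of forming the Kronecker product of two n x n adjacency 
matrices eagerly.  Element (i, j) of the n^2 x n^2 result is 
a[i/n][j/n] * b[i%n][j%n], which is exactly where the n^4 storage cost comes 
from:

import std.stdio;

// Eager Kronecker product of two square adjacency matrices.
// For n x n inputs the result is n^2 x n^2 -- i.e. n^4 elements.
double[][] kronecker(double[][] a, double[][] b)
{
    immutable n = a.length;
    auto result = new double[][](n * n, n * n);
    foreach (i; 0 .. n * n)
        foreach (j; 0 .. n * n)
            result[i][j] = a[i / n][j / n] * b[i % n][j % n];
    return result;
}

void main()
{
    auto a = [[0.0, 1.0], [1.0, 0.0]];   // tiny example adjacency matrices
    auto b = [[1.0, 1.0], [0.0, 1.0]];
    auto k = kronecker(a, b);            // 4 x 4 product-graph matrix
    foreach (row; k)
        writeln(row);
}

For n = 200 the result is the 200^4-element, ~12.8 GB matrix discussed above.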
Sep 03 2010
parent reply Tomek Sowiński <just ask.me> writes:
On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote:

 As for the graphs, I essentially take two input graphs, represented in
 adjacency matrix form (two 2-D matrices of size n^2 each, assuming
 equal-sized graphs).  Then I compute the Kronecker Tensor Graph
 Product[2], which creates a matrix of size n^4.  Depending on how you
 think about it, this matrix is a simple (although large) 2-D adjacency
 matrix of the product graph, and it can be treated as such for many
 operations.  It can also be inspected in four-dimensional space to
 examine the relationships between possible node pairs, but I don't do
 this.  Frankly, it hurts my brain a little bit.

Can't you compute the Kronecker product lazily? E.g. a proxy object that 
computes a value in an overloaded opIndex. Even if your algorithms inspect 
(compute) the same value several times, you may still win -- the bottleneck 
these days is memory access, not CPU cycles.

Fascinating stuff you're dealing with... Good luck.

Tomek
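
For concreteness, a minimal sketch of the kind of proxy suggested above 
(illustrative code, not from the thread): a struct whose opIndex computes 
product elements on demand, so nothing of size n^4 is ever materialized.

struct LazyKronecker
{
    double[][] a, b;   // the two n x n input adjacency matrices

    // Element (i, j) of the n^2 x n^2 Kronecker product, computed on the fly.
    double opIndex(size_t i, size_t j) const
    {
        immutable n = b.length;
        return a[i / n][j / n] * b[i % n][j % n];
    }
}

// Usage: k[i, j] reads just like indexing the precomputed matrix would.
//   auto k = LazyKronecker(a, b);
//   auto x = k[37, 12_345];

Only the two n^2 input matrices need to be stored, which is the 
memory-for-cycles trade-off described above.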
Sep 07 2010
next sibling parent reply BCS <none anon.com> writes:
Hello Tomek,

 On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote:
 
 As for the graphs, I essentially take two input graphs, represented
 in
 adjacency matrix form (two 2-d matrices of size n^2 each, assuming
 equal
 sized graphs).  Then, I compute the Kronecker Tensor Graph
 Product[2],
 which
 creates a matrix of size n^4.  Depending on how you think about it,
 this
 matrix is a simple (although large) 2-D adjacency matrix of the
 product
 graph, and it can be treated as such for many operations.  It can
 also be
 inspected in four dimensional space to examine the relationships
 between
 possible node pairs, but I don't do this.  Frankly, it hurts my brain
 a
 little bit.
 Can't you compute the Kronecker product lazily? E.g. a proxy object that
 computes a value in an overloaded opIndex. Even if your algorithms inspect
 (compute) the same value several times, you may still win -- the bottleneck
 these days is memory access, not CPU cycles.

If enough elements from the 4-D matrix are accessed in the wrong order, then 
the cache effects of doing it lazily might kill it. I'd guess that highly 
optimized code for doing the pre-compute version exists already.
 Fascinating stuff you're dealing with... Good luck.
 
 Tomek
 
-- ... <IXOYE><
Sep 08 2010
parent Tomek Sowiński <just ask.me> writes:
On 08-09-2010 at 15:58:31 BCS <none anon.com> wrote:

 Can't you compute the Kronecker product lazily? E.g. a proxy object
 that computes a value in an overloaded opIndex. Even if your
 algorithms inspect (compute) the same value several times, you may
 still win -- the bottleneck these days is memory access, not CPU
 cycles.

 If enough elements from the 4-D matrix are accessed in the wrong order,
 then the cache effects of doing it lazily might kill it. I'd guess that
 highly optimized code for doing the pre-compute version exists already.

Hm.. not sure what you mean by 'cache effects'. He was talking about working 
with a 200^4 matrix of doubles, which is the result of a Kronecker product on 
two 200^2 matrices. Now, if my maths are right, the lazy version needs 
(2*200^2) * 8 = 640,000 bytes of memory. So the whole thing fits comfortably 
into the on-die cache, and large chunks can be loaded into the faster per-core 
caches.

I'd say that if cache effects can kill anything, it's accessing elements of 
the precomputed result, which is 200^4 * 8 = 12,800,000,000 bytes big.

Tomek
Sep 08 2010
prev sibling parent John Demme <jdd cs.columbia.edu> writes:
Tomek Sowiński wrote:

 On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote:
 
 As for the graphs, I essentially take two input graphs, represented in
 adjacency matrix form (two 2-d matrices of size n^2 each, assuming equal
 sized graphs).  Then, I compute the Kronecker Tensor Graph Product[2],
 which
 creates a matrix of size n^4.  Depending on how you think about it, this
 matrix is a simple (although large) 2-D adjacency matrix of the product
 graph, and it can be treated as such for many operations.  It can also be
 inspected in four dimensional space to examine the relationships between
 possible node pairs, but I don't do this.  Frankly, it hurts my brain a
 little bit.
 Can't you compute the Kronecker product lazily? E.g. a proxy object that
 computes a value in an overloaded opIndex. Even if your algorithms inspect
 (compute) the same value several times, you may still win -- the bottleneck
 these days is memory access, not CPU cycles.

That's an interesting thought. It'd be a bit trickier than that since I'm doing some post-processing on the product, but not enough to make this impossible. I'll probably give this a shot if I really need to scale up -- it'd be difficult to fight n^4 space with more RAM... though I'd sure like to give it a shot :)
 
 Fascinating stuff you're dealing with... Good luck.
Thanks

~John
Sep 08 2010