
digitalmars.D - 32-bit Memory Limitations

reply John Demme <jdd cs.columbia.edu> writes:
Hello all-

I apologize since I'm sure this question has been asked numerous times 
previously, but I could not find it in the last 10k messages.

Is there a rough timeline for 64-bit DMD 2 support? (For Linux--I don't 
care about Windows.)  I understand that Walter is working on it and 
certainly don't expect a firm date, but I have no sense of the amount of 
work involved... Is this a one man-week feature or several man-months?

I know GDC has 64-bit support, but it has not been synced up in some time.  
Does anyone know if it will be updated in the near future?

I ask because I have written a scientific application that I would like to 
operate on a very large matrix of doubles--in the range of 200^4 elements-- 
requiring about 12GB of memory, far larger than the current ~4GB limit.  
Ideally, I'd like to ramp this up to even 20GB matrices.  I'm currently 
running on machines with 24GB of RAM (and we may upgrade a few next year), so 
this is not a performance issue, merely a software issue.

Additionally, is anyone aware of any extreme cleverness to transparently 
work around this issue?  I would imagine not, but I'm constantly amazed by 
some of the hacks I've seen.

4GB limits me to about 150^4 elements, which is acceptable for the time 
being.  As such, I'm not terribly interested in any extreme hacks to get 
around this.  I could obviously multi-process the computation (which would 
help in distributing it) but I don't need to do this yet.
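
For a quick sanity check on those figures (illustrative arithmetic only, not 
output from the application), a few lines of D spell out the sizes involved:

import std.stdio;

void main()
{
    // 200^4 doubles: the full matrix described above.
    enum ulong full = 200UL * 200 * 200 * 200 * double.sizeof;
    // 150^4 doubles: roughly the most that squeezes under 32-bit addressing.
    enum ulong reduced = 150UL * 150 * 150 * 150 * double.sizeof;
    writefln("200^4 doubles: %s bytes (~12.8 GB)", full);
    writefln("150^4 doubles: %s bytes (~4.05 GB)", reduced);
    writefln("32-bit limit:  %s bytes (~4.29 GB)", 1UL << 32);
}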

(Yes, all of those exponents are 4, not 2.  This is actually a 4 dimensional 
matrix, but for the purpose of most parts of the computation, I can treat it 
like a typical 2-dim matrix.  Not relevant, I suppose, but perhaps 
interesting.)

~John
Sep 03 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
John Demme:
 Is this a one man-week feature or several man-months?
I think a few Walter-months. Each Walter-month is about three human-months ;-)

Bye,
bearophile
Sep 03 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/3/10 17:10 CDT, bearophile wrote:
 John Demme:
 Is this a one man-week feature or several man-months?
 I think a few Walter-months. Each Walter-month is about three human-months ;-)

Should be quicker than that. I think we'll have something usable before the 
end of this month. Walter wanted to get to alpha in August, but was detoured 
by work on optlink.

Andrei
Sep 03 2010
parent John Demme <jdd cs.columbia.edu> writes:
Excellent.  Thanks very much.

Andrei Alexandrescu wrote:

 On 9/3/10 17:10 CDT, bearophile wrote:
 John Demme:
 Is this a one man-week feature or several man-months?
 I think a few Walter-months. Each Walter-month is about three human-months ;-)

 Should be quicker than that. I think we'll have something usable before the
 end of this month. Walter wanted to get to alpha in August, but was detoured
 by work on optlink.

 Andrei
Sep 03 2010
prev sibling parent reply Tomek Sowiński <just ask.me> writes:
On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote:

 (Yes, all of those exponents are 4, not 2.  This is actually a 4  =
 dimensional
 matrix, but for the purpose of most parts of the computation, I can  =
 treat it
 like a typical 2-dim matrix.  Not relevant, I suppose, but perhaps
 interesting.)
Very. What do you do with 4-D matrices? Tell us!

Tomek
Sep 03 2010
next sibling parent BCS <none anon.com> writes:
Hello Tomek,

 On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote:
 
 (Yes, all of those exponents are 4, not 2.  This is actually a 4
 dimensional
 matrix, but for the purpose of most parts of the computation, I can
 treat it
 like a typical 2-dim matrix.  Not relevant, I suppose, but perhaps
 interesting.)
 Very. What do you do with 4-D matrices? Tell us!

http://en.wikipedia.org/wiki/Tensor#Continuum_mechanics

FEA on elastic solids?
 Tomek
 
-- ... <IXOYE><
Sep 03 2010
prev sibling parent reply John Demme <jdd cs.columbia.edu> writes:
Tomek Sowiński wrote:

 On 03-09-2010 at 23:37:18 John Demme <jdd cs.columbia.edu> wrote:
 
 (Yes, all of those exponents are 4, not 2.  This is actually a 4
 dimensional
 matrix, but for the purpose of most parts of the computation, I can
 treat it
 like a typical 2-dim matrix.  Not relevant, I suppose, but perhaps
 interesting.)
 Very. What do you do with 4-D matrices? Tell us!

 Tomek

It's a graph similarity comparison roughly based upon Similarity Flooding [1].

To summarize briefly, this large adjacency matrix assists in mapping nodes 
from one graph to their most similar counterparts in another graph. It does so 
inexactly, in approximately n^5 time and n^4 space rather than exponential 
time, and it allows me to tweak my definition of "similar" as much as I like.

As for the graphs, I essentially take two input graphs, represented in 
adjacency matrix form (two 2-D matrices of size n^2 each, assuming equal-sized 
graphs). Then I compute the Kronecker Tensor Graph Product [2], which creates 
a matrix of size n^4. Depending on how you think about it, this matrix is a 
simple (although large) 2-D adjacency matrix of the product graph, and it can 
be treated as such for many operations. It can also be inspected in 
four-dimensional space to examine the relationships between possible node 
pairs, but I don't do this. Frankly, it hurts my brain a little bit.

Another interesting note: I originally implemented the algorithm in Haskell, 
but the space overhead was far too high, and I wasn't able to parallelize the 
bits which should have been easy to parallelize. Elegant as Haskell may be, it 
turns out writing efficient Haskell programs is a helluva black art.

A further aside: I run this comparison for m graphs, so we're talking about 
m^2 computations of about n^5 time and n^4 space each. I'm hoping D will help 
me parallelize many of the matrix computations, and it should also make it 
easier to distribute the m^2 comparisons over many machines. (Up to 72 
machines with 16 logical cores and 24GB each.) I was doing this with the 
Haskell version, but very inefficiently and via a number of very hacky, poorly 
automated workarounds. I need to increase the efficiency and level of 
automation (via D) so that I can run this analysis many, many times -- I'm 
planning on tuning the similarity flooding parameters using machine learning.

Anyway, it'll end up being a huge number of CPU hours no matter how I slice 
it. So far it looks like D will help me cut a bunch of hours off (versus the 
Haskell version) without adding too many hours to my work week. I downloaded 
DMD a few days ago and have already ported most of my code and cleaned it up a 
bit. It's nice to be back on D :)

(BTW, I only decided to come back after reading Andrei's new book.)

~John

[1] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=994702
[2] http://en.wikipedia.org/wiki/Tensor_product_of_graphs
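
For concreteness, here is a minimal sketch (illustrative code, not part of the 
original post) of forming the Kronecker product of two n x n adjacency 
matrices eagerly.  Element (i, j) of the n^2 x n^2 result is 
a[i/n][j/n] * b[i%n][j%n], which is exactly where the n^4 storage cost comes 
from:

import std.stdio;

// Eager Kronecker product of two square adjacency matrices.
// For n x n inputs the result is n^2 x n^2 -- i.e. n^4 elements.
double[][] kronecker(double[][] a, double[][] b)
{
    immutable n = a.length;
    auto result = new double[][](n * n, n * n);
    foreach (i; 0 .. n * n)
        foreach (j; 0 .. n * n)
            result[i][j] = a[i / n][j / n] * b[i % n][j % n];
    return result;
}

void main()
{
    auto a = [[0.0, 1.0], [1.0, 0.0]];   // tiny example adjacency matrices
    auto b = [[1.0, 1.0], [0.0, 1.0]];
    auto k = kronecker(a, b);            // 4 x 4 product-graph matrix
    foreach (row; k)
        writeln(row);
}

For n = 200 the result is the 200^4-element, ~12.8 GB matrix discussed above.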
Sep 03 2010
parent reply Tomek Sowiński <just ask.me> writes:
On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote:

 As for the graphs, I essentially take two input graphs, represented in
 adjacency matrix form (two 2-D matrices of size n^2 each, assuming
 equal-sized graphs).  Then I compute the Kronecker Tensor Graph
 Product[2], which creates a matrix of size n^4.  Depending on how you
 think about it, this matrix is a simple (although large) 2-D adjacency
 matrix of the product graph, and it can be treated as such for many
 operations.  It can also be inspected in four-dimensional space to
 examine the relationships between possible node pairs, but I don't do
 this.  Frankly, it hurts my brain a little bit.

Can't you compute the Kronecker product lazily? E.g. a proxy object that 
computes a value in an overloaded opIndex. Even if your algorithms inspect 
(compute) the same value several times, you may still win -- the bottleneck 
these days is memory access, not CPU cycles.

Fascinating stuff you're dealing with... Good luck.

Tomek
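
For concreteness, a minimal sketch of the kind of proxy suggested above 
(illustrative code, not from the thread): a struct whose opIndex computes 
product elements on demand, so nothing of size n^4 is ever materialized.

struct LazyKronecker
{
    double[][] a, b;   // the two n x n input adjacency matrices

    // Element (i, j) of the n^2 x n^2 Kronecker product, computed on the fly.
    double opIndex(size_t i, size_t j) const
    {
        immutable n = b.length;
        return a[i / n][j / n] * b[i % n][j % n];
    }
}

// Usage: k[i, j] reads just like indexing the precomputed matrix would.
//   auto k = LazyKronecker(a, b);
//   auto x = k[37, 12_345];

Only the two n^2 input matrices need to be stored, which is the 
memory-for-cycles trade-off described above.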
Sep 07 2010
next sibling parent reply BCS <none anon.com> writes:
Hello Tomek,

 On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote:
 
 As for the graphs, I essentially take two input graphs, represented
 in
 adjacency matrix form (two 2-d matrices of size n^2 each, assuming
 equal
 sized graphs).  Then, I compute the Kronecker Tensor Graph
 Product[2],
 which
 creates a matrix of size n^4.  Depending on how you think about it,
 this
 matrix is a simple (although large) 2-D adjacency matrix of the
 product
 graph, and it can be treated as such for many operations.  It can
 also be
 inspected in four dimensional space to examine the relationships
 between
 possible node pairs, but I don't do this.  Frankly, it hurts my brain
 a
 little bit.
 Can't you compute the Kronecker product lazily? E.g. a proxy object that
 computes a value in an overloaded opIndex. Even if your algorithms inspect
 (compute) the same value several times, you may still win -- the bottleneck
 these days is memory access, not CPU cycles.

If enough elements from the 4-D matrix are accessed in the wrong order, then 
the cache effects of doing it lazily might kill it. I'd guess that highly 
optimized code for doing the pre-compute version exists already.
 Fascinating stuff you're dealing with... Good luck.
 
 Tomek
 
-- ... <IXOYE><
Sep 08 2010
parent Tomek Sowiński <just ask.me> writes:
On 08-09-2010 at 15:58:31 BCS <none anon.com> wrote:

 Can't you compute the Kronecker product lazily? E.g. a proxy object
 that computes a value in an overloaded opIndex. Even if your
 algorithms inspect (compute) the same value several times, you may
 still win -- the bottleneck these days is memory access, not CPU
 cycles.

 If enough elements from the 4-D matrix are accessed in the wrong order,
 then the cache effects of doing it lazily might kill it. I'd guess that
 highly optimized code for doing the pre-compute version exists already.

Hm.. not sure what you mean by 'cache effects'. He was talking about working 
with a 200^4 matrix of doubles, which is the result of a Kronecker product on 
two 200^2 matrices. Now, if my maths are right, the lazy version needs 
(2*200^2) * 8 = 640,000 bytes of memory. So the whole thing fits comfortably 
into the on-die cache, and large chunks can be loaded into the faster per-core 
caches.

I'd say that if cache effects can kill anything, it's accessing elements of 
the precomputed result, which is 200^4 * 8 = 12,800,000,000 bytes big.

Tomek
Sep 08 2010
prev sibling parent John Demme <jdd cs.columbia.edu> writes:
Tomek Sowiński wrote:

 On 04-09-2010 at 08:03:12 John Demme <jdd cs.columbia.edu> wrote:
 
 As for the graphs, I essentially take two input graphs, represented in
 adjacency matrix form (two 2-d matrices of size n^2 each, assuming equal
 sized graphs).  Then, I compute the Kronecker Tensor Graph Product[2],
 which
 creates a matrix of size n^4.  Depending on how you think about it, this
 matrix is a simple (although large) 2-D adjacency matrix of the product
 graph, and it can be treated as such for many operations.  It can also be
 inspected in four dimensional space to examine the relationships between
 possible node pairs, but I don't do this.  Frankly, it hurts my brain a
 little bit.
 Can't you compute the Kronecker product lazily? E.g. a proxy object that
 computes a value in an overloaded opIndex. Even if your algorithms inspect
 (compute) the same value several times, you may still win -- the bottleneck
 these days is memory access, not CPU cycles.

That's an interesting thought. It'd be a bit trickier than that since I'm doing some post-processing on the product, but not enough to make this impossible. I'll probably give this a shot if I really need to scale up -- it'd be difficult to fight n^4 space with more RAM... though I'd sure like to give it a shot :)
 
 Fascinating stuff you're dealing with... Good luck.
Thanks

~John
Sep 08 2010