
digitalmars.D - XML Benchmarks in D

reply Scott Sanders <scott stonecobra.com> writes:
I have done some benchmarks of the D xml parsers alongside C/C++/Java parsers,
and as you can see from the graphs, D is rocking with Tango!

http://dotnot.org/blog/index.php

I wanted to post to let the D community know that good language and library
design can really make an impact.

As always, I am open to comments/changes/additions, etc.  I will be happy to
run any other project code through the benchmark if someone submits a patch to
me containing the code.

And Walter, I am trying to use "D Programming Language" everywhere I can :)

Cheers,
Scott Sanders
Mar 12 2008
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Nice work!


Sean
Mar 12 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Scott Sanders wrote:
 I have done some benchmarks of the D xml parsers alongside C/C++/Java
 parsers, and as you can see from the graphs, D is rocking with Tango!
 
 
 http://dotnot.org/blog/index.php

Reddit link: http://reddit.com/r/programming/info/6bt6n/comments/
Mar 12 2008
prev sibling parent reply N/A <NA NA.na> writes:
== Quote from Scott Sanders (scott stonecobra.com)'s article
 I have done some benchmarks of the D xml parsers alongside C/C++/Java parsers,
 and as you can see from the graphs, D is rocking with Tango!

 http://dotnot.org/blog/index.php

 I wanted to post to let the D community know that good language and library
 design can really make an impact.

 As always, I am open to comments/changes/additions, etc.  I will be happy to
 run any other project code through the benchmark if someone submits a patch
 to me containing the code.

The charts look great. I generally handle files that are a few hundred MB to a few gigs, and I noticed that the input is a char[]. Do you also plan on adding file streams as input?

N/A
Mar 12 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from N/A (NA NA.na)'s article
 I generally handle files that are a few hundred MB to a few gigs, and I
 noticed that the input is a char[]. Do you also plan on adding file streams
 as input?

I believe the suggested approach in this case is to access the input as a memory-mapped file. This does place some restrictions on file size in 32-bit applications, but then those are ideally in decline.

Sean
Mar 12 2008
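Sean's memory-mapped-file suggestion can be illustrated outside of Tango; here is a minimal sketch in Python (the file name and XML content are invented for the demo). The point is that the mapping behaves like an in-memory byte buffer, so a parser expecting char[]-style input can consume it while the OS pages the file in lazily:

```python
import mmap
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in XML file (in practice this could be hundreds of MB).
with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".xml") as f:
    f.write(b"<root><item>42</item></root>")
    path = f.name

with open(path, "rb") as fh:
    # Length 0 maps the whole file; pages are faulted in on demand.
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        # The mapping supports slicing like an ordinary byte buffer ...
        print(buf[6:12])                 # b'<item>'
        # ... so its contents can be handed straight to a parser.
        doc = ET.fromstring(buf[:])
        print(doc.find("item").text)     # 42

os.remove(path)
```

The same idea is what the FileConduit/MappedBuffer pairing shown further down the thread provides on the Tango side.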
next sibling parent reply N/A <NA Na.na> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 I believe the suggested approach in this case is to access the input as a
 memory-mapped file. This does place some restrictions on file size in 32-bit
 applications, but then those are ideally in decline.

Any examples on how to approach this using Tango?

Cheers,
N/A
Mar 12 2008
parent reply Scott Sanders <scott stonecobra.com> writes:
N/A Wrote:

 Any examples on how to approach this using Tango?

 Cheers,
 N/A

Should be able to:

auto fc = new FileConduit ("test.txt");
auto buf = new MappedBuffer(fc);
auto doc = new Document!(char);
doc.parse(buf.getContent());

That should do it.
Mar 12 2008
parent reply N/A <NA NA.com> writes:
 Should be able to:
 auto fc = new FileConduit ("test.txt");
 auto buf = new MappedBuffer(fc);
 auto doc = new Document!(char);
 doc.parse(buf.getContent());
 That should do it.

Thanks, I was wondering how to do it using the PullParser.

Cheers
Mar 13 2008
parent Scott Sanders <scott stonecobra.com> writes:
N/A Wrote:

 
 Should be able to:
 auto fc = new FileConduit ("test.txt");
 auto buf = new MappedBuffer(fc);
 auto doc = new Document!(char);
 doc.parse(buf.getContent());
 That should do it.

 Thanks, I was wondering how to do it using the PullParser.

Scott
Mar 13 2008
prev sibling parent reply BCS <BCS pathlink.com> writes:
Sean Kelly wrote:
 == Quote from N/A (NA NA.na)'s article
 I generally handle files that are a few hundred MB to a few gigs, and I
 noticed that the input is a char[]. Do you also plan on adding file
 streams as input?

 I believe the suggested approach in this case is to access the input as a
 memory-mapped file. This does place some restrictions on file size in
 32-bit applications, but then those are ideally in decline.

What might be interesting is to make a version that works with slices of the file rather than RAM (make the current version into a template specialized on char[] and the new one on some new type?). That way only the parsed metadata needs to stay in RAM. It would take a lot of games mapping stuff in and out of RAM, but it would be interesting to see if it could be done.
Mar 13 2008
parent reply Kris <foo bar.com> writes:
BCS Wrote:

 What might be interesting is to make a version that works with slices of
 the file rather than RAM (make the current version into a template
 specialized on char[] and the new one on some new type?). That way only
 the parsed metadata needs to stay in RAM. It would take a lot of games
 mapping stuff in and out of RAM, but it would be interesting to see if it
 could be done.

It would be interesting, but isn't that kinda what memory-mapped files provide for? You can operate on files up to 4GB in size (on a 32-bit system), even with DOM, where the slices are virtual addresses within paged file-blocks. Effectively, each paged segment of the file is a lower-level slice?
Mar 13 2008
parent reply BCS <ao pathlink.com> writes:
Reply to kris,

 BCS Wrote:
 
 What might be interesting is to make a version that works with slices
 of the file rather than RAM (make the current version into a
 template specialized on char[] and the new one on some new type?).
 That way only the parsed metadata needs to stay in RAM. It would
 take a lot of games mapping stuff in and out of RAM, but it would be
 interesting to see if it could be done.

 It would be interesting, but isn't that kinda what memory-mapped files
 provide for? You can operate on files up to 4GB in size (on a 32-bit
 system), even with DOM, where the slices are virtual addresses within
 paged file-blocks. Effectively, each paged segment of the file is a
 lower-level slice?

Not as I understand it (I looked this up about a year ago, so I'm a bit rusty). On 32 bits you can't map in 4GB, because you need space for the program's code (and on Windows you only get 3GB of address space, as the OS gets that last GB). Also, what about a 10GB file?

My idea is to make some sort of lib that lets you handle large data sets (64-bit?). You would ask for a file to be "mapped in" and then you would get an object that syntactically looks like an array. Index ops would actually map in pieces; slices would generate new objects (with a ref to the parent) that would, on demand, map stuff in. Some sort of GCish thing would start unmapping/moving strings when space gets tight. If you never have to actually convert the data to a "real" array, you don't ever need to copy the stuff; you can just leave it in the file.

I'm not sure it's even possible or how it would work, but it would be cool. (and highly useful)
Mar 13 2008
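BCS's idea of an object that "syntactically looks like an array" but maps file pieces in on demand can be prototyped compactly. Here is a rough sketch in Python (the class name FileSlice and the demo file are invented for illustration; a real version would cache and recycle windows instead of remapping on every access, and add the GC-like eviction BCS describes):

```python
import mmap
import os
import tempfile

class FileSlice:
    """Read-only, array-like view of a file. Indexing maps in only an
    aligned window around the requested bytes, so the whole file never
    has to fit in the address space."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.size = os.fstat(self.fd).st_size
        # mmap offsets must be multiples of the allocation granularity.
        self.gran = mmap.ALLOCATIONGRANULARITY

    def __len__(self):
        return self.size

    def __getitem__(self, key):
        if isinstance(key, int):
            if key < 0:
                key += self.size
            key = slice(key, key + 1)
        start, stop, _ = key.indices(self.size)
        if stop <= start:
            return b""
        base = (start // self.gran) * self.gran   # aligned window start
        with mmap.mmap(self.fd, stop - base,
                       access=mmap.ACCESS_READ, offset=base) as win:
            return win[start - base:stop - base]

    def close(self):
        os.close(self.fd)

# Demo on a small file standing in for a multi-GB XML dump.
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(b"<item>payload</item>\n" * 10000)
    path = f.name

fs = FileSlice(path)
print(len(fs))          # 210000
print(fs[8190:8196])    # b'<item>' -- a file slice, mapped on demand
fs.close()
os.remove(path)
```

Each access touches only one small mapped window, which is why even a file larger than a 32-bit address space could be sliced piecemeal this way.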
next sibling parent "Kris" <foo bar.com> writes:
Reply to BCS:

"BCS" <ao pathlink.com> wrote in message 
news:55391cb32a6178ca5358fd65a320 news.digitalmars.com...
 Reply to kris,

 BCS Wrote:

 what might be interesting is to make a version that works with slices
 of the file rather than ram. (make the current version into a
 template specialized on char[] and the new one on some new type?)
 That way only the parsed meta data needs to stay in ram. It would
 take a lot of games mapping stuff in and out of ram but it would be
 interesting to see if it could be done.

 It would be interesting, but isn't that kinda what memory-mapped files
 provide for? You can operate on files up to 4GB in size (on a 32-bit
 system), even with DOM, where the slices are virtual addresses within
 paged file-blocks. Effectively, each paged segment of the file is a
 lower-level slice?

 Not as I understand it (I looked this up about a year ago, so I'm a bit
 rusty). On 32 bits you can't map in 4GB, because you need space for the
 program's code (and on Windows you only get 3GB of address space, as the
 OS gets that last GB).

Doh. You're right, of course. Thank goodness for 64bit machines :)
Mar 13 2008
prev sibling parent reply Alexander Panek <alexander.panek brainsware.org> writes:
BCS wrote:
 Reply to kris,
 
 BCS Wrote:

 what might be interesting is to make a version that works with slices
 of the file rather than ram. (make the current version into a
 template specialized on char[] and the new one on some new type?)
 That way only the parsed meta data needs to stay in ram. It would
 take a lot of games mapping stuff in and out of ram but it would be
 interesting to see if it could be done.

 It would be interesting, but isn't that kinda what memory-mapped files
 provide for? You can operate on files up to 4GB in size (on a 32-bit
 system), even with DOM, where the slices are virtual addresses within
 paged file-blocks. Effectively, each paged segment of the file is a
 lower-level slice?

 Not as I understand it (I looked this up about a year ago, so I'm a bit
 rusty). On 32 bits you can't map in 4GB, because you need space for the
 program's code (and on Windows you only get 3GB of address space, as the
 OS gets that last GB). Also, what about a 10GB file? My idea is to make
 some sort of lib that lets you handle large data sets (64-bit?). You
 would ask for a file to be "mapped in" and then you would get an object
 that syntactically looks like an array. Index ops would actually map in
 pieces; slices would generate new objects (with a ref to the parent)
 that would, on demand, map stuff in. Some sort of GCish thing would
 start unmapping/moving strings when space gets tight. If you never have
 to actually convert the data to a "real" array, you don't ever need to
 copy the stuff; you can just leave it in the file.

I've got this strange feeling in my stomach that shouts out "WTF?!" when I read about >3-4GB XML files. I know, it's about the "if" and "whens", but still; if you find yourself needing such a beast of an XML file, you might possibly think of other forms of data structuring (a database, perhaps?).
Mar 14 2008
next sibling parent reply "Koroskin Denis" <2korden+dmd gmail.com> writes:
On Fri, 14 Mar 2008 11:40:20 +0300, Alexander Panek  
<alexander.panek brainsware.org> wrote:

 I've got this strange feeling in my stomach that shouts out "WTF?!" when
 I read about >3-4GB XML files. I know, it's about the "if" and "whens",
 but still; if you find yourself needing such a beast of an XML file, you
 might possibly think of other forms of data structuring (a database,
 perhaps?).

It sounds strange, but even large companies like Google or Yahoo store their temporary search indexes in ULTRA large XML files, and many of them can easily be tens or even hundreds of GBs in size (just an ordinary daily index) before they get "repacked" into a more compact format.
Mar 14 2008
next sibling parent Alexander Panek <alexander.panek brainsware.org> writes:
Koroskin Denis wrote:
 On Fri, 14 Mar 2008 11:40:20 +0300, Alexander Panek 
 <alexander.panek brainsware.org> wrote:
 I've got this strange feeling in my stomach that shouts out "WTF?!" 
 when I read about >3-4GB XML files. I know, it's about the "if" and 
 "whens", but still; if you find yourself needing such a beast of an 
 XML file, you might possibly think of other forms of data structuring 
 (a database, perhaps?).

 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of them
 can easily be tens or even hundreds of GBs in size (just an ordinary daily
 index) before they get "repacked" into a more compact format.

That does, indeed, sound strange. :X
Mar 14 2008
prev sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
Koroskin Denis wrote:
 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of
 them can easily be tens or even hundreds of GBs in size (just an ordinary
 daily index) before they get "repacked" into a more compact format.

It's a shame the "O RLY?" owl died out years ago...
Mar 14 2008
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Robert Fraser" <fraserofthenight gmail.com> wrote in message 
news:freg27$1m7l$1 digitalmars.com...
 Koroskin Denis wrote:
 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of them
 can easily be tens or even hundreds of GBs in size (just an ordinary daily
 index) before they get "repacked" into a more compact format.

 It's a shame the "O RLY?" owl died out years ago...

O RLY? Good internet memes never die, they just go into hibernation ;)
Mar 14 2008
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Robert Fraser wrote:
 Koroskin Denis wrote:
 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of
 them can easily be tens or even hundreds of GBs in size (just an ordinary
 daily index) before they get "repacked" into a more compact format.

 It's a shame the "O RLY?" owl died out years ago...

SRSLY? :P

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Mar 23 2008
parent Christopher Wright <dhasenan gmail.com> writes:
Bruno Medeiros wrote:
 Robert Fraser wrote:
 Koroskin Denis wrote:
 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of
 them can easily be tens or even hundreds of GBs in size (just an ordinary
 daily index) before they get "repacked" into a more compact format.

 It's a shame the "O RLY?" owl died out years ago...

 SRSLY? :P

You know Sir Sly?
Mar 23 2008
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Alexander Panek (alexander.panek brainsware.org)'s article
 I've got this strange feeling in my stomach that shouts out "WTF?!" when
 I read about >3-4GB XML files. I know, it's about the "if" and "whens",
 but still; if you find yourself needing such a beast of an XML file, you
 might possibly think of other forms of data structuring (a database,
 perhaps?).

It's quite possible that an XML stream could be used as the transport mechanism for the result of a database query. In such an instance, I wouldn't be at all surprised if a response were more than 3-4GB. In fact, I've designed such a system, and the proper query would definitely have produced such a dataset.

Sean
Mar 14 2008
prev sibling parent BCS <ao pathlink.com> writes:
Reply to Alexander,

 BCS wrote:
 
 Not as I understand it (I looked this up about a year ago, so I'm a
 bit rusty). On 32 bits you can't map in 4GB, because you need space
 for the program's code (and on Windows you only get 3GB of address
 space, as the OS gets that last GB). Also, what about a 10GB file? My
 idea is to make some sort of lib that lets you handle large data
 sets (64-bit?). You would ask for a file to be "mapped in" and then
 you would get an object that syntactically looks like an array. Index
 ops would actually map in pieces; slices would generate new objects
 (with a ref to the parent) that would, on demand, map stuff in. Some
 sort of GCish thing would start unmapping/moving strings when space
 gets tight. If you never have to actually convert the data to a
 "real" array, you don't ever need to copy the stuff; you can just
 leave it in the file. I'm not sure it's even possible or how it would
 work, but it would be cool. (and highly useful)
 

 I've got this strange feeling in my stomach that shouts out "WTF?!" when
 I read about >3-4GB XML files. I know, it's about the "if" and "whens",
 but still; if you find yourself needing such a beast of an XML file, you
 might possibly think of other forms of data structuring (a database,
 perhaps?).

Truth be told, I'm not that far from agreeing with you (on seeing that, I'd think: "WTF?!?!.... Um... OoooK.... well..."). I can't think of a justification for the lib I described if the only thing it would be used for were an XML parser. It might be used for managing parts of something like... a database table. <G>
Mar 14 2008