
digitalmars.D - XML Benchmarks in D

reply Scott Sanders <scott stonecobra.com> writes:
I have done some benchmarks of the D xml parsers alongside C/C++/Java parsers,
and as you can see from the graphs, D is rocking with Tango!

http://dotnot.org/blog/index.php

I wanted to post to let the D community know that good language and library
design can really make an impact.

As always, I am open to comments/changes/additions, etc.  I will be happy to
run any other project code through the benchmark if someone submits a patch to
me containing the code.

And Walter, I am trying to use "D Programming Language" everywhere I can :)

Cheers,
Scott Sanders
Mar 12 2008
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Nice work!


Sean
Mar 12 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Scott Sanders wrote:
 I have done some benchmarks of the D xml parsers alongside C/C++/Java
 parsers, and as you can see from the graphs, D is rocking with Tango!
 
 
 http://dotnot.org/blog/index.php

Reddit link: http://reddit.com/r/programming/info/6bt6n/comments/
Mar 12 2008
prev sibling parent reply N/A <NA NA.na> writes:
== Quote from Scott Sanders (scott stonecobra.com)'s article
 I have done some benchmarks of the D xml parsers alongside C/C++/Java parsers,
 and as you can see from the graphs, D is rocking with Tango!

 http://dotnot.org/blog/index.php

 I wanted to post to let the D community know that good language and library
 design can really make an impact.

 As always, I am open to comments/changes/additions, etc.  I will be happy to
 run any other project code through the benchmark if someone submits a patch
 to me containing the code.

The charts look great. I generally handle files that are a few hundred MB to a few gigs, and I noticed that the input is a char[]. Do you also plan on adding file streams as input?

N/A
Mar 12 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from N/A (NA NA.na)'s article
 I generally handle files that are a few hundred MB to a few gigs, and I
 noticed that the input is a char[]. Do you also plan on adding file streams
 as input?

I believe the suggested approach in this case is to access the input as a memory-mapped file. This does place some restrictions on file size in 32-bit applications, but then those are ideally in decline.

Sean
Mar 12 2008
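Sean's memory-mapped-file suggestion can be illustrated outside of Tango; here is a minimal sketch in Python (the file name and XML content are invented for the demo). The point is that the mapping behaves like an in-memory byte buffer, so a parser expecting char[]-style input can consume it while the OS pages the file in lazily:

```python
import mmap
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in XML file (in practice this could be hundreds of MB).
with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".xml") as f:
    f.write(b"<root><item>42</item></root>")
    path = f.name

with open(path, "rb") as fh:
    # Length 0 maps the whole file; pages are faulted in on demand.
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        # The mapping supports slicing like an ordinary byte buffer ...
        print(buf[6:12])                 # b'<item>'
        # ... so its contents can be handed straight to a parser.
        doc = ET.fromstring(buf[:])
        print(doc.find("item").text)     # 42

os.remove(path)
```

The same idea is what the FileConduit/MappedBuffer pairing shown further down the thread provides on the Tango side.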
next sibling parent reply N/A <NA Na.na> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 I believe the suggested approach in this case is to access the input as a
 memory-mapped file. This does place some restrictions on file size in 32-bit
 applications, but then those are ideally in decline.

Any examples on how to approach this using Tango?

Cheers,
N/A
Mar 12 2008
parent reply Scott Sanders <scott stonecobra.com> writes:
N/A Wrote:

 Any examples on how to approach this using Tango?

 Cheers,
 N/A

Should be able to:

auto fc = new FileConduit ("test.txt");
auto buf = new MappedBuffer(fc);
auto doc = new Document!(char);
doc.parse(buf.getContent());

That should do it.
Mar 12 2008
parent reply N/A <NA NA.com> writes:
 Should be able to:
 auto fc = new FileConduit ("test.txt");
 auto buf = new MappedBuffer(fc);
 auto doc = new Document!(char);
 doc.parse(buf.getContent());
 That should do it.

Thanks, I was wondering how to do it using the PullParser.

Cheers
Mar 13 2008
parent Scott Sanders <scott stonecobra.com> writes:
N/A Wrote:

 
 Should be able to:
 auto fc = new FileConduit ("test.txt");
 auto buf = new MappedBuffer(fc);
 auto doc = new Document!(char);
 doc.parse(buf.getContent());
 That should do it.

 Thanks, I was wondering how to do it using the PullParser.

Scott
Mar 13 2008
prev sibling parent reply BCS <BCS pathlink.com> writes:
Sean Kelly wrote:
 == Quote from N/A (NA NA.na)'s article
 I generally handle files that are a few hundred MB to a few gigs, and I
 noticed that the input is a char[]. Do you also plan on adding file
 streams as input?

 I believe the suggested approach in this case is to access the input as a
 memory-mapped file. This does place some restrictions on file size in
 32-bit applications, but then those are ideally in decline.

What might be interesting is to make a version that works with slices of the file rather than RAM (make the current version into a template specialized on char[] and the new one on some new type?). That way only the parsed metadata needs to stay in RAM. It would take a lot of games mapping stuff in and out of RAM, but it would be interesting to see if it could be done.
Mar 13 2008
parent reply Kris <foo bar.com> writes:
BCS Wrote:

 What might be interesting is to make a version that works with slices of
 the file rather than RAM (make the current version into a template
 specialized on char[] and the new one on some new type?). That way only
 the parsed metadata needs to stay in RAM. It would take a lot of games
 mapping stuff in and out of RAM, but it would be interesting to see if it
 could be done.

It would be interesting, but isn't that kinda what memory-mapped files provide for? You can operate on files up to 4GB in size (on a 32-bit system), even with DOM, where the slices are virtual addresses within paged file-blocks. Effectively, each paged segment of the file is a lower-level slice?
Mar 13 2008
parent reply BCS <ao pathlink.com> writes:
Reply to kris,

 BCS Wrote:
 
 What might be interesting is to make a version that works with slices
 of the file rather than RAM (make the current version into a
 template specialized on char[] and the new one on some new type?).
 That way only the parsed metadata needs to stay in RAM. It would
 take a lot of games mapping stuff in and out of RAM, but it would be
 interesting to see if it could be done.

 It would be interesting, but isn't that kinda what memory-mapped files
 provide for? You can operate on files up to 4GB in size (on a 32-bit
 system), even with DOM, where the slices are virtual addresses within
 paged file-blocks. Effectively, each paged segment of the file is a
 lower-level slice?

Not as I understand it (I looked this up about a year ago, so I'm a bit rusty). On 32 bits you can't map in 4GB, because you need space for the program's code (and on Windows you only get 3GB of address space, as the OS gets that last GB). Also, what about a 10GB file?

My idea is to make some sort of lib that lets you handle large data sets (64-bit?). You would ask for a file to be "mapped in" and then you would get an object that syntactically looks like an array. Index ops would actually map in pieces; slices would generate new objects (with a ref to the parent) that would, on demand, map stuff in. Some sort of GCish thing would start unmapping/moving strings when space gets tight. If you never have to actually convert the data to a "real" array, you don't ever need to copy the stuff; you can just leave it in the file.

I'm not sure it's even possible or how it would work, but it would be cool. (and highly useful)
Mar 13 2008
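BCS's idea of an object that "syntactically looks like an array" but maps file pieces in on demand can be prototyped compactly. Here is a rough sketch in Python (the class name FileSlice and the demo file are invented for illustration; a real version would cache and recycle windows instead of remapping on every access, and add the GC-like eviction BCS describes):

```python
import mmap
import os
import tempfile

class FileSlice:
    """Read-only, array-like view of a file. Indexing maps in only an
    aligned window around the requested bytes, so the whole file never
    has to fit in the address space."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.size = os.fstat(self.fd).st_size
        # mmap offsets must be multiples of the allocation granularity.
        self.gran = mmap.ALLOCATIONGRANULARITY

    def __len__(self):
        return self.size

    def __getitem__(self, key):
        if isinstance(key, int):
            if key < 0:
                key += self.size
            key = slice(key, key + 1)
        start, stop, _ = key.indices(self.size)
        if stop <= start:
            return b""
        base = (start // self.gran) * self.gran   # aligned window start
        with mmap.mmap(self.fd, stop - base,
                       access=mmap.ACCESS_READ, offset=base) as win:
            return win[start - base:stop - base]

    def close(self):
        os.close(self.fd)

# Demo on a small file standing in for a multi-GB XML dump.
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(b"<item>payload</item>\n" * 10000)
    path = f.name

fs = FileSlice(path)
print(len(fs))          # 210000
print(fs[8190:8196])    # b'<item>' -- a file slice, mapped on demand
fs.close()
os.remove(path)
```

Each access touches only one small mapped window, which is why even a file larger than a 32-bit address space could be sliced piecemeal this way.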
next sibling parent "Kris" <foo bar.com> writes:
Reply to BCS:

"BCS" <ao pathlink.com> wrote in message 
news:55391cb32a6178ca5358fd65a320 news.digitalmars.com...
 Reply to kris,

 BCS Wrote:

 what might be interesting is to make a version that works with slices
 of the file rather than ram. (make the current version into a
 template specialized on char[] and the new one on some new type?)
 That way only the parsed meta data needs to stay in ram. It would
 take a lot of games mapping stuff in and out of ram but it would be
 interesting to see if it could be done.

 It would be interesting, but isn't that kinda what memory-mapped files
 provide for? You can operate on files up to 4GB in size (on a 32-bit
 system), even with DOM, where the slices are virtual addresses within
 paged file-blocks. Effectively, each paged segment of the file is a
 lower-level slice?

 Not as I understand it (I looked this up about a year ago, so I'm a bit
 rusty). On 32 bits you can't map in 4GB, because you need space for the
 program's code (and on Windows you only get 3GB of address space, as the
 OS gets that last GB).

Doh. You're right, of course. Thank goodness for 64bit machines :)
Mar 13 2008
prev sibling parent reply Alexander Panek <alexander.panek brainsware.org> writes:
BCS wrote:
 Reply to kris,
 
 BCS Wrote:

 what might be interesting is to make a version that works with slices
 of the file rather than ram. (make the current version into a
 template specialized on char[] and the new one on some new type?)
 That way only the parsed meta data needs to stay in ram. It would
 take a lot of games mapping stuff in and out of ram but it would be
 interesting to see if it could be done.

 It would be interesting, but isn't that kinda what memory-mapped files
 provide for? You can operate on files up to 4GB in size (on a 32-bit
 system), even with DOM, where the slices are virtual addresses within
 paged file-blocks. Effectively, each paged segment of the file is a
 lower-level slice?

 Not as I understand it (I looked this up about a year ago, so I'm a bit
 rusty). On 32 bits you can't map in 4GB, because you need space for the
 program's code (and on Windows you only get 3GB of address space, as the
 OS gets that last GB). Also, what about a 10GB file? My idea is to make
 some sort of lib that lets you handle large data sets (64-bit?). You
 would ask for a file to be "mapped in" and then you would get an object
 that syntactically looks like an array. Index ops would actually map in
 pieces; slices would generate new objects (with a ref to the parent)
 that would, on demand, map stuff in. Some sort of GCish thing would
 start unmapping/moving strings when space gets tight. If you never have
 to actually convert the data to a "real" array, you don't ever need to
 copy the stuff; you can just leave it in the file.

I've got this strange feeling in my stomach that shouts out "WTF?!" when I read about >3-4GB XML files. I know, it's about the "if" and "whens", but still; if you find yourself needing such a beast of an XML file, you might possibly think of other forms of data structuring (a database, perhaps?).
Mar 14 2008
next sibling parent reply "Koroskin Denis" <2korden+dmd gmail.com> writes:
On Fri, 14 Mar 2008 11:40:20 +0300, Alexander Panek  
<alexander.panek brainsware.org> wrote:

 I've got this strange feeling in my stomach that shouts out "WTF?!" when
 I read about >3-4GB XML files. I know, it's about the "if" and "whens",
 but still; if you find yourself needing such a beast of an XML file, you
 might possibly think of other forms of data structuring (a database,
 perhaps?).

It sounds strange, but even large companies like Google or Yahoo store their temporary search indexes in ULTRA large XML files, and many of them can easily be tens or even hundreds of GBs in size (just an ordinary daily index) before they get "repacked" into a more compact format.
Mar 14 2008
next sibling parent Alexander Panek <alexander.panek brainsware.org> writes:
Koroskin Denis wrote:
 On Fri, 14 Mar 2008 11:40:20 +0300, Alexander Panek 
 <alexander.panek brainsware.org> wrote:
 I've got this strange feeling in my stomach that shouts out "WTF?!" 
 when I read about >3-4GB XML files. I know, it's about the "if" and 
 "whens", but still; if you find yourself needing such a beast of an 
 XML file, you might possibly think of other forms of data structuring 
 (a database, perhaps?).

 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of them
 can easily be tens or even hundreds of GBs in size (just an ordinary daily
 index) before they get "repacked" into a more compact format.

That does, indeed, sound strange. :X
Mar 14 2008
prev sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
Koroskin Denis wrote:
 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of
 them can easily be tens or even hundreds of GBs in size (just an ordinary
 daily index) before they get "repacked" into a more compact format.

It's a shame the "O RLY?" owl died out years ago...
Mar 14 2008
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Robert Fraser" <fraserofthenight gmail.com> wrote in message 
news:freg27$1m7l$1 digitalmars.com...
 Koroskin Denis wrote:
 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of them
 can easily be tens or even hundreds of GBs in size (just an ordinary daily
 index) before they get "repacked" into a more compact format.

 It's a shame the "O RLY?" owl died out years ago...

O RLY? Good internet memes never die, they just go into hibernation ;)
Mar 14 2008
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Robert Fraser wrote:
 Koroskin Denis wrote:
 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of
 them can easily be tens or even hundreds of GBs in size (just an ordinary
 daily index) before they get "repacked" into a more compact format.

 It's a shame the "O RLY?" owl died out years ago...

SRSLY? :P

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Mar 23 2008
parent Christopher Wright <dhasenan gmail.com> writes:
Bruno Medeiros wrote:
 Robert Fraser wrote:
 Koroskin Denis wrote:
 It sounds strange, but even large companies like Google or Yahoo store
 their temporary search indexes in ULTRA large XML files, and many of
 them can easily be tens or even hundreds of GBs in size (just an ordinary
 daily index) before they get "repacked" into a more compact format.

 It's a shame the "O RLY?" owl died out years ago...

 SRSLY? :P

You know Sir Sly?
Mar 23 2008
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Alexander Panek (alexander.panek brainsware.org)'s article
 I've got this strange feeling in my stomach that shouts out "WTF?!" when
 I read about >3-4GB XML files. I know, it's about the "if" and "whens",
 but still; if you find yourself needing such a beast of an XML file, you
 might possibly think of other forms of data structuring (a database,
 perhaps?).

It's quite possible that an XML stream could be used as the transport mechanism for the result of a database query. In such an instance, I wouldn't be at all surprised if a response were more than 3-4GB. In fact, I've designed such a system, and the proper query would definitely have produced such a dataset.

Sean
Mar 14 2008
prev sibling parent BCS <ao pathlink.com> writes:
Reply to Alexander,

 BCS wrote:
 
 Not as I understand it (I looked this up about a year ago, so I'm a
 bit rusty). On 32 bits you can't map in 4GB, because you need space
 for the program's code (and on Windows you only get 3GB of address
 space, as the OS gets that last GB). Also, what about a 10GB file? My
 idea is to make some sort of lib that lets you handle large data
 sets (64-bit?). You would ask for a file to be "mapped in" and then
 you would get an object that syntactically looks like an array. Index
 ops would actually map in pieces; slices would generate new objects
 (with a ref to the parent) that would, on demand, map stuff in. Some
 sort of GCish thing would start unmapping/moving strings when space
 gets tight. If you never have to actually convert the data to a
 "real" array, you don't ever need to copy the stuff; you can just
 leave it in the file. I'm not sure it's even possible or how it would
 work, but it would be cool. (and highly useful)
 

 I've got this strange feeling in my stomach that shouts out "WTF?!" when
 I read about >3-4GB XML files. I know, it's about the "if" and "whens",
 but still; if you find yourself needing such a beast of an XML file, you
 might possibly think of other forms of data structuring (a database,
 perhaps?).

Truth be told, I'm not that far from agreeing with you (on seeing that, I'd think: "WTF?!?!.... Um... OoooK.... well..."). I can't think of a justification for the lib I described if the only thing it would be used for were an XML parser. It might be used for managing parts of something like... a database table. <G>
Mar 14 2008