
digitalmars.D.learn - XML Parsing

reply "Chris Pons" <cmpons gmail.com> writes:
Hey Guys,
I am trying to parse an XML document with std.xml. I've looked
over the std.xml reference and its example, but I'm
still stuck. I've also looked over some other example code, but it's a
bit confusing and doesn't really explain what I'm doing
wrong.

As far as I understand it, I should load the file with read from
std.file and store the result in a string. From there, I check that
the string xmlData is well-formed XML.

This is where it gets a bit confusing. I followed the example,
created a new instance of the DocumentParser class, and then
tried to read an attribute from the start tag of map. The value I'm
targeting right now is the width of the map in tiles, which I want to
save into an integer. However, the value I get is 0.

Any help would be MUCH appreciated.

Here is a reference to the XML file: http://pastebin.com/tpUU1Wtv


//These two functions are called in my main loop.
	void LoadMap(string filename)
	{
		enforce( filename != "" , "Filename is invalid!" );

		xmlData = cast(string) read(filename);

		enforce( xmlData != "", "Read file Failed!" );

		debug StopWatch sw = StopWatch(AutoStart.yes);
		check(xmlData);
		debug writeln( "Verified XML in ", sw.peek.msecs, "ms.");		
	}
	
	void ParseMap()
	{
		auto xml = new DocumentParser(xmlData);

		xml.onStartTag["map"] = (ElementParser xml)
		{
			mapWidth = to!int(xml.tag.attr["width"]);
			xml.parse();
		};
		xml.parse();
		writeln("Map Width: ", mapWidth);
	}
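
One possible explanation, assuming <map> is the root element of the file and a Phobos release that still ships std.xml: DocumentParser's constructor already parses through the root start tag, so an onStartTag["map"] handler would only fire for <map> elements nested below the root, and the root's own attributes can be read from xml.tag directly. A minimal sketch (the inline XML string and the rootWidth helper are hypothetical stand-ins, not the real map file or any library function):

```d
import std.conv : to;
import std.stdio : writeln;
import std.xml;

// Hypothetical helper: read the width attribute of the root element.
int rootWidth(string xmlData)
{
	check(xmlData); // throws CheckException if the input is not well-formed

	// The constructor parses through the root start tag (but not beyond),
	// so xml.tag already describes the root element at this point.
	auto xml = new DocumentParser(xmlData);
	return to!int(xml.tag.attr["width"]);
}

void main()
{
	// Stand-in for the real map file, assumed to have <map> as its root.
	string xmlData = `<map width="25" height="19"><layer name="ground"></layer></map>`;
	writeln("Map Width: ", rootWidth(xmlData));
}
```

Handlers registered with onStartTag are still the way to reach elements below the root; only the root tag itself is consumed before any handler can run.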
Mar 19 2012
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
I know very little about std.xml (I looked at it and
said 'meh' and wrote my own lib), but my lib
makes this pretty simple.

https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff

grab dom.d and characterencodings.d

This has a bit of an html bias, but it works for xml too.

===
import arsd.dom;
import std.file;
import std.stdio;
import std.conv;

void main() {
	auto document = new Document(readText("test12.xml"), true, true);

	auto map = document.requireSelector("map");

	writeln(to!int(map.width), "x", to!int(map.height));

	foreach(tile; document.getElementsByTagName("tile"))
		writeln(tile.gid);
}
===

$ dmd test12.d dom.d characterencodings.d
$ test12
25x19
<snip tile data>





Let me explain the lines:

	auto document = new Document(readText("test12.xml"), true, true);

We use std.file.readText to read the file as a string. Document's
constructor is: (string data, bool caseSensitive, bool 
strictMode).

So, "true, true" means it will act like an XML parser, instead of
trying to correct for html tag soup.


Now, document is a DOM, like you see in W3C or web browsers
(via javascript), though it is expanded with a lot of convenience
and sugar.

	auto map = document.requireSelector("map");

querySelector and requireSelector use CSS selector syntax
to fetch one element. querySelector may return null, whereas
requireSelector will throw an exception if the element is not
found.
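
A small sketch of the difference, assuming dom.d and characterencodings.d are compiled in as in the example above. The inline XML string and the hasElement helper are hypothetical stand-ins, and the exception type name is taken from the library's ElementNotFoundException class:

```d
import arsd.dom;
import std.stdio : writeln;

// Hypothetical helper: true if the selector matches at least one element.
bool hasElement(Document document, string selector)
{
	// querySelector returns null when nothing matches.
	return document.querySelector(selector) !is null;
}

void main()
{
	// Stand-in document with no <layer> element.
	auto document = new Document(`<map width="25" height="19"></map>`, true, true);

	writeln("has map: ", hasElement(document, "map"));
	writeln("has layer: ", hasElement(document, "layer"));

	try
	{
		// requireSelector throws instead of returning null.
		auto layer = document.requireSelector("layer");
		writeln("found: ", layer.tagName);
	}
	catch (ElementNotFoundException e)
	{
		writeln("not found: ", e.msg);
	}
}
```

requireSelector suits elements the rest of the program cannot work without; querySelector suits optional ones where a null check is enough.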

You can learn more about CSS selector syntax on the web. I tried
to cover a good chunk of the standard, including most css2 and
some css3.

Here, I'm asking for the first element with tag name "map".


You can also use querySelectorAll to get all the elements that
match, returned as an array, which is great for looping.

	writeln(to!int(map.width), "x", to!int(map.height));


The attributes on an element are exposed via dot syntax,
or you can use element.getAttribute("name") if you
prefer.

They are returned as strings. Using std.conv.to, we can
easily convert them to integers.
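
As a sketch of that conversion step using only Phobos: to!int throws a ConvException when the string is not a valid integer, so an empty or malformed attribute value can be handled explicitly. The toIntOr helper is hypothetical, not part of any library:

```d
import std.conv : to, ConvException;
import std.stdio : writeln;

// Hypothetical helper: convert an attribute string to int,
// returning a fallback when the text is not a valid integer.
int toIntOr(string value, int fallback)
{
	try
	{
		return to!int(value);
	}
	catch (ConvException)
	{
		return fallback;
	}
}

void main()
{
	// Attribute values arrive as strings, as with map.width and map.height.
	string width = "25";
	string height = "19";

	writeln(to!int(width) * to!int(height)); // total number of tiles
	writeln(toIntOr("", -1));                // empty string falls back
	writeln(toIntOr("abc", -1));             // malformed text falls back
}
```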


	foreach(tile; document.getElementsByTagName("tile"))
		writeln(tile.gid);

And finally, we get all the tile tags in the document and
print out their gid attribute.

Note that you can also call the element search functions
on individual elements. That will only return matches from
that element and its children.



Here, you didn't need it, but you can also use
element.innerText to get the text inside a tag,
pretty much covering basic data retrieval.




Note: my library is not good at handling huge files;
it eats a good chunk of memory and loads the whole
document at once. But, it is the easiest way I've
seen (I'm biased though) to work with xml files,
so I like it.
Mar 19 2012
next sibling parent "Chris Pons" <cmpons gmail.com> writes:
On Tuesday, 20 March 2012 at 04:32:13 UTC, Adam D. Ruppe wrote:
 I know very little about std.xml (I looked at it and
 said 'meh' and wrote my own lib), but my lib
 makes this pretty simple.

Thank you. I'll check it out.
Mar 20 2012
prev sibling parent reply "Iain" <staffell gmail.com> writes:
On Tuesday, 20 March 2012 at 04:32:13 UTC, Adam D. Ruppe wrote:
 I know very little about std.xml (I looked at it and
 said 'meh' and wrote my own lib), but my lib
 makes this pretty simple.

Hi Adam,

I'm also interested in your solution, as the std.xml page is so
sparsely documented I can't make head nor tail of it. Also,
neither of the examples compile for me, making life that little
bit harder!

Sadly, I can't get your code working either! I have downloaded
the folder zip from your github link, and extracted it so that
all the .d files are living in C:\D\dmd2\src\phobos\arsd\

If I try to compile the code you gave above, I get a pile of
linking errors using D 2.059:

C:\D\dmd2\windows\bin\dmd.exe parseSpain -O
OPTLINK (R) for Win32 Release 8.00.12
Copyright (C) Digital Mars 1989-2010 All rights reserved.
http://www.digitalmars.com/ctg/optlink.html
parseSpain.obj(parseSpain)
 Error 42: Symbol Undefined _D4arsd3dom12__ModuleInfoZ
parseSpain.obj(parseSpain)
 Error 42: Symbol Undefined _D4arsd3dom8__assertFiZv
parseSpain.obj(parseSpain)
 Error 42: Symbol Undefined _D4arsd3dom24ElementNotFoundException7__ClassZ
parseSpain.obj(parseSpain)
 Error 42: Symbol Undefined _D4arsd3dom24ElementNotFoundException6__ctorMFAyaAyaAyaiZC4arsd3dom24ElementNotFoundException
parseSpain.obj(parseSpain)
 Error 42: Symbol Undefined _D4arsd3dom8Document6__ctorMFAyabbZC4arsd3dom8Document
parseSpain.obj(parseSpain)
 Error 42: Symbol Undefined _D4arsd3dom8Document7__ClassZ
--- errorlevel 6

Do you have any idea what's going on?!
May 18 2012
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 18 May 2012 at 23:08:59 UTC, Iain wrote:
 If I try to compile the code you gave above, I get a pile of 
 linking errors using D 2.059:
You have to link in the modules too on the command line:

dmd.exe parseSpain arsd/dom.d arsd/characterencodings.d

(or whatever the full path to the modules is)
May 18 2012
parent reply "Iain" <staffell gmail.com> writes:
On Friday, 18 May 2012 at 23:16:26 UTC, Adam D. Ruppe wrote:
 On Friday, 18 May 2012 at 23:08:59 UTC, Iain wrote:
 If I try to compile the code you gave above, I get a pile of 
 linking errors using D 2.059:
 You have to link in the modules too on the command line:

 dmd.exe parseSpain arsd/dom.d arsd/characterencodings.d

 (or whatever the full path to the modules is)
Aah thank you! Finally, an XML parser that works in D!!!
May 18 2012
parent reply "Iain" <staffell gmail.com> writes:
On Friday, 18 May 2012 at 23:31:05 UTC, Iain wrote:
 Aah thank you!  Finally, an XML parser that works in D!!!
Adam, thanks for this! I guess you don't need much documentation
for your code, as you can just look up the wealth of tutorials
that have been written for Javascript's XML parser.

I have re-jigged one of std.xml's examples as follows - and it
works!

If there were a vote (and there probably should be) I would
suggest your code ought to replace std.xml. How can D be taken
seriously when it has major parts of the standard library broken?

/*
 * read all the titles from book.xml
 *
 * uses dom.d and characterencodings.d by Adam D. Ruppe:
 * https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff
 */
import arsd.dom;
import std.file;
import std.stdio;
import std.conv;

void main()
{
	// http://msdn2.microsoft.com/en-us/library/ms762271(VS.85).aspx
	auto document = new Document(readText("book.xml"), true, true);

	auto map = document.requireSelector("catalog");

	foreach (book; document.getElementsByTagName("book"))
	{
		string title = book.getElementsByTagName("title")[0].innerText();
		writeln(title);
	}
}
May 18 2012
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 19 May 2012 at 00:00:50 UTC, Iain wrote:
 I guess you don't need much documentation for your code, as
 you can just look up the wealth of tutorials that have been 
 written for Javascript's XML parser.
Yeah, that's basically how I feel about it. I started writing some documentation but haven't gotten around to finishing it yet. But, if you know Javascript, you can probably get work done with my thing too.
 If there were a vote (and there probably should be) I would 
 suggest your code ought to replace std.xml.
This has come up before and some people are for it, but my code isn't built for speed or memory efficiency, so it isn't right for everybody.
May 18 2012