Snippet for SAX parsing a user object in C++?

Can anyone share a snippet of code where they parsed a user-defined object using a SAX parser in C++?

SAX stands for Simple API for XML.
SAX is for parsing XML files only. So if you want to parse an object from C++ using SAX, the only way this makes sense is to serialize that object into XML first.
The reason you would want to use a SAX XML parser, though, is that you may have a very large XML file and want to read it in parts only; by doing this you use less RAM.
A good example of why you'd want a SAX parser is if you have serialized a std::vector of strings into an XML file. Say this vector contains millions of entries: then you can't read the entire XML file into memory at once, and you have to use a SAX parser instead.
There are many different implementations of SAX parsers, but you can even create one yourself fairly easily by drawing a finite state machine. Whenever the machine reaches a state where it has read a full element, it calls a callback function.
You can also use LibXML, MSXML, Xerces-C++ XML Parser, and many more libraries for SAX parsing.
Getting back to your original question: your SAX parser would simply parse the XML representation of your object and fill in the object's members, or, if your object is a container, add the container's elements.
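To make this concrete, here is a minimal hand-rolled sketch of the idea. Everything in it is hypothetical (the `saxParse` and `parsePerson` helpers, the `Person` struct, and the tag names are made up for illustration), and it handles only plain elements and text, with no attributes, entities, or error recovery:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Minimal SAX-style scanner: walks the input once and fires callbacks.
// Handles only plain elements and text -- no attributes, comments,
// CDATA, entities, or error recovery -- just enough to sketch the idea.
void saxParse(const std::string& xml,
              const std::function<void(const std::string&)>& onStart,
              const std::function<void(const std::string&)>& onText,
              const std::function<void(const std::string&)>& onEnd) {
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] == '<') {
            std::size_t close = xml.find('>', i);
            if (close == std::string::npos) break;   // malformed: give up
            std::string tag = xml.substr(i + 1, close - i - 1);
            if (!tag.empty() && tag[0] == '/')
                onEnd(tag.substr(1));
            else
                onStart(tag);
            i = close + 1;
        } else {
            std::size_t lt = xml.find('<', i);
            std::string text = xml.substr(i, lt - i);
            if (text.find_first_not_of(" \t\r\n") != std::string::npos)
                onText(text);
            i = (lt == std::string::npos) ? xml.size() : lt;
        }
    }
}

// The user-defined object we want to fill.
struct Person {
    std::string name;
    int age = 0;
};

// The callbacks fill the object's members as matching elements stream by.
Person parsePerson(const std::string& xml) {
    Person p;
    std::string current;   // name of the element we are currently inside
    saxParse(xml,
             [&](const std::string& tag) { current = tag; },
             [&](const std::string& text) {
                 if (current == "name") p.name = text;
                 else if (current == "age") p.age = std::stoi(text);
             },
             [&](const std::string&) { current.clear(); });
    return p;
}
```

The important pattern is the last function: the parser never builds a tree, it just streams events, and the callbacks copy the few values you care about into the target object.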

Related

Clojure design philosophy of functions vs methods

It's often quoted that
"It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." — Alan Perlis
and this is something that is heavily observed in Clojure and its libraries too, e.g.
In the gist here, Rich explains the design principle behind clojure.xml: once the XML file is read, it is converted to a map, and then all the functions for manipulating maps are available to you and can be reused with any other maps.
I'm having trouble understanding how the functions implemented to manipulate these maps representing XML can be used anywhere else.
I mean, wouldn't the 100 functions I'd have written be specific to the XML domain (i.e. specific to the map schema for XML), so the only reuse of these functions would be when another map adheres to the same schema?
Am I missing anything?
From the linked gist:
Look at the way clojure.xml processes XML - it deals with an external entity, a SAX parser fronting some XML source, gets the data out of it and into a map, and it's done. There aren't and won't be many more functions specific to XML. The world of functions for manipulating maps will now apply. And if it becomes necessary to have more XQuery-like capabilities in order to handle XML applications, they will be built for maps generally, and be available and reusable on all maps. Contrast this with your typical XML DOM, a huge collection of functions useful nowhere else.
Wouldn't the 100 functions I would've written be specific to the XML domain?
No.
Because you would not be dealing with XML anymore. XML is merely a markup language. In other language ecosystems (say, Java), when someone refers to XML, they mean the markup language and (almost always) its extensions and the tools built around them. The way Clojure (at least clojure.xml) handles XML is to read the data and then continue processing it as Clojure's native maps.
It is worth noting that (as of now) clojure.xml contains only one function, parse. This means it supports neither serializing to XML nor in-place editing of XML data elements.
I think you are confusing XML with the particular instance of an XML file.
Given a specific XML file like say:
<xml>
  <node1>One</node1>
  <node2>Two</node2>
  <node3>
    <node31>
      ThreeOne
    </node31>
    <node32>
      ThreeTwo
    </node32>
  </node3>
</xml>
It is true that, for example, you might want to have logic that grabs the values from the child nodes under node3. So maybe you'd have a function called get-three-node-child-values. That function would not be reusable given another XML file with a different structure that did not have a node3 with children, as the above XML does.
But these are not the functions Rich is talking about reusing. The functions being reused are the ones you use to implement the logic of get-three-node-child-values. If what you had was an XML object, that object would need a method to get node3, another method to get that node's children, and another to get a node's value. All of these methods work only for the XML class of objects and had to be written specifically for it. But if you turn the XML into a map, you don't need these methods at all and don't have to implement them, since a map already has functions to navigate and loop over its entries.
Hope that made it more clear.

C++ XML parsing library for large files with low RAM consumption

Which XML library for C++ uses the lowest amount of RAM while parsing, e.g., a 300 MB file? Ideally the choice should be restricted to one of RapidXml, Pugixml, Libxml, Boost, or TinyXML.
You haven't clarified your complete requirements. There are two commonly used XML parsing models: DOM and SAX. In DOM the entire file is parsed into memory as a tree, whereas SAX is event driven. If you are
not planning to modify the XML & just consume it and
concerned about RAM usage
then the SAX model will be optimal. I would be surprised if a half-decent implementation of SAX were memory intensive. Also look at this post: Light weight C++ SAX XML parser
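As a rough illustration of why streaming keeps memory flat, here is a sketch with a hypothetical `countElements` helper (not from any of the libraries above). It scans a stream in fixed-size chunks, so memory use is bounded by the chunk size plus a tiny carry-over, not by document size; it only counts plain `<tag>` occurrences and ignores attributes entirely:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Counts occurrences of "<tag>" in an XML stream while holding only a
// 4 KB chunk plus a small carry-over in memory, regardless of file size.
// Illustrative only: a real SAX parser tracks far more state, and this
// does not match elements written with attributes ("<tag attr=...>").
std::size_t countElements(std::istream& in, const std::string& tag) {
    const std::string needle = "<" + tag + ">";
    std::string window;                  // carry-over across chunk borders
    char buf[4096];
    std::size_t count = 0;
    while (in.read(buf, sizeof buf), in.gcount() > 0) {
        window.append(buf, static_cast<std::size_t>(in.gcount()));
        std::size_t pos = 0;
        while ((pos = window.find(needle, pos)) != std::string::npos) {
            ++count;
            pos += needle.size();
        }
        // Keep needle.size()-1 trailing chars so a match split across two
        // chunks is still found, while an already-counted match (which
        // needs the full needle length) cannot be counted twice.
        if (window.size() >= needle.size())
            window.erase(0, window.size() - (needle.size() - 1));
    }
    return count;
}
```

A DOM parser answering the same question would have to materialize the whole 300 MB tree first; here the working set never grows past a few kilobytes.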

Load partial xml using xerces dom parser

The XML file that I get is huge, but while processing I only need specific parts of it, accessed in random order (hence I cannot use SAX).
Is there any way I can load only a partial DOM tree into memory using the Xerces DOM parser?
It sounds like what you want is something like Python's pulldom which Xerces does not offer.
If you are beholden to Xerces and memory is a primary concern, you could use Xerces SAX (push) parser to populate a data structure with only the data from the XML that you care about. Then you could "randomly" access the data that you are interested in.
If you are free to use other libraries, you might look into a StAX (pull) parser. Although, I think you will still have to implement your own data structure to hold the data you're interested in. I'm not aware of a C++ equivalent of Python's pulldom.
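To show what "pull" means in contrast to SAX's callbacks, here is a toy StAX-like interface in C++. The `Event` and `PullParser` names are made up (this is not the API of any real StAX library), and it handles only plain elements and text, but it demonstrates the calling pattern: the application asks for the next event when it wants one, instead of handing control to the parser:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// A pull-parser event: what was seen, and the tag name or text content.
struct Event {
    enum Kind { StartElement, Text, EndElement, Finished } kind;
    std::string value;
};

// Toy pull parser over an in-memory string. No attributes, comments,
// or error handling -- just enough to show the pull calling pattern.
class PullParser {
public:
    explicit PullParser(std::string xml) : xml_(std::move(xml)) {}

    Event next() {
        // skip insignificant whitespace between events
        while (pos_ < xml_.size() &&
               std::isspace(static_cast<unsigned char>(xml_[pos_])))
            ++pos_;
        if (pos_ >= xml_.size()) return {Event::Finished, ""};
        if (xml_[pos_] == '<') {
            std::size_t close = xml_.find('>', pos_);
            std::string tag = xml_.substr(pos_ + 1, close - pos_ - 1);
            pos_ = close + 1;
            if (!tag.empty() && tag[0] == '/')
                return {Event::EndElement, tag.substr(1)};
            return {Event::StartElement, tag};
        }
        std::size_t lt = xml_.find('<', pos_);
        std::string text = xml_.substr(pos_, lt - pos_);
        pos_ = (lt == std::string::npos) ? xml_.size() : lt;
        return {Event::Text, text};
    }

private:
    std::string xml_;
    std::size_t pos_ = 0;
};
```

Because the caller drives the loop, it can simply stop pulling (or skip ahead) once it has copied the data it cares about into its own structure, which is exactly the "populate only what you need" approach described above.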

What's the difference between QXml and QDom?

In Qt there are a number of different ways to work with XML. To keep this simple I only want to look at the QXml* classes and the QDom* classes.
I'm trying to figure out which one to use but they both look to have similar functionality.
What are the main differences between QXml and QDom?
Hypothetical example: Does one read the whole xml file into memory making it slow at startup but faster after startup?
What scenarios would require you to use one method over the other, and why?
Hypothetical example: let's say you are doing a "one-pass" versus "multi-pass"...
In short, QXml* classes implement SAX (Simple API for XML) XML parser while QDom* implement DOM (Document Object Model) XML parser.
The main difference is that SAX is a sequential-access parser: it parses the document as it reads it and makes the first chunks of parsed data available almost instantly. DOM needs to load the whole document into memory to parse it, but it can be a bit easier to handle in terms of code overhead (for SAX you have to implement an XML handler class). In general, SAX is more lightweight and faster.
There's lots of reading online regarding comparison of SAX and DOM:
why is sax parsing faster than dom parsing ? and how does stax work?
http://developerlife.com/tutorials/?p=28
And here's a nice document comparing various multiplatform XML parsers (including QXml* and QDom*). Your best choice depends on your use case, if you're working with huge XML documents, you'd prefer SAX. For tiny XMLs you'd be better off using DOM, since it's just a few lines of code to get data you need from a file.

XML library optimized for big XML with memory constraints

I need to handle big XML files, but I only want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints; we must never use more than, say, 300 MB of RAM.
Is there a library that allows me not to keep all the DOM in memory, and parse the XML on the go, while I traverse the DOM?
I know you can do that with a callback-based approach, but I don't want that. I want to have my cake and eat it too: I want to use the DOM API, but parse each element lazily, so that existing code that uses the DOM API won't have to change.
There are two possible approaches I thought of for this problem:
Parse the XML lazily; each call to getChildren() parses the next bit of XML.
Parse the entire XML tree, but cache whatever you're not using right now on the disk.
Either of the two approaches is acceptable; is there an existing solution?
I'm looking for a native solution, but I'd be interested in hearing about libraries in other languages.
It sounds like what you want is something similar to the Streaming API for XML (StAX).
While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
There are a number of implementations linked on the Wikipedia page for StAX most of which are for Java, but there are a couple for C++ too - Ambiera irrXML and Llamagraphics LlamaXML.
edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this XML.com introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.
I want to use the DOM API, but parse each element lazily, so that existing code that uses the DOM API won't have to change.
You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.
XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.
I suppose you could do something where you build a table of elements, with file offsets pointing to where that element is in the file. But at that point, you've already read and parsed the file more or less. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
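The offset-table idea can be sketched like this. The `indexElements` and `readElement` helpers are hypothetical, and the sketch assumes plain `<tag>` elements with no attributes and no nesting of the same tag: one cheap pass records where each element starts, and later lookups seek straight to the one element needed instead of holding a DOM in memory:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Pass 1: record the stream offset of every "<tag>" occurrence using a
// rolling window, so nothing but the window is ever held in memory.
std::vector<std::streampos> indexElements(std::istream& in,
                                          const std::string& tag) {
    const std::string open = "<" + tag + ">";
    std::vector<std::streampos> offsets;
    std::string window;
    char c;
    while (in.get(c)) {
        window.push_back(c);
        if (window.size() > open.size())
            window.erase(0, 1);
        if (window == open)
            offsets.push_back(in.tellg() - std::streamoff(open.size()));
    }
    in.clear();   // clear EOF state so the caller can seek afterwards
    return offsets;
}

// Later: seek to a recorded offset and read just that one element's
// raw text, stopping at its closing tag.
std::string readElement(std::istream& in, std::streampos at,
                        const std::string& tag) {
    const std::string close = "</" + tag + ">";
    in.seekg(at);
    std::string out;
    char c;
    while (in.get(c)) {
        out.push_back(c);
        if (out.size() >= close.size() &&
            out.compare(out.size() - close.size(), close.size(), close) == 0)
            break;
    }
    return out;
}
```

This only pays off when individual elements are small relative to the file, which matches the caveat above: if most of the document's bulk is in the elements you need anyway, you may as well build a DOM.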
Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.
How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion: see http://www.saxonica.com/html/documentation/sourcedocs/streaming.html. Also, as already mentioned, there is STX (though implementations are not especially mature).
Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
Google for "streaming XML transformation"