High performance XML parsing in C++

High performance XML parsing in C++ - c++

Well a lot of questions have been made about parsing XML in C++ and so on...
But, instead of a generic problem, mine is very specific.
I am asking for a very efficient XML parser for C++. In particular I have a VERY VERY BIG XML file to parse.
My application must open this file and retrieve data. It must also insert new nodes and save the final result in the file again.
To do this I used, at the beginning, rapidxml, but it requires me to open the file, parse it all (all the content because this lib has no functions to access the file directly without loading the entire tree first), then edit the tree, modify it and store the final tree on the file by overwriting it... It consumes too much resources.
Is there an XML parser that does not require me to load the entire file, but that I can use to insert, quickly, new nodes and retrieve data? Can you please indicate solutions for this problem of mine?

You want a streaming XML parser rather than what is called a DOM parser.
There are two types of streaming parsers: pull and push. A pull parser is good for quickly writing XML parsers that load data into program memory. A push parser is good for writing a program to translate one document to another (which is what you are trying to accomplish). I think, therefore, that a push parser would be best for your problem.
In order to use a push parser, you need to write what is essentially an event handler for parsing events. By "parsing event", I mean events like "start tag reached", "end tag reached", "text found", "attribute parsed", etc.
I suggest that as you read in the document, you write out the transformed document to a separate, temporary file. Thus, your XML parsing event handlers will need to be written so that they are stateful and write out the XML of the translated document incrementally.
Three excellent push parser libraries for C++ include Expat, Xerces-C++, and libxml2.

Search for "SAX parser". They are mostly tokenizers, i.e. they emit tag by tag without building a tree.

SAX parsers are faster than DOM parsers because DOM parsers read the entire file into memory before building an in-memory representation of the XML document, whereas a SAX parser behaves like an event listener and builds the document as it reads in the file. Go here for an explanation.
As you mentioned Xerces is a good C++ SAX parser.
I would recommend looking into ways of breaking the XML document into smaller XML documents as that seems to be part of your problem.

Okay, here is one off the beaten track, I looked at this, but haven't really used it myself, it's called asmxml. These boys claim performance bar none, downside, you need x86 assembler.

If you really seek high performance XML stream parser then libhpxml is likely the right thing for you.

I’m convinced that no XML library exists that allows you to modify a file without loading it first. This simply isn’t possible because files don’t work that way: you cannot insert (or remove) in the middle of a file. You can only overwrite a block of identical size, or append at the end. But your request would require to append or remove in the middle of the file.
Reading only parts of an XML file may be possible. But writing … no way.

Go for template libraries as much as possible, like Boost::property_tree or Boost::XMLParser or POCO::XML and Folly has XML Parser in it.
Avoid old C libraries, it all old code designs.

someone say QtXML module is high performance for huge XML files.

Related

Which libxml2 API should I use for large files?

Our program currently uses the libxml2 DOM API (xmlReadFile) to load an entire file into memory. Unfortunately, this breaks down on "large" XML files, as the basic memory consumption of libxml2 DOM is about 4-5 times the base file size.
It seems libxml2 offers two APIs for reading XML when I don't want to store the whole tree in memory: SAX2 and xmlReader.
I haven't dug into the APIs yet, but I'm wondering which one is preferable under which circumstances?
Note: All I need to do with the XML file is populate some C++ datastructures with the data found in the XML file. And these will in turn be a lot smaller than the (very verbose) XML definition. At the moment, with xmlReadFile and the DOM API the process takes about 100MB memory for a 20MB XML file. The C++ data in memory for such a file is more like 5MB -- so I could go from 1:4 to 4:1, which would already help a lot.

I follow this approach, if the processing is sparse (need only an element here and there) xmlReader is better, if you need to process all elements, SAX is better. Although, opinion could come in to play as to whether you want to push the processing or you want the processing to push your code...

If you need to process large XML documents then size becomes the primary consideration. As you saw with 20MB -> 100MB for DOM parsing, if you get much larger than this that can be prohibitively expensive and SAX may be the only way to process it. For embedded or memory constrained devices SAX may be required even for small files.
If you want to start parsing before the file is complete SAX is the way to go. If you are writing a browser, are streaming XML, or require responsiveness then you will need to use SAX.
SAX is more of a pain, if you can get away with DOM parsing that will usually lead to less code and simpler code, for simpler DOM queries you can avoid a state machine for example. If you only care about a handful of fields in the document you could even avoid querying a DOM parser directly and query XSLT instead.

Line by line parsing a huge XML file by a light weight parser

I'm writing a small stand alone tool for Linux which needs read a huge xml-file.
The xml-file has simple structure and a progressive or streaming (line by line) parser is suitable for it.
I want to use a light-weight class library such as TinyXML but I don't know it supports progressive parsing or not?!
If the answer is "yes", Do you have a sample? And, if the answer is "no", Do you know another alternative for it which is small and header only class library?
Update: How about RapidXML or pugiXML?

Sounds like libxml's XmlReader interface is just what you want. Fast, simple, and streaming. Light-weight and XML don't mix, unfortunately. I prefer XmlReader's pull model to SAX's push model, but they'll both do what you want.
In the pull model, you call a function and get a new node, then check yourself if it matches. In the push model, you supply callbacks and SAX calls them as it finds nodes matching them.
TinyXML, last I checked, is not standards-compliant -- I would avoid it.

How to start using xml with C++

(Not sure if this should be CW or not, you're welcome to comment if you think it should be).
At my workplace, we have many many different file formats for all kinds of purposes. Most, if not all, of these file formats are just written in plain text, with no consistency. I'm only a student working part-time, and I have no experience with using xml in production, but it seems to me that using xml would improve productivity, as we often need to parse, check and compare these outputs.
So my questions are: given that I can only control one small application and its output (only - the inputs are formats that are used in other applications as well), is it worth trying to change the output to be xml-based? If so, what are the best known ways to do that in C++ (i.e., xml parsers/writers, etc.)? Also, should I also provide a plain-text output to make it easy for the users (which are also programmers) to get used to xml? Should I provide a script to translate xml-plaintext? What are your experiences with this subject?
Thanks.

Don't just use XML because it's XML.
Use XML because:
other applications (that only accept XML) are going to read your output
you have an hierarchical data structure that lends itself perfectly for XML
you want to transform the data to other formats using XSL (e.g. to HTML)
EDIT:
A nice personal experience:
Customer: your application MUST be able to read XML.
Me: Er, OK, I will adapt my application so it can read XML.
Same customer (a few days later): your application MUST be able to read fixed width files, because we just realized our mainframe cannot generate XML.

Amir, to parse an XML you can use TinyXML which is incredibly easy to use and start with. Check its documentation for a quick brief, and read carefully the "what it does not do" clause. Been using it for reading and all I can say is that this tiny library does the job, very well.
As for writing - if your XML files aren't complex you might build them manually with a string object. "Aren't complex" for me means that you're only going to store text at most.
For more complex XML reading/writing you better check Xerces which is heavier than TinyXML. I haven't used it yet I've seen it in production and it does deliver it.

You can try using the boost::property_tree class.
http://www.boost.org/doc/libs/1_43_0/doc/html/property_tree.html
http://www.boost.org/doc/libs/1_43_0/doc/html/boost_propertytree/tutorial.html
http://www.boost.org/doc/libs/1_43_0/doc/html/boost_propertytree/parsers.html#boost_propertytree.parsers.xml_parser
It's pretty easy to use, but the page does warn that it doesn't support the XML format completely. If you do use this though, it gives you the freedom to easily use XML, INI, JSON, or INFO files without changing more than just the read_xml line.
If you want that ability though, you should avoid xml attributes. To use an attribute, you have to look at the key , which won't transfer between filetypes (although you can manually create your own subnodes).
Although using TinyXML is probably better. I've seen it used before in a couple of projects I've worked on, but don't have any experience with it.

Another approach to handling XML in your application is to use a data binding tool, such as CodeSynthesis XSD. Such a tool will generate C++ classes that hide all the gory details of parsing/serializing XML -- all that you see are objects corresponding to your XML vocabulary and functions that you can call to get/set the data, for example:
Person p = person ("person.xml");
cout << p.name ();
p.name ("John");
p.age (30);
ofstream ofs ("person.xml");
person (ofs, p);

Here's what previous SO threads have said on the topic. Please add others you know of that are relevant:
What is the best open XML parser for C++?
What is XML good for and when should i be using it?
What are good alternative data formats to XML?

BTW, before you decide on an XML parser, you may want to make sure that it will actually be able to parse all XML documents instead of just the "simple" ones, as discussed in this article:
Are you using a real XML parser?

C++ Logger-Should I use an ordinary xml parser?

I'm working on a logging system for my 2D engine, and I'm confused on how I should go about creating/editing the file, and how I should output that file.
I've learned that XML is more of a data carrier rather than a data displayer like HTML is. I've read that I can use XML to HTML converters. One method I've thought about is writing characters to a file in HTML.
Clarity on these matters is what I ask of you, stack overflow.

Creating an XML (or HTML) file doesn't need any special library. Straightforward string concatenation is usually good enough, you may have to encode some special characters (e.g. > into >.
But as Owen says, plain text is a log more common for log files. One reasonable compromise is comma-separated values in a text file, this gives you a little bit of structure without much overhead. For example, the Windows web server (IIS) uses this format by default, and if you have some fields that are output for each line such as timestamp or source filename and line number, this makes it easy to separate those out again.

Just about every log I've ever worked with has been pure text delimited by newlines. If you're going to depart from that, you may want to ask yourself what it is about your logging needs that you want to accomplish with markup.
If you must go the way of markup, I would suggest an XML format that contains a minimal set of markup that would be useful in your situation. You could use XML to capture structure in your log entries (timestamp, severity, and operational code, for example) that would be inconvenient to code for in HTML.
Note that you could also go hybrid and embed some XHTML tags in an XML element whose purpose is to capture displayable text, if you want.

The problem with XML or HTML files is that you cannot append at any time. You have to close the final tag (document tag) properly at the end of writing.
Therefore, it's not a popular format for logging.
For logging, I suggest using one of the existing log engines, such as Apache logger, or, John Torjo's boost log candidate. They will support log levels, runtime configuration, etc.

If you are considering writing logs in XML files, please, stop.
Log files should be simple plain text files, XML-izing it is introducing needless complexity. They are not structured data, they are meant to be read by people, not automated tools.
It all starts with XML logs, and then it goes downhill from there.

XML Parsing Problem

I have an XML parser that crashes on incomplete XML data. So XML data fed to it could be one of the following:
<one><two>twocontent</two</one>
<a/><b/> ( the parser treats it as two root elements )
Element attributes are also handled ( though not shown above ).
Now, the problem is when I read data from socket I get data in fragments. For example:
<one>one
content</two>
</one>
Thus, before sending the XML to the parser I have to construct a valid XML and send it.
What programming construct ( like iteration, recursion etc ) would be the best fit for this kind of scenario.
I am programming in C++.
Please help.

Short answer: You're doing it wrong.
Your question confuses two separate issues:
Parsing of data that is not well-formed XML at all, i.e. so-called tag soup.
Example: Files generated by programmers who do not understand XML or have lousy coding practices.
It is not unfair to say: A file that is not well-formed XML is not an XML document at all. Every correct XML parser will reject it. Ideally you would work to correct the source of this data and make sure that proper XML is generated instead.
Alternatively, use a tag soup parser, i.e. a parser that does error correction.
Useful tag soup parsers are often actually HTML parsers. tidy has already been pointed out in another answer.
Make certain that you understand what correction steps such a parser actually performs, since there is no universal approach that could fix XML. Tidy in particular is very aggressive at "repairing" the data, more aggressive than real browsers and the HTML 5 spec, for example.
XML parsing from a socket, where data arrives chunk-by-chunk in a stream. In this situation, the XML document might be viewed as "infinite", with chunks being processed as the appear, long before a final end tag for the root element has been seen.
Example: XMPP is a protocol that works like this.
The solution is to use a pull-based parser, for example the XMLTextReader API in libxml2.
If a tree-based data structure for the XML child elements being parser is required, you can build a tree structure for each such element that is being read, just not for the entire document.

What is feeding you the XML from the other end of the socket connection? It doesn't make sense that you should be missing stuff, as you illustrate, just because you receive it from a socket.
If the socket is using TCP (or a custom protocol with similar properties), you should not be missing parts of your XML. Thus, you should be able to just buffer it all until the other end signals "end of document", and then feed it to your picky XML parser.
If you are using UDP or some other "lossy" protocol, you need to reconsider, since it's obviously not possible to correctly transfer a large XML document over a channel that randomly drops pieces.

Because the XML structure is a hierarchic structure (a tree) a recursion would be the best way to approach this.
You can call the recursion on each child and fix the missing XML identifiers.
Basically, you'll be doing the same thing a DOM object parser would do, only you'll parse the file in order to fix it's structure.
One thing though, it seems to me as if in this method you are going to re-write the XML parser. Isn't it a waist of time?
Maybe it's better to find a way for the XML to arrive in the right structure rather than trying to fix it.

Are there multiple writers? Why isn't your parser validating the XML?
Use a tree, where every node represents an element and carries with it a dirty bit. The first occurrence of the node marks it as dirty i.e. you are expecting a closing tag, unless of course the node is of the form <a/>. Also, the first element, you encounter is the root.
When you hit a dirty node, keep pushing nodes in a stack, until you hit the closing tag, when you pop the contents.

In your example, how are you going to figure out exactly where in the content to put the opening <two> tag once you have detected it is missing? This is, as they say, non-trivial.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js