What's the most commonly used XML library for C++? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 9 months ago.
I saw a few libraries through a quick Google search. What's generally the most commonly used XML implementation for C++?
I'm planning on using XML as a means for program configuration. I like XML because I'll be making use of its tree-like structure. If you think you have a more suitable solution for this, feel free to mention it. I want something lightweight and simple. Maybe XML is too much?
Edit: Cross-platform would be preferable, but to answer the question: I'm programming this on Linux.

See if TinyXML helps you
TinyXML is a simple, small, C++ XML parser that can be easily integrated into other programs.
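For a rough idea of what that looks like, here is a minimal sketch of loading a config file with TinyXML (the file name and element names are made up):

#include <tinyxml.h>
#include <cstdio>

int main()
{
    TiXmlDocument doc("config.xml");                 // hypothetical config file
    if (!doc.LoadFile())
    {
        std::fprintf(stderr, "load failed: %s\n", doc.ErrorDesc());
        return 1;
    }

    TiXmlElement* root = doc.RootElement();          // e.g. <config>
    for (TiXmlElement* e = root->FirstChildElement("option"); e; e = e->NextSiblingElement("option"))
    {
        const char* name  = e->Attribute("name");    // returns NULL if the attribute is missing
        const char* value = e->Attribute("value");
        std::printf("%s = %s\n", name ? name : "?", value ? value : "?");
    }
    return 0;
}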

There are several out there:
Xerces (big): http://xerces.apache.org/xerces-c/
Expat (small): http://expat.sourceforge.net/
I like expat. But that's a totally personal opinion.
I use it because it is small and it was simple to write a C++ wrapper for.
Xerces is the full-blown XML parser with all the knobs and whistles.
But consequently it is slightly more complex to use.
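To give a feel for Expat's callback (SAX-like) style, here is a minimal sketch; the document string and the amount of error handling are purely illustrative:

#include <expat.h>
#include <cstdio>
#include <cstring>

// Expat calls these handlers as it encounters start and end tags.
static void XMLCALL onStartElement(void* userData, const XML_Char* name, const XML_Char** atts)
{
    int* depth = static_cast<int*>(userData);
    std::printf("%*s<%s>\n", *depth * 2, "", name);
    ++*depth;
}

static void XMLCALL onEndElement(void* userData, const XML_Char* name)
{
    int* depth = static_cast<int*>(userData);
    --*depth;
}

int main()
{
    int depth = 0;
    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetUserData(parser, &depth);
    XML_SetElementHandler(parser, onStartElement, onEndElement);

    const char* xml = "<config><option name='verbose'/></config>";
    if (XML_Parse(parser, xml, (int)std::strlen(xml), 1) == XML_STATUS_ERROR)
        std::fprintf(stderr, "parse error: %s\n", XML_ErrorString(XML_GetErrorCode(parser)));

    XML_ParserFree(parser);
    return 0;
}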

I would recommend not using XML.
I know this is a matter of opinion, but XML really clutters the information with a lot of tags. Also, even though it is human-readable, the clutter actually hampers readability (and I say this from experience, since we have some 134 XML configuration files at the moment...). Furthermore, it is quite difficult to read because of the mix of attributes and plain text: you never know which one you are going to need.
I would recommend using JSON if you want a format that already has well-defined parsers.
For parsing, a quick look at json.org gives you a long list of C++ libraries.
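To illustrate the attribute-versus-text ambiguity with a made-up setting: in XML both of these are plausible, while JSON has only one natural spelling:

<server host="example.com"><timeout>30</timeout></server>
<server><host>example.com</host><timeout>30</timeout></server>

{ "server": { "host": "example.com", "timeout": 30 } }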

Not quite the question you asked, but there are two major flavors of XML parsers, SAX and DOM.
SAX parsers are event-driven parsers. As the parser sees various elements within the XML document (nodes, attributes, etc.), it calls some function or method that you have defined.
DOM parsers, on the other hand, parse the entire XML document and return a tree structure that represents the whole document. Your code can then poke through the structure in any order it sees fit.
SAX parsers are more memory efficient because they do not need to represent the entire document in memory. DOM parsers are easier to work with because you are not limited to processing the document in a linear fashion.

The XML libraries I've used and am still using are:
libxml2 (http://xmlsoft.org/)
Xerces / Expat
Xalan-C
If you don't need to use XML then I would suggest not doing so.
I would also avoid modelling what you are reading/writing as C++ classes unless you are using a code generator.
I would also look at using a 'schema to code' generator for reading/writing, though make sure that the license fits what you are doing.

I highly recommend pugixml
"pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings."
I have tested a few XML parsers including a few commercial ones before choosing and using pugixml in a commercial product.
pugixml was not only the fastest parser (sometimes several times faster than the others) but also had the most mature and friendly API. I highly recommend it. It is a very stable product! I started using it at version 0.8; it is now at 1.7.
The great bonus in this parser is the XPath 1.0 implementation! For more complex tree queries, XPath is a godsend!
The DOM-like interface with rich traversal/modification capabilities is extremely useful for tackling real-life "heavy" XML files.
It is a small and fast parser. It is a good choice for an iOS or Android app if you do not mind linking C++ code.
I also tested TinyXML. It was not only slower, but it also had problems with my XML files.
Benchmarks tell a lot:
http://pugixml.org/benchmark.html
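As a rough illustration of the API (the file and element names are made up), loading a document and running an XPath query looks something like this:

#include <pugixml.hpp>
#include <iostream>

int main()
{
    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_file("data.xml");   // hypothetical input file
    if (!result)
    {
        std::cerr << "parse error: " << result.description() << "\n";
        return 1;
    }

    // XPath 1.0: every <record> element whose status attribute is "active"
    pugi::xpath_node_set records = doc.select_nodes("//record[@status='active']");
    for (pugi::xpath_node xn : records)
    {
        pugi::xml_node record = xn.node();
        std::cout << record.attribute("id").value() << ": "
                  << record.child("description").text().get() << "\n";
    }
    return 0;
}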

Related

How to read xml file in c++ [duplicate]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
I have XML documents that I need to parse and/or I need to build XML documents and write them to text (either files or memory). Since the C++ standard library does not have a library for this, what should I use?
Note: This is intended to be a definitive, C++-FAQ-style question for this. So yes, it is a duplicate of others. I did not simply appropriate those other questions because they tended to ask for something slightly more specific. This question is more generic.
Just like with standard library containers, what library you should use depends on your needs.
So the first question is this: What do you need?
I Need Full XML Compliance
OK, so you need to process XML. Not toy XML, real XML. You need to be able to read and write all of the XML specification, not just the low-lying, easy-to-parse bits. You need Namespaces, DocTypes, entity substitution, the works. The W3C XML Specification, in its entirety.
The next question is: Does your API need to conform to DOM or SAX?
I Need Exact DOM and/or SAX Conformance
OK, so you really need the API to be DOM and/or SAX. It can't just be a SAX-style push parser, or a DOM-style retained parser. It must be the actual DOM or the actual SAX, to the extent that C++ allows.
You have chosen:
Xerces
That's your choice. It's pretty much the only C++ XML parser/writer that has full (or as near as C++ allows) DOM and SAX conformance. It also has XInclude support, XML Schema support, and a plethora of other features.
It has no real dependencies. It uses the Apache license.
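A minimal sketch of reading a file with the Xerces-C++ DOM parser might look roughly like this (the file name is made up and error handling is trimmed):

#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/dom/DOM.hpp>
#include <iostream>

int main()
{
    using namespace xercesc;
    XMLPlatformUtils::Initialize();                  // required before any other Xerces call
    {
        XercesDOMParser parser;
        parser.setValidationScheme(XercesDOMParser::Val_Never);
        parser.parse("data.xml");                    // hypothetical input file

        DOMDocument* doc  = parser.getDocument();
        DOMElement*  root = doc->getDocumentElement();

        char* name = XMLString::transcode(root->getTagName());   // Xerces strings are XMLCh*
        std::cout << "root element: " << name << "\n";
        XMLString::release(&name);
    }                                                // destroy the parser before Terminate()
    XMLPlatformUtils::Terminate();
    return 0;
}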
I Don't Care About DOM and/or SAX Conformance
You have chosen:
LibXML2
LibXML2 offers a C-style interface (if that really bothers you, go use Xerces), though the interface is at least somewhat object-based and easily wrapped. It provides a lot of features, like XInclude support (with callbacks so that you can tell it where it gets the file from), an XPath 1.0 recognizer, RelaxNG and Schematron support (though the error messages leave a lot to be desired), and so forth.
It does have a dependency on iconv, but it can be configured without that dependency. Though that does mean that you'll have a more limited set of possible text encodings it can parse.
It uses the MIT license.
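For a feel of the C-style interface, here is a rough sketch (the file and attribute names are invented):

#include <libxml/parser.h>
#include <libxml/tree.h>
#include <cstdio>

int main()
{
    xmlDocPtr doc = xmlReadFile("data.xml", NULL, 0);   // hypothetical input file
    if (doc == NULL)
        return 1;

    xmlNodePtr root = xmlDocGetRootElement(doc);
    for (xmlNodePtr cur = root->children; cur != NULL; cur = cur->next)
    {
        if (cur->type != XML_ELEMENT_NODE)
            continue;
        xmlChar* id = xmlGetProp(cur, BAD_CAST "id");    // returns a copy, or NULL if absent
        std::printf("<%s id=%s>\n", (const char*)cur->name, id ? (const char*)id : "(none)");
        xmlFree(id);
    }

    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}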
I Do Not Need Full XML Compliance
OK, so full XML compliance doesn't matter to you. Your XML documents are either fully under your control or are guaranteed to use the "basic subset" of XML: no namespaces, entities, etc.
So what does matter to you? The next question is: What is the most important thing to you in your XML work?
Maximum XML Parsing Performance
Your application needs to take XML and turn it into C++ data structures as fast as this conversion can possibly happen.
You have chosen:
RapidXML
This XML parser is exactly what it says on the tin: rapid XML. It doesn't even deal with pulling the file into memory; how that happens is up to you. What it does deal with is parsing that into a series of C++ data structures that you can access. And it does this about as fast as it takes to scan the file byte by byte.
Of course, there's no such thing as a free lunch. Like most XML parsers that don't care about the XML specification, RapidXML doesn't touch namespaces, DocTypes, entities (with the exception of character entities and the 6 basic XML ones), and so forth. So basically nodes, elements, attributes, and such.
Also, it is a DOM-style parser. So it does require that you read all of the text in. However, what it doesn't do is copy any of that text (usually). The way RapidXML gets most of its speed is by referring to strings in-place. This requires more memory management on your part (you must keep that string alive while RapidXML is looking at it).
RapidXML's DOM is bare-bones. You can get string values for things. You can search for attributes by name. That's about it. There are no convenience functions to turn attributes into other values (numbers, dates, etc). You just get strings.
One other downside with RapidXML is that it is painful for writing XML. It requires you to do a lot of explicit memory allocation of string names in order to build its DOM. It does provide a kind of string buffer, but that still requires a lot of explicit work on your end. It's certainly functional, but it's a pain to use.
It uses the MIT licence. It is a header-only library with no dependencies.
There is a RapidXML "GitHub patch" that allows it to also work with namespaces.
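To make the point about writing concrete, here is a rough sketch of building a tiny document with RapidXML; the element names are invented, and note the explicit allocate_string call for anything that is not a string literal:

#include <rapidxml.hpp>
#include <rapidxml_print.hpp>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    rapidxml::xml_document<> doc;    // the document owns a memory pool for nodes and strings

    rapidxml::xml_node<>* root = doc.allocate_node(rapidxml::node_element, "config");
    doc.append_node(root);

    std::string runtimeValue = "example.com";
    // Strings not known at compile time must be copied into the document's pool,
    // because RapidXML only stores pointers:
    char* value = doc.allocate_string(runtimeValue.c_str());
    rapidxml::xml_node<>* host = doc.allocate_node(rapidxml::node_element, "host", value);
    host->append_attribute(doc.allocate_attribute("port", "8080"));
    root->append_node(host);

    rapidxml::print(std::ostream_iterator<char>(std::cout), doc, 0);   // serialize to stdout
    return 0;
}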
I Care About Performance But Not Quite That Much
Yes, performance matters to you. But maybe you need something a bit less bare-bones. Maybe something that can handle more Unicode, or doesn't require so much user-controlled memory management. Performance is still important, but you want something a little less direct.
You have chosen:
PugiXML
Historically, this served as inspiration for RapidXML. But the two projects have diverged, with Pugi offering more features, while RapidXML is focused entirely on speed.
PugiXML offers Unicode conversion support, so if you have some UTF-16 docs around and want to read them as UTF-8, Pugi will provide. It even has an XPath 1.0 implementation, if you need that sort of thing.
But Pugi is still quite fast. Like RapidXML, it has no dependencies and is distributed under the MIT License.
Reading Huge Documents
You need to read documents that are measured in the gigabytes in size. Maybe you're getting them from stdin, being fed by some other process. Or you're reading them from massive files. Or whatever. The point is, what you need is to not have to read the entire file into memory all at once in order to process it.
You have chosen:
LibXML2
Xerces's SAX-style API will work in this capacity, but LibXML2 is here because it's a bit easier to work with. A SAX-style API is a push-API: it starts parsing a stream and just fires off events that you have to catch. You are forced to manage context, state, and so forth. Code that reads a SAX-style API is a lot more spread out than one might hope.
LibXML2's xmlReader object is a pull-API. You ask it to go to the next XML node or element, rather than being told about it through a callback. This allows you to store context as you see fit, to handle different entities in a way that's much more readable in code than a bunch of callbacks.
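A rough sketch of that pull style (the file name is invented and error handling is trimmed):

#include <libxml/xmlreader.h>
#include <cstdio>

int main()
{
    xmlTextReaderPtr reader = xmlReaderForFile("huge.xml", NULL, 0);   // hypothetical input file
    if (reader == NULL)
        return 1;

    // You ask for the next node when you are ready for it, instead of being called back.
    while (xmlTextReaderRead(reader) == 1)
    {
        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT)
        {
            const xmlChar* name = xmlTextReaderConstName(reader);
            std::printf("element: %s (depth %d)\n", (const char*)name, xmlTextReaderDepth(reader));
        }
    }

    xmlFreeTextReader(reader);
    xmlCleanupParser();
    return 0;
}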
Alternatives
Expat
Expat is a well-known XML parser written in C, with a stream-oriented, callback-based (push) API; C++ wrappers are available. It was written by James Clark.
Its current status is active. The most recent version is 2.2.9, which was released on 2019-09-25.
LlamaXML
It is an implementation of a StAX-style API. It is a pull parser, similar to LibXML2's xmlReader.
But it hasn't been updated since 2005. So again, Caveat Emptor.
XPath Support
XPath is a system for querying elements within an XML tree. It's a handy way of effectively naming an element or collection of elements by common properties, using a standardized syntax. Many XML libraries offer XPath support.
There are effectively three choices here:
LibXML2: It provides full XPath 1.0 support. Again, it is a C API, so if that bothers you, there are alternatives.
PugiXML: It comes with XPath 1.0 support as well. As above, it's more of a C++ API than LibXML2, so you may be more comfortable with it.
TinyXML: It does not come with XPath support, but there is the TinyXPath library that provides it. TinyXML is undergoing a conversion to version 2.0, which significantly changes the API, so TinyXPath may not work with the new API. Like TinyXML itself, TinyXPath is distributed under the zLib license.
Just Get The Job Done
So, you don't care about XML correctness. Performance isn't an issue for you. Streaming is irrelevant. All you want is something that gets XML into memory and allows you to stick it back onto disk again. What you care about is API.
You want an XML parser that's going to be small, easy to install, trivial to use, and small enough to be irrelevant to your eventual executable's size.
You have chosen:
TinyXML
I put TinyXML in this slot because it is about as braindead simple to use as XML parsers get. Yes, it's slow, but it's simple and obvious. It has a lot of convenience functions for converting attributes and so forth.
Writing XML is no problem in TinyXML. You just new up some objects, attach them together, send the document to a std::ostream, and everyone's happy.
There is also something of an ecosystem built around TinyXML, with a more iterator-friendly API, and even an XPath 1.0 implementation layered on top of it.
TinyXML uses the zLib license, which is more or less the MIT License with a different name.
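A rough sketch of that writing style (element names invented):

#include <tinyxml.h>

int main()
{
    TiXmlDocument doc;
    doc.LinkEndChild(new TiXmlDeclaration("1.0", "UTF-8", ""));

    TiXmlElement* root = new TiXmlElement("config");
    doc.LinkEndChild(root);                          // the document takes ownership

    TiXmlElement* server = new TiXmlElement("server");
    server->SetAttribute("host", "example.com");
    server->SetAttribute("port", 8080);
    root->LinkEndChild(server);

    doc.SaveFile("config.xml");                      // or stream it to a std::ostream
    return 0;
}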
There is another approach to handling XML that you may want to consider, called XML data binding, especially if you already have a formal specification of your XML vocabulary, for example, in XML Schema.
XML data binding allows you to use XML without actually doing any XML parsing or serialization. A data binding compiler auto-generates all the low-level code and presents the parsed data as C++ classes that correspond to your application domain. You then work with this data by calling functions, and working with C++ types (int, double, etc) instead of comparing strings and parsing text (which is what you do with low-level XML access APIs such as DOM or SAX).
See, for example, an open-source XML data binding implementation that I wrote, CodeSynthesis XSD, and, for a lighter-weight, dependency-free version, CodeSynthesis XSD/e.
One other note about Expat: it's worth looking at for embedded systems work. However, the documentation you are likely to find on the web is ancient and wrong. The source code actually has fairly thorough function-level comments, but it will take some perusing for them to make sense.
OK then. I've created a new one, since none of the libraries on the list satisfied my needs.
Benefits:
Pull-parser/streaming API, i.e. the parser works like an iterator; there are no callbacks and no DOM tree, so you read XML straight into your own data structures
Exceptions and RTTI can be turned off with compiler options; error handling can be done via std::error_code
Bounded memory usage and support for large files (tested with a 100 MiB XMark file; speed depends on hardware); there is an example of limited COLLADA 3D model loading
Unicode support, with auto-detection of the input encoding
Project home
Adding mine as well:
http://www.codeproject.com/Articles/998388/XMLplusplus-version-The-Cplusplus-update-of-my-XML
No XML validation features, but it is fast.
At Secured Globe, Inc. we use RapidXML. We tried all the others, but RapidXML seems to be the best choice for us.
Here is an example:
// Assumes: xmlData is a mutable, null-terminated buffer holding the document
// (RapidXML parses it in place), result is a caller-supplied wide-character buffer,
// and GetNodeByElementName() is our own helper that finds a child element by name.
rapidxml::xml_document<char> doc;
doc.parse<0>(xmlData);

rapidxml::xml_node<char>* root = doc.first_node();
rapidxml::xml_node<char>* node_account = 0;
if (GetNodeByElementName(root, "Account", &node_account))
{
    rapidxml::xml_node<char>* node_default = 0;
    if (GetNodeByElementName(node_account, "default", &node_default))
    {
        // Copy the element's narrow text into the wide result buffer.
        swprintf(result, 100, L"%hs", node_default->value());
        free(xmlData);
        return true;
    }
}
free(xmlData);

A lightweight XML parser efficient for large files?

I need to parse potentially huge XML files, so I guess this rules out DOM parsers.
Is out there any good lightweight SAX parser for C++, comparable with TinyXML on footprint?
The structure of XML is very simple, no advanced things like namespaces and DTDs are needed. Just elements, attributes and cdata.
I know about Xerces, but its sheer size of over 50 MB gives me shivers.
Thanks!
If you are using C, then you can use LibXML from the GNOME project. You can choose between DOM and SAX interfaces to your document, plus lots of additional features that have been developed over the years. If you really want C++, then you can use libxml++, which is a C++ OO wrapper around LibXML.
The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.
I like Expat.
http://expat.sourceforge.net/
It is C based but there are several C++ wrappers around to help.
RapidXML is quite a fast parser for XML written in C++.
http://sourceforge.net/projects/wsdlpull is a straight C++ port of the Java XmlPull API (http://www.xmlpull.org/).
I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support), but I have found it to be very fast with very little overhead. I had to make my own string and vector classes, and even with those it compiles to about 60 KB on Windows.
I think that pull parsing is a lot more intuitive than something like SAX. The code much more closely mirrors the XML document, making it easy to correlate the two.
The one downside is that it is forward-only, meaning that you need to parse the elements as they come. We have a fairly messed-up design for reading our config files, and I need to parse a whole subtree, make some checks, set some defaults, and then parse again. With this parser the only real way to handle something like that is to make a copy of the state, parse with that, and then continue on with the original. It still ends up being a big win in terms of resources versus our old DOM parser.
If your XML structure is very simple you can consider building a simple lexer/scanner based on lex/yacc (flex/bison) . The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.
See also the SAX2 interface in libxml
firstobject's CMarkup is a C++ class that works as a lightweight huge file pull parser (I recommend a pull parser rather than SAX), and huge XML file writer too. It adds up to about 250kb to your executable. When used in-memory it has 1/3 the footprint of tinyxml by one user's report. When used on a huge file it only holds a small buffer (like 16kb) in memory. CMarkup is currently a commercial product so it is supported, documented, and designed to be easy to add to your project with a single cpp and h file.
The easiest way to try it out is with a script in the free firstobject XML editor such as this:
ParseHugeXmlFile()
{
    CMarkup xml;
    xml.Open( "HugeFile.xml", MDF_READFILE );
    while ( xml.FindElem("//record") )
    {
        // process record...
        str sRecordId = xml.GetAttrib( "id" );
        xml.IntoElem();
        xml.FindElem( "description" );
        str sDescription = xml.GetData();
    }
    xml.Close();
}
From the File menu, select New Program, paste this in and modify it for your elements and attributes, press F9 to run it or F10 to step through it line by line.
You can try https://github.com/thinlizzy/die-xml. It seems to be very small and easy to use.
It is a recently made, open-source C++0x XML SAX parser, and the author welcomes feedback.
It parses an input stream and generates events via callbacks compatible with std::function.
The stack machine uses finite automata as a backend, and some events (start tags and text nodes) use iterators in order to minimize buffering, making it pretty lightweight.
I'd look at tools that generate a DTD/Schema-specific parser if you want small and fast. These are very good for huge documents.
I highly recommend pugixml
pugixml is a light-weight C++ XML processing library.
"pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings."
I have tested a few XML parsers including a few expensive ones before choosing and using pugixml in a commercial product.
pugixml was not only the fastest parser but also had the most mature and friendly API. I highly recommend it. It is a very stable product! I started using it at version 0.8; it is now at 1.7.
The great bonus in this parser is the XPath 1.0 implementation! For more complex tree queries, XPath is a godsend!
The DOM-like interface with rich traversal/modification capabilities is extremely useful for tackling real-life "heavy" XML files.
It is a small, fast parser. It is a good choice even for an iOS or Android app if you do not mind linking C++ code.
Benchmarks can tell a lot. See: http://pugixml.org/benchmark.html
A few examples (x86):
pugixml is more than 38 times faster than TinyXML,
4.1 times faster than CMarkup,
and 2.7 times faster than Expat or libxml.
On x64, pugixml is the fastest parser I know of.
Check also the usage of the memory by your XML parser. Some parsers just gobble precious memory!