Which libxml2 API should I use for large files?

Which libxml2 API should I use for large files? - c++

Our program currently uses the libxml2 DOM API (xmlReadFile) to load an entire file into memory. Unfortunately, this breaks down on "large" XML files, as the basic memory consumption of libxml2 DOM is about 4-5 times the base file size.
It seems libxml2 offers two APIs for reading XML when I don't want to store the whole tree in memory: SAX2 and xmlReader.
I haven't dug into the APIs yet, but I'm wondering which one is preferable under which circumstances?
Note: All I need to do with the XML file is populate some C++ datastructures with the data found in the XML file. And these will in turn be a lot smaller than the (very verbose) XML definition. At the moment, with xmlReadFile and the DOM API the process takes about 100MB memory for a 20MB XML file. The C++ data in memory for such a file is more like 5MB -- so I could go from 1:4 to 4:1, which would already help a lot.

I follow this approach, if the processing is sparse (need only an element here and there) xmlReader is better, if you need to process all elements, SAX is better. Although, opinion could come in to play as to whether you want to push the processing or you want the processing to push your code...

If you need to process large XML documents then size becomes the primary consideration. As you saw with 20MB -> 100MB for DOM parsing, if you get much larger than this that can be prohibitively expensive and SAX may be the only way to process it. For embedded or memory constrained devices SAX may be required even for small files.
If you want to start parsing before the file is complete SAX is the way to go. If you are writing a browser, are streaming XML, or require responsiveness then you will need to use SAX.
SAX is more of a pain, if you can get away with DOM parsing that will usually lead to less code and simpler code, for simpler DOM queries you can avoid a state machine for example. If you only care about a handful of fields in the document you could even avoid querying a DOM parser directly and query XSLT instead.

Related

How to get correct data in a large xml file?

I have a large xml file (contains about few million records) and need to get about 100 records (based on id or something like that)
I tried TinyXml and Xalan-C but both of them using DOM, therefore it cause a out of memory issue.
Is there a C/C++ library that can do that without loading all data to memory as DOM?

How about Apache Xerces?
It's pretty damn mature and is optimized for performance (i.e. it won't read your complete files into memory!).

You need a SAX parser like Xerces

The Saxon-EE XSLT processor can handle a subset of XSLT in streaming mode (that is, without building a tree in memory). For details see
http://www.saxonica.com/documentation/sourcedocs/streaming.xml
It's not C/C++, but you don't say whether that's a hard constraint.

Is there a lightweight approach in producing XML with Xerces-C++?

This application runs on an embedded platform with low processing power and memory. I want to produce huge XML from the application. Currently I am constructing DOM and serializing into XML using Xerces-C++ 3.1.1. But the DOM construction takes long time and consumes lot of memory.
I know SAX is lightweight approach of parsing XML compared to DOM. Like that is there a lightweight approach for producing XML? Ofcourse I can produce the XML by concatenating strings but I didn't choose that approach because I want to make sure I produce a well-formed XML and sanitize the texts I include in it.

What you are looking for is normally called streaming serialization where parts of the document are written out as they become available instead of accumulation them all and writing them out at the end (which is what the DOM approach entails).
Xerces-C++ does not currently have streaming serialization support. But it is not very difficult to emulate it using DOM. The idea is to construct a DOM document fragment when a chunk of your data is ready to be serialized, write it out using the DOMWriter API, and free it once done. When you have another chunk ready, repeat the above steps. The result is an application that uses only a fraction of the memory that would be required to create the complete document.
We use this approach in CodeSynthesis XSD, an XML data binding toolkit for C++, to be able to handle XML documents that are too big to fit into memory. In fact, we have written some helper classes that simplify all this and wich you can find as part of the 'streaming' example in the examples/cxx/tree/ directory (the example code is public domain so feel free to borrow it ;-)).

xml parsing with constant memory usage

i am trying to find xml parser with xpath support that uses small amount of memory , or rather constant amount of memory , i am trying to parse large xml files , like almost 1 Giga , i have been reading about xqilla , and it seems that is uses very large amount of memory because it is dom based, correct me if i'm wrong..
anyways , any idea for such xml parser for C++ & linux ?

If you can process the XML in essentially a single pass, a SAX parser would be a good idea. How about Apache Xerces C++?

Saxon-EE supports streaming of large XML documents using XSLT or XQuery (streaming is better supported in XSLT than in XQuery). Details at
Streaming of Large Documents

You might look at
pugixml enables very fast, convenient and memory-efficient XML document processing. However, since pugixml has a DOM parser, it can't process XML documents that do not fit in memory; also the parser is a non-validating one, so if you need DTD/Schema validation, the library is not for you
However, it is explicitely not a streaming parser. I know streaming and xpath do not generally jive well (due to potential random-access requirements). Allthough, in .NET the ever-famous XPathReader seemed to have bridged the gap for a popular subset of XPath :)

High performance XML parsing in C++

Well a lot of questions have been made about parsing XML in C++ and so on...
But, instead of a generic problem, mine is very specific.
I am asking for a very efficient XML parser for C++. In particular I have a VERY VERY BIG XML file to parse.
My application must open this file and retrieve data. It must also insert new nodes and save the final result in the file again.
To do this I used, at the beginning, rapidxml, but it requires me to open the file, parse it all (all the content because this lib has no functions to access the file directly without loading the entire tree first), then edit the tree, modify it and store the final tree on the file by overwriting it... It consumes too much resources.
Is there an XML parser that does not require me to load the entire file, but that I can use to insert, quickly, new nodes and retrieve data? Can you please indicate solutions for this problem of mine?

You want a streaming XML parser rather than what is called a DOM parser.
There are two types of streaming parsers: pull and push. A pull parser is good for quickly writing XML parsers that load data into program memory. A push parser is good for writing a program to translate one document to another (which is what you are trying to accomplish). I think, therefore, that a push parser would be best for your problem.
In order to use a push parser, you need to write what is essentially an event handler for parsing events. By "parsing event", I mean events like "start tag reached", "end tag reached", "text found", "attribute parsed", etc.
I suggest that as you read in the document, you write out the transformed document to a separate, temporary file. Thus, your XML parsing event handlers will need to be written so that they are stateful and write out the XML of the translated document incrementally.
Three excellent push parser libraries for C++ include Expat, Xerces-C++, and libxml2.

Search for "SAX parser". They are mostly tokenizers, i.e. they emit tag by tag without building a tree.

SAX parsers are faster than DOM parsers because DOM parsers read the entire file into memory before building an in-memory representation of the XML document, whereas a SAX parser behaves like an event listener and builds the document as it reads in the file. Go here for an explanation.
As you mentioned Xerces is a good C++ SAX parser.
I would recommend looking into ways of breaking the XML document into smaller XML documents as that seems to be part of your problem.

Okay, here is one off the beaten track, I looked at this, but haven't really used it myself, it's called asmxml. These boys claim performance bar none, downside, you need x86 assembler.

If you really seek high performance XML stream parser then libhpxml is likely the right thing for you.

I’m convinced that no XML library exists that allows you to modify a file without loading it first. This simply isn’t possible because files don’t work that way: you cannot insert (or remove) in the middle of a file. You can only overwrite a block of identical size, or append at the end. But your request would require to append or remove in the middle of the file.
Reading only parts of an XML file may be possible. But writing … no way.

Go for template libraries as much as possible, like Boost::property_tree or Boost::XMLParser or POCO::XML and Folly has XML Parser in it.
Avoid old C libraries, it all old code designs.

someone say QtXML module is high performance for huge XML files.

which is the most efficient XML Parser for C++?

I need to write an application that fetches element name value (time-series data) pair from any xml source, be it file, web server, any other server. the application would consume the XML and take out values of interest, it has to be very very fast (lets say 50000 events/seconds or more) also the XML document size would be huge and frequency of these document could be high as well (for ex. 2500 files/min - more than 500MB of XML data/file).
I just want to see how you experienced people think I should approach this. I am a novice who just got started although I can do any solution you suggest me, no matter how tough/easy.
Thank you very much.

If you use SAX parsing, your bottleneck is the I/O involved, not the XML string processing. And given your 500 MB number, I'd say you'd have to do SAX parsing instead of DOM parsing. So, anything with a SAX type interface should be just fine.

I'm a fan of Xerces, I think you are going to have to try them out to see what has the best performance for your application. Like Warren said you will want to use SAX processing. Realistically if you truly need the performance you should use a specialized XML appliance to do the processing.

I use libxml2 in our projects. It supports both SAX and DOM.
As Warren Young said, you should use SAX. You could give Expat a try.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Which libxml2 API should I use for large files? - c++

I follow this approach, if the processing is sparse (need only an element here and there) xmlReader is better, if you need to process all elements, SAX is better. Although, opinion could come in to play as to whether you want to push the processing or you want the processing to push your code...

Related

How to get correct data in a large xml file?

Is there a lightweight approach in producing XML with Xerces-C++?

xml parsing with constant memory usage

High performance XML parsing in C++

which is the most efficient XML Parser for C++?

Categories

Resources