XML parsing with constant memory usage - C++

I am trying to find an XML parser with XPath support that uses a small, or rather constant, amount of memory. I am trying to parse large XML files, almost 1 GB in size. I have been reading about XQilla, and it seems that it uses a very large amount of memory because it is DOM based; correct me if I'm wrong.
Anyway, any ideas for such an XML parser for C++ & Linux?

If you can process the XML in essentially a single pass, a SAX parser would be a good idea. How about Apache Xerces C++?
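To give a sense of what that looks like, here is a rough, untested sketch of a Xerces-C++ 3.x SAX2 handler that just counts elements while streaming through a file. The file name data.xml is made up; you would put your own extraction logic in startElement. Because the callbacks fire as the parser reads, memory use stays roughly constant no matter how large the input is.

    #include <xercesc/util/PlatformUtils.hpp>
    #include <xercesc/util/XMLString.hpp>
    #include <xercesc/sax2/SAX2XMLReader.hpp>
    #include <xercesc/sax2/XMLReaderFactory.hpp>
    #include <xercesc/sax2/DefaultHandler.hpp>
    #include <xercesc/sax2/Attributes.hpp>
    #include <iostream>

    using namespace xercesc;

    // Callbacks are invoked as the parser streams through the file.
    class CountHandler : public DefaultHandler {
    public:
        unsigned long elements = 0;
        void startElement(const XMLCh* const /*uri*/,
                          const XMLCh* const localname,
                          const XMLCh* const /*qname*/,
                          const Attributes& /*attrs*/) override {
            ++elements;
            char* name = XMLString::transcode(localname);
            // ... inspect 'name' / attributes and fill your own data structures here ...
            XMLString::release(&name);
        }
    };

    int main() {
        XMLPlatformUtils::Initialize();
        {
            SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();
            CountHandler handler;
            parser->setContentHandler(&handler);
            parser->setErrorHandler(&handler);
            parser->parse("data.xml");          // hypothetical file name
            std::cout << handler.elements << " elements\n";
            delete parser;
        }
        XMLPlatformUtils::Terminate();
        return 0;
    }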

Saxon-EE supports streaming of large XML documents using XSLT or XQuery (streaming is better supported in XSLT than in XQuery). Details at
Streaming of Large Documents

You might look at pugixml:
pugixml enables very fast, convenient and memory-efficient XML document processing. However, since pugixml has a DOM parser, it can't process XML documents that do not fit in memory; also the parser is a non-validating one, so if you need DTD/Schema validation, the library is not for you.
However, it is explicitly not a streaming parser. I know streaming and XPath do not generally go well together (due to potential random-access requirements), although in .NET the ever-famous XPathReader seemed to have bridged the gap for a popular subset of XPath :)
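For completeness, this is roughly what the pugixml + XPath combination looks like when the file does fit in memory (untested sketch; the file name and the record/name/id element and attribute names are made up):

    #include <iostream>
    #include "pugixml.hpp"

    int main() {
        // pugixml loads the whole document into memory (DOM), so this only
        // works when the file fits comfortably in RAM.
        pugi::xml_document doc;
        pugi::xml_parse_result result = doc.load_file("records.xml");   // hypothetical file
        if (!result) {
            std::cerr << "parse error: " << result.description() << "\n";
            return 1;
        }

        // XPath query over the in-memory tree.
        pugi::xpath_node_set hits = doc.select_nodes("//record[@id='42']");
        for (pugi::xpath_node hit : hits)
            std::cout << hit.node().child_value("name") << "\n";
        return 0;
    }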

Related

eXist-db and XQuery: XIncludes or collections (TEI-XML)?

I have a corpus in TEI-XML which uses a 'master' corpus XML document that then contains, via xi:include, thousands of other documents. Each of these documents itself contains xi:includes to master lists of named entities (people, places, etc., linked by xml:ids). All of this works very well in XSLT (and in my IDE Oxygen for fast encoding).
I am now embarking on building a website using eXist-db applications. I am rewriting everything directly in XQuery (to replace XSLT), and I have hit upon an unexpected decision. I am used to using xi:includes to traverse the corpus and the various XML files. But reading the documentation of eXist-db, it seems that the encouraged practice is to use collections and query them directly, instead of navigating via xi:includes. It also seems that eXist-db does not fully implement xi:include anyway and requires some workarounds?
I am looking for guidance as to best practices of eXist-DB/Xquery in this context.
Many thanks in advance.
Correct, eXist's XInclude implementation is focused on output (i.e., serialization) rather than on querying or indexing. As eXist's documentation page on XInclude states:
The XInclude processor is implemented as a filter in between the serializer's output event stream and the receiver... XInclude processing is therefore applied whenever eXist-db serializes an XML fragment, whether it's a document, the result of an XQuery or an XSLT stylesheet.
Thus, if you use XInclude to assemble your corpus and you want to query/traverse this corpus, you could do so by (1) writing a query to read your XInclude and following it like a map to find the component documents, (2) pre-serializing your data into a new document and then querying the resulting document directly, or (3) placing the documents into collections that facilitate the kinds of queries you want to do.
Depending on the size of those thousands of documents, traversing the XIncludes when running XQueries tends to be slow and quite memory intensive. In my experience Joe's option 3 is usually the way to go.
Unlike with straight-up XSLT, in eXist-db you can define indexes. E.g. you have a <listPerson> element as a wrapper for thousands of xi:includes pointing to <person> elements that are the roots of their own documents.
If you have defined an index for <person>, you can use e.g. ft:query() to query the index directly, irrespective of where in the tree of sub-collections and documents the element is located. This tends to be orders of magnitude faster than traversing the whole document starting at the master and resolving XIncludes.
As for validation, you will need to decide if a full validation run of the whole expanded document is really always necessary. This requires some fiddling, but there isn't much general advice I can offer without seeing the actual files and code.
You can find more information about indexing in eXist-db in the documentation.

Which libxml2 API should I use for large files?

Our program currently uses the libxml2 DOM API (xmlReadFile) to load an entire file into memory. Unfortunately, this breaks down on "large" XML files, as the basic memory consumption of libxml2 DOM is about 4-5 times the base file size.
It seems libxml2 offers two APIs for reading XML when I don't want to store the whole tree in memory: SAX2 and xmlReader.
I haven't dug into the APIs yet, but I'm wondering which one is preferable under which circumstances?
Note: All I need to do with the XML file is to populate some C++ data structures with the data found in it, and these will in turn be a lot smaller than the (very verbose) XML definition. At the moment, with xmlReadFile and the DOM API, the process takes about 100 MB of memory for a 20 MB XML file. The C++ data in memory for such a file is more like 5 MB -- so I could go from 1:4 to 4:1, which would already help a lot.
I follow this approach: if the processing is sparse (you need only an element here and there), xmlReader is better; if you need to process all elements, SAX is better. Although opinion comes into play as to whether you want to pull the data yourself (xmlReader) or have the parser push events into your code (SAX)...
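If it helps, a minimal (untested) xmlReader-style loop might look something like this; big.xml and the item element are made-up names. The idea is that you pull one node at a time, so memory stays roughly constant:

    #include <cstdio>
    #include <libxml/xmlreader.h>

    int main() {
        // xmlReader pulls one node at a time instead of building a tree.
        xmlTextReaderPtr reader = xmlReaderForFile("big.xml", nullptr, 0);   // hypothetical file
        if (!reader) {
            std::fprintf(stderr, "failed to open file\n");
            return 1;
        }

        while (xmlTextReaderRead(reader) == 1) {
            if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT) {
                const xmlChar* name = xmlTextReaderConstName(reader);
                // "item" is a made-up element name; react only to the parts you care about.
                if (name && xmlStrcmp(name, BAD_CAST "item") == 0) {
                    xmlChar* value = xmlTextReaderReadString(reader);   // text content of the element
                    if (value) {
                        std::printf("item: %s\n", value);
                        xmlFree(value);
                    }
                }
            }
        }

        xmlFreeTextReader(reader);
        xmlCleanupParser();
        return 0;
    }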
If you need to process large XML documents then size becomes the primary consideration. As you saw with 20MB -> 100MB for DOM parsing, if you get much larger than this that can be prohibitively expensive and SAX may be the only way to process it. For embedded or memory constrained devices SAX may be required even for small files.
If you want to start parsing before the file is complete SAX is the way to go. If you are writing a browser, are streaming XML, or require responsiveness then you will need to use SAX.
SAX is more of a pain; if you can get away with DOM parsing, that will usually lead to less and simpler code, and with simple DOM queries you can avoid a state machine, for example. If you only care about a handful of fields in the document, you could even avoid querying the DOM directly and use XSLT instead.
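To illustrate the state-machine point, here is a rough, untested sketch using libxml2's classic SAX callbacks (the newer SAX2 startElementNs variants add namespace information); the file name and the price element are made up:

    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <libxml/parser.h>

    // With SAX the parser pushes events at you, so you typically keep a small
    // state machine to remember where you are in the document.
    struct State {
        bool inPrice = false;   // "price" is a made-up element name
        std::string text;
    };

    static void onStartElement(void* ctx, const xmlChar* name, const xmlChar** /*attrs*/) {
        State* s = static_cast<State*>(ctx);
        if (xmlStrcmp(name, BAD_CAST "price") == 0) {
            s->inPrice = true;
            s->text.clear();
        }
    }

    static void onCharacters(void* ctx, const xmlChar* ch, int len) {
        State* s = static_cast<State*>(ctx);
        if (s->inPrice)
            s->text.append(reinterpret_cast<const char*>(ch), len);
    }

    static void onEndElement(void* ctx, const xmlChar* name) {
        State* s = static_cast<State*>(ctx);
        if (xmlStrcmp(name, BAD_CAST "price") == 0) {
            std::printf("price: %s\n", s->text.c_str());
            s->inPrice = false;
        }
    }

    int main() {
        xmlSAXHandler handler;
        std::memset(&handler, 0, sizeof(handler));
        handler.startElement = onStartElement;
        handler.characters   = onCharacters;
        handler.endElement   = onEndElement;

        State state;
        // hypothetical file name; xmlSAXUserParseFile returns 0 on success
        if (xmlSAXUserParseFile(&handler, &state, "big.xml") != 0)
            std::fprintf(stderr, "parse failed\n");
        xmlCleanupParser();
        return 0;
    }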

How to get the correct data from a large XML file?

I have a large XML file (containing about a few million records) and need to get about 100 of those records (based on an id or something like that).
I tried TinyXml and Xalan-C, but both of them use DOM, which causes an out-of-memory issue.
Is there a C/C++ library that can do that without loading all the data into memory as a DOM?
How about Apache Xerces?
It's pretty damn mature and is optimized for performance (i.e. it won't read your complete files into memory!).
You need a SAX parser like Xerces
The Saxon-EE XSLT processor can handle a subset of XSLT in streaming mode (that is, without building a tree in memory). For details see
http://www.saxonica.com/documentation/sourcedocs/streaming.xml
It's not C/C++, but you don't say whether that's a hard constraint.

Is there a lightweight approach to producing XML with Xerces-C++?

This application runs on an embedded platform with low processing power and memory. I want to produce huge XML documents from the application. Currently I am constructing a DOM and serializing it into XML using Xerces-C++ 3.1.1, but the DOM construction takes a long time and consumes a lot of memory.
I know SAX is a lightweight approach to parsing XML compared to DOM. Is there a similarly lightweight approach for producing XML? Of course I could produce the XML by concatenating strings, but I didn't choose that approach because I want to make sure I produce well-formed XML and sanitize the text I include in it.
What you are looking for is normally called streaming serialization, where parts of the document are written out as they become available instead of accumulating them all and writing them out at the end (which is what the DOM approach entails).
Xerces-C++ does not currently have streaming serialization support. But it is not very difficult to emulate it using DOM. The idea is to construct a DOM document fragment when a chunk of your data is ready to be serialized, write it out using the DOMWriter API, and free it once done. When you have another chunk ready, repeat the above steps. The result is an application that uses only a fraction of the memory that would be required to create the complete document.
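Something along these lines (an untested sketch against Xerces-C++ 3.x, where the DOMWriter role is played by DOMLSSerializer; the record/value element names are made up, and in a real program you would also emit the enclosing document element's start and end tags yourself around the chunks):

    #include <xercesc/util/PlatformUtils.hpp>
    #include <xercesc/util/XMLString.hpp>
    #include <xercesc/dom/DOM.hpp>
    #include <xercesc/framework/StdOutFormatTarget.hpp>

    using namespace xercesc;

    // Small helper: transcode a narrow string for the duration of one call.
    struct X {
        XMLCh* s;
        explicit X(const char* c) : s(XMLString::transcode(c)) {}
        ~X() { XMLString::release(&s); }
        operator const XMLCh*() const { return s; }
    };

    int main() {
        XMLPlatformUtils::Initialize();
        {
            DOMImplementation* impl =
                DOMImplementationRegistry::getDOMImplementation(X("LS"));
            DOMLSSerializer* writer = ((DOMImplementationLS*)impl)->createLSSerializer();
            DOMLSOutput*     out    = ((DOMImplementationLS*)impl)->createLSOutput();
            StdOutFormatTarget target;              // write each chunk to stdout as it is ready
            out->setByteStream(&target);

            // Build and write one small fragment at a time instead of the whole tree.
            for (int i = 0; i < 3; ++i) {           // pretend each iteration is one "chunk" of data
                DOMDocument* doc = impl->createDocument(nullptr, X("record"), nullptr);
                DOMElement* root = doc->getDocumentElement();
                DOMElement* value = doc->createElement(X("value"));
                value->appendChild(doc->createTextNode(X("42")));
                root->appendChild(value);

                writer->write(root, out);           // serialize just this fragment
                doc->release();                     // free the chunk before building the next one
            }

            out->release();
            writer->release();
        }
        XMLPlatformUtils::Terminate();
        return 0;
    }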
We use this approach in CodeSynthesis XSD, an XML data binding toolkit for C++, to be able to handle XML documents that are too big to fit into memory. In fact, we have written some helper classes that simplify all this and which you can find as part of the 'streaming' example in the examples/cxx/tree/ directory (the example code is public domain, so feel free to borrow it ;-)).

Which is the most efficient XML parser for C++?

I need to write an application that fetches element name/value (time-series data) pairs from any XML source, be it a file, a web server, or any other server. The application would consume the XML and take out the values of interest. It has to be very, very fast (let's say 50,000 events/second or more); also, the XML documents would be huge and could arrive frequently (for example 2,500 files/min, at more than 500 MB of XML data per file).
I just want to see how you experienced people think I should approach this. I am a novice who has just got started, although I can implement any solution you suggest, no matter how tough or easy.
Thank you very much.
If you use SAX parsing, your bottleneck is the I/O involved, not the XML string processing. And given your 500 MB number, I'd say you'd have to do SAX parsing instead of DOM parsing. So, anything with a SAX type interface should be just fine.
I'm a fan of Xerces, but I think you are going to have to try them out to see what has the best performance for your application. Like Warren said, you will want to use SAX processing. Realistically, if you truly need the performance, you should use a specialized XML appliance to do the processing.
I use libxml2 in our projects. It supports both SAX and DOM.
As Warren Young said, you should use SAX. You could give Expat a try.
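Expat is a plain C library, but it is easy to call from C++. A minimal, untested sketch that just counts elements while streaming through a made-up big.xml might look like this; Expat pushes start/end element events at your callbacks, so memory use does not grow with document size:

    #include <cstdio>
    #include <expat.h>

    // Called once per start tag as the parser streams through the input.
    static void XMLCALL onStart(void* userData, const XML_Char* /*name*/, const XML_Char** /*atts*/) {
        ++*static_cast<unsigned long*>(userData);
    }

    int main() {
        unsigned long elementCount = 0;
        XML_Parser parser = XML_ParserCreate(nullptr);
        XML_SetUserData(parser, &elementCount);
        XML_SetElementHandler(parser, onStart, nullptr);

        std::FILE* f = std::fopen("big.xml", "rb");   // hypothetical file name
        if (!f) { XML_ParserFree(parser); return 1; }

        char buf[8192];
        size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
            if (XML_Parse(parser, buf, static_cast<int>(n), 0) == XML_STATUS_ERROR) {
                std::fprintf(stderr, "parse error: %s\n",
                             XML_ErrorString(XML_GetErrorCode(parser)));
                break;
            }
        }
        XML_Parse(parser, nullptr, 0, 1);             // signal end of input

        std::fclose(f);
        XML_ParserFree(parser);
        std::printf("%lu elements\n", elementCount);
        return 0;
    }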