Streaming xml modification to zip - c++

I am working with potentially "large" xml files where my application only cares about a very small subset of the data contained in the file. So I was hoping to avoid loading the entire xml document into DOM.
I have been successfully using Apache Xerces C++ with the SAX2 API to extract data directly from an xml file contained in a zip archive, using custom implementations of xercesc::BinInputStream and xercesc::InputSource.
However, now we want to apply modifications to a small subset of the nodes in the xml document (reading the original, and applying changes into a new xml file in a new zip archive). I was hoping to avoid loading the entire document into DOM just to modify a few nodes.
It would be nice to leverage the work I've already done with SAX2, but it appears that the SAX2 API is primarily oriented around reading documents. I could handle all SAX2 events and write the information out to the new file as they occur, but I'm having difficulty locating Xerces API functionality that would, for example, aid with handling XML entities (I really don't want to rewrite e.g. XML entity handling myself!) and other encoding issues.
I also noticed that Xerces provides a xercesc::BinOutputStream (which would appear to be what I would want to derive from in order to serialize directly to a zip archive), but I haven't found a place where I could plug such a custom output stream into the Xerces API. I also haven't been able to locate a corresponding output analogue for xercesc::InputSource.
Does Xerces-C++ provide any native functionality for writing XML documents in a streaming fashion?

Related

XML bindings for Microsoft XMLLite

I have a C++ project in which I am using Microsoft XmlLite for parsing several XML files. Now I have a new file that I need to parse and I have an XSD schema for it. I know there are many C++ XML binding tools out there, but all I have found so far require me to include yet another XML parsing library, which I would like to avoid. Hence my question: is there any open source or commercial tool that generates C++ XML bindings based on Microsoft XmlLite?
CodeSynthesis seems to be the closest tool that provides in-memory XML data binding you could integrate with XmlLite.
The C++/Tree mapping generates C++ classes that represent data types defined in XML Schema, a set of parsing functions that convert XML documents to a tree-like in-memory object model, and a set of serialization functions that convert the object model back to XML.

Is there a lightweight approach in producing XML with Xerces-C++?

This application runs on an embedded platform with low processing power and memory. I want to produce huge XML from the application. Currently I am constructing a DOM and serializing it into XML using Xerces-C++ 3.1.1. But the DOM construction takes a long time and consumes a lot of memory.
I know SAX is a lightweight approach to parsing XML compared to DOM. Similarly, is there a lightweight approach for producing XML? Of course I can produce the XML by concatenating strings, but I didn't choose that approach because I want to make sure I produce well-formed XML and sanitize the text I include in it.
What you are looking for is normally called streaming serialization, where parts of the document are written out as they become available instead of accumulating them all and writing them out at the end (which is what the DOM approach entails).
Xerces-C++ does not currently have streaming serialization support. But it is not very difficult to emulate it using DOM. The idea is to construct a DOM document fragment when a chunk of your data is ready to be serialized, write it out using the DOMWriter API (DOMLSSerializer in Xerces-C++ 3.x), and free it once done. When you have another chunk ready, repeat the above steps. The result is an application that uses only a fraction of the memory that would be required to create the complete document.
We use this approach in CodeSynthesis XSD, an XML data binding toolkit for C++, to be able to handle XML documents that are too big to fit into memory. In fact, we have written some helper classes that simplify all this and which you can find as part of the 'streaming' example in the examples/cxx/tree/ directory (the example code is public domain so feel free to borrow it ;-)).
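To make the idea of streaming serialization concrete, here is a minimal sketch that does not use Xerces-C++ at all. The class name StreamingXmlWriter and its member functions are made up for illustration; it only handles elements and text (no attributes, namespaces, name validation, or encoding conversion). It writes each piece to the stream as soon as it is available, escapes text, and keeps a stack of open elements so that tags are closed in the right order.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not a Xerces-C++ class): writes XML incrementally to
// any std::ostream and remembers which elements are still open.
class StreamingXmlWriter {
public:
    explicit StreamingXmlWriter(std::ostream& out) : out_(out) {}

    void start_element(std::string_view name) {
        out_ << '<' << name << '>';
        open_.emplace_back(name);
    }

    void text(std::string_view s) {
        assert(!open_.empty());  // text is only allowed inside an element
        out_ << escape(s);
    }

    void end_element() {
        assert(!open_.empty());  // nothing left to close otherwise
        out_ << "</" << open_.back() << '>';
        open_.pop_back();
    }

    // True once every opened element has been closed.
    bool complete() const { return open_.empty(); }

    // Replaces the characters that are special in XML content.
    static std::string escape(std::string_view s) {
        std::string r;
        r.reserve(s.size());
        for (char c : s) {
            switch (c) {
                case '&': r += "&amp;"; break;
                case '<': r += "&lt;"; break;
                case '>': r += "&gt;"; break;
                case '"': r += "&quot;"; break;
                default:  r += c;
            }
        }
        return r;
    }

private:
    std::ostream& out_;
    std::vector<std::string> open_;  // names of currently open elements
};

// Usage: each entry is written as soon as it is produced; nothing is buffered
// apart from the short stack of open element names.
std::string write_sample() {
    std::ostringstream out;
    StreamingXmlWriter w(out);
    const std::vector<std::string> messages{"a < b", "R&D"};
    w.start_element("log");
    for (const std::string& msg : messages) {
        w.start_element("entry");
        w.text(msg);
        w.end_element();
    }
    w.end_element();
    return out.str();
}
```

Memory use here depends on the nesting depth rather than on the document size, which is the property the DOM-fragment approach above is trying to approximate.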

High performance XML parsing in C++

Well, a lot of questions have been asked about parsing XML in C++ and so on...
But, instead of a generic problem, mine is very specific.
I am asking for a very efficient XML parser for C++. In particular I have a VERY VERY BIG XML file to parse.
My application must open this file and retrieve data. It must also insert new nodes and save the final result in the file again.
To do this I used, at the beginning, rapidxml, but it requires me to open the file, parse it all (all the content, because this lib has no functions to access the file directly without loading the entire tree first), then edit the tree and store the final tree in the file by overwriting it... It consumes too many resources.
Is there an XML parser that does not require me to load the entire file, but that I can use to insert, quickly, new nodes and retrieve data? Can you please indicate solutions for this problem of mine?
You want a streaming XML parser rather than what is called a DOM parser.
There are two types of streaming parsers: pull and push. A pull parser is good for quickly writing XML parsers that load data into program memory. A push parser is good for writing a program to translate one document to another (which is what you are trying to accomplish). I think, therefore, that a push parser would be best for your problem.
In order to use a push parser, you need to write what is essentially an event handler for parsing events. By "parsing event", I mean events like "start tag reached", "end tag reached", "text found", "attribute parsed", etc.
I suggest that as you read in the document, you write out the transformed document to a separate, temporary file. Thus, your XML parsing event handlers will need to be written so that they are stateful and write out the XML of the translated document incrementally.
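A stateful handler of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the XmlEvents interface is hypothetical and only loosely modelled on the callbacks that push parsers provide, attributes are left out, and the events are fired by hand where a real parser (Expat, Xerces-C++, libxml2) would call the handler. The handler keeps a stack of open elements as its state, copies every event to the output, and replaces the text of one chosen element on the way through.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event interface; real push parsers have their own equivalents.
struct XmlEvents {
    virtual ~XmlEvents() = default;
    virtual void start_element(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
    virtual void end_element(const std::string& name) = 0;
};

// Parsers hand over unescaped text, so it is escaped again on output.
static std::string escape_text(const std::string& s) {
    std::string r;
    for (char c : s) {
        switch (c) {
            case '&': r += "&amp;"; break;
            case '<': r += "&lt;"; break;
            case '>': r += "&gt;"; break;
            default:  r += c;
        }
    }
    return r;
}

// Copies the document to `out`, but drops the text directly inside elements
// named `target` and writes `replacement` just before their end tag.
class RewritingHandler : public XmlEvents {
public:
    RewritingHandler(std::ostream& out, std::string target, std::string replacement)
        : out_(out), target_(std::move(target)), replacement_(std::move(replacement)) {}

    void start_element(const std::string& name) override {
        out_ << '<' << name << '>';
        path_.push_back(name);
    }

    void characters(const std::string& text) override {
        assert(!path_.empty());
        // Text may arrive in several chunks; all chunks of the target are skipped.
        if (path_.back() != target_) out_ << escape_text(text);
    }

    void end_element(const std::string& name) override {
        assert(!path_.empty() && path_.back() == name);
        if (name == target_) out_ << escape_text(replacement_);
        out_ << "</" << name << '>';
        path_.pop_back();
    }

private:
    std::ostream& out_;
    std::string target_;
    std::string replacement_;
    std::vector<std::string> path_;  // state: the currently open elements
};

// Fires a few events by hand to show the handler at work.
std::string run_demo() {
    std::ostringstream out;
    RewritingHandler h(out, "price", "12.50");
    h.start_element("item");
    h.start_element("name");
    h.characters("Tea & cake");
    h.end_element("name");
    h.start_element("price");
    h.characters("9");
    h.characters(".99");
    h.end_element("price");
    h.end_element("item");
    return out.str();
}
```

In a real program `out` would be the temporary file, and the handler would be registered with the parser instead of being called directly.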
Three excellent push parser libraries for C++ include Expat, Xerces-C++, and libxml2.
Search for "SAX parser". They are mostly tokenizers, i.e. they emit tag by tag without building a tree.
SAX parsers are faster than DOM parsers because a DOM parser builds an in-memory representation of the entire XML document, whereas a SAX parser behaves like an event source and reports each piece of the document as it reads the file, without building a tree.
As you mentioned, Xerces is a good C++ SAX parser.
I would recommend looking into ways of breaking the XML document into smaller XML documents as that seems to be part of your problem.
Okay, here is one off the beaten track: I looked at this but haven't really used it myself. It's called asmxml. These boys claim performance bar none; the downside is that you need x86 assembler.
If you really seek a high-performance XML stream parser, then libhpxml is likely the right thing for you.
I’m convinced that no XML library exists that allows you to modify a file without loading it first. This simply isn’t possible because files don’t work that way: you cannot insert (or remove) in the middle of a file. You can only overwrite a block of identical size, or append at the end. But your request would require inserting or removing data in the middle of the file.
Reading only parts of an XML file may be possible. But writing … no way.
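What does work is what an earlier answer suggests: stream the original into a new file and add the extra bytes on the way through. Here is a minimal sketch of that, with made-up function names (copy_with_insert, append_child) and in-memory string streams standing in for the real input and temporary output files. Finding the insertion point with rfind on a string is only for the demo; with a huge file the offset would come from the parser.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>

// Copies `in` to `out`, writing `payload` after the first `offset` bytes.
// Returns false if the input is shorter than `offset`.
bool copy_with_insert(std::istream& in, std::ostream& out,
                      std::size_t offset, std::string_view payload) {
    char buf[4096];
    std::size_t remaining = offset;
    while (remaining > 0) {
        const std::size_t want = std::min(remaining, sizeof buf);
        in.read(buf, static_cast<std::streamsize>(want));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        out.write(buf, static_cast<std::streamsize>(got));
        if (got < want) return false;  // input ended before the insert point
        remaining -= got;
    }
    out << payload;
    for (;;) {  // copy the rest of the input unchanged
        in.read(buf, sizeof buf);
        const std::streamsize got = in.gcount();
        if (got <= 0) break;
        out.write(buf, got);
    }
    return true;
}

// Demo helper: inserts `node` just before the last occurrence of `closing_tag`.
std::string append_child(const std::string& xml, const std::string& closing_tag,
                         std::string_view node) {
    const std::size_t pos = xml.rfind(closing_tag);
    if (pos == std::string::npos) return xml;  // tag not found: leave unchanged
    std::istringstream in(xml);
    std::ostringstream out;
    const bool ok = copy_with_insert(in, out, pos, node);
    assert(ok);  // pos was found inside xml, so the input cannot be too short
    (void)ok;
    return out.str();
}
```

Only one buffer of 4096 bytes is held in memory at a time, but the whole file is still rewritten, which is the cost the answer above is pointing at.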
Go for template libraries as much as possible, such as Boost::property_tree, Boost::XMLParser, or POCO::XML; Folly has an XML parser in it too.
Avoid old C libraries; they are all old code designs.
Some say the QtXML module performs well on huge XML files.

Storing UTF-8 XML using Word's CustomXMLPart or any other supported way

I am writing a Word add-in which is supposed to store some of its own XML data per document using the Word object model and its CustomXMLPart. The problem I am now facing is the lack of IStream-like functionality for reading/writing XML to/from a CustomXMLPart. It only provides a BSTR interface and I am puzzled how to handle UTF-8 XML with BSTRs. To my understanding a UTF-8 XML file should really never have to undergo this sort of Unicode conversion. I am not sure what to expect as a result here.
Is there another way of using Word automation interfaces to store arbitrary custom information inside a DOCX file?
The "package" is an OPC document (Open Packaging Convention), which is basically a structured zip folder with a different extension (e.g. .pptx, .docx, .xps, etc.). You can get that file as a stream and manipulate it any which way you like, but not arbitrarily. It will not be recognized as a valid docx if you put things in the wrong places (not just XML elements, but also files in the folders inside the zip file). But if by "arbitrary" you just mean a CustomXMLPart, then that's okay.
This is a good kicker page to learn more about the Open XML SDK, which, if you're up to it, allows for somewhat easier access to the file formats than using (.NET) System.IO.Packaging or a third-party zip library. To go deeper, grab the free eBook Open XML Explained.
With the Open XML SDK (again, this can all be done without the SDK) in .NET, this is what you'll want to do: How to: Insert Custom XML to an Office Open XML Package by Using the Open XML API.

XML usage for c++ application

I have a couple of questions about XML.
Can XML be used for a normal C++ application instead of using a text file?
If so, does this method have advantages?
And finally, how can I use XML to store data? What tools are needed?
Regards.
You can use XML for storing information. It's less human-readable than a text file, but can be more easily exchanged with other systems and programming languages.
If all you need is a few text/numeric properties, stick to a property file.
If you need a mix of configuration options, and you want to use validation (can be accomplished using XML Schema), automatic modification (e.g. XSL transformations) or communicate it easily with Web Services, then XML is useful.
If you want to store binary data, XML is probably not the answer. Though you can store it in a filesystem and use the XML for the metadata (i.e. where each file is located).
Take a look at Apache Xerces-C for C++ XML code - http://xerces.apache.org/xerces-c/
XML can be parsed as a text file by your application. There are libraries available.
Advantage: the files can be exchanged with other applications more easily, especially if you provide an XML-schema file.
Storing data in XML can be done with Boost.Serialization.
It depends on the kind of data you want to read/write, but XML is generally a good way to go for storing structured and hierarchical data.
You can use libraries such as TinyXML to easily parse and write XML files in C++.
The main drawback is that XML is verbose; that's why you can also use an alternative such as JSON to store your data.