Good Option for XML Edit/Replace - C++

I have a huge (100k+ lines, 5MB+) XML file which acts as a database for my C++ application. The structure of the XML is quite straightforward; for example, it has chunks of:
<foo>
<bar prop="true"/>
<baz>blah</baz>
</foo>
The nesting of tags is several levels deep and there are many items with multiple properties. What is a good way to find and replace chunks like this in such a file? For example, assume that the above section is repeated a few dozen times and that in each chunk the value of the tag <baz> is different. I'd like to make edits such as:
Setting all the values contained in tag <baz> to a given value.
Removing chunks containing certain values.
Etc.
So far, I've learnt of the following methods for accomplishing this:
Find/Replace: A no-brainer, trivial solution and also my last fall-back. This approach, IMHO, is the most time-consuming, error-prone and painful method. The absolute last resort.
RegExes: Use regular expressions to match blocks of interest and edit them using replacement expressions, kinda like this blog entry: http://blogs.msdn.com/b/vseditor/archive/2004/08/12/213770.aspx. But I feel this would be error-prone, and there could be a bunch of missed items if the regex is not exactly right the first time around.
Parser & Save: Whip up a quick program to parse the XML using Xerces or the XML DOM interfaces (or some other XML library), read the XML in, manipulate it as desired and save it back to disk. Again, this approach is a slow process, but once it's up and running it is easy to make modifications, and it is more flexible than regexes.
Are there any better ways to deal with this?
(EDIT: Thanks for all the "redo it to use a DB" suggestions. I know it's a huge mess, but by "better ways to deal with this" I meant the "find/replace" part.)

If you don't want to put the entire document in memory, I would read it using a SAX parser. As you read, you append the transformed document to a second (or temp) file. I think it could be pretty fast, and it would keep a very small memory footprint.
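If it helps, here is a minimal sketch of that approach using Expat (a parser recommended elsewhere on this page). The file names and NEW_BAZ_VALUE are placeholders, and it assumes the plain element/attribute/text structure from the question; comments, CDATA sections and the XML declaration are not preserved, and error handling is omitted.

// Stream input.xml through Expat, forcing every <baz> text value to
// NEW_BAZ_VALUE while copying everything else to output.xml.
// Empty elements come out as <bar prop="true"></bar>, which is equivalent.
#include <expat.h>
#include <cstdio>
#include <cstring>
#include <string>

static FILE* g_out;
static bool g_inBaz = false;
static const char* NEW_BAZ_VALUE = "42";   // placeholder replacement value

// Re-escape text and attribute values for serialization.
static std::string escape(const char* s, int len) {
    std::string r;
    for (int i = 0; i < len; ++i) {
        switch (s[i]) {
            case '&':  r += "&amp;";  break;
            case '<':  r += "&lt;";   break;
            case '>':  r += "&gt;";   break;
            case '"':  r += "&quot;"; break;
            default:   r += s[i];
        }
    }
    return r;
}

static void XMLCALL onStart(void*, const XML_Char* name, const XML_Char** atts) {
    fprintf(g_out, "<%s", name);
    for (int i = 0; atts[i]; i += 2)
        fprintf(g_out, " %s=\"%s\"", atts[i],
                escape(atts[i + 1], (int)strlen(atts[i + 1])).c_str());
    fputc('>', g_out);
    if (strcmp(name, "baz") == 0) {        // rewrite this element's text
        g_inBaz = true;
        fputs(NEW_BAZ_VALUE, g_out);
    }
}

static void XMLCALL onEnd(void*, const XML_Char* name) {
    if (strcmp(name, "baz") == 0)
        g_inBaz = false;
    fprintf(g_out, "</%s>", name);
}

static void XMLCALL onText(void*, const XML_Char* s, int len) {
    if (!g_inBaz)                          // old <baz> text is dropped
        fputs(escape(s, len).c_str(), g_out);
}

int main() {
    FILE* in = fopen("input.xml", "rb");
    g_out = fopen("output.xml", "wb");
    XML_Parser p = XML_ParserCreate(NULL);
    XML_SetElementHandler(p, onStart, onEnd);
    XML_SetCharacterDataHandler(p, onText);
    char buf[16384];
    int done = 0;
    while (!done) {
        size_t n = fread(buf, 1, sizeof buf, in);
        done = n < sizeof buf;             // EOF reached
        XML_Parse(p, buf, (int)n, done);
    }
    XML_ParserFree(p);
    fclose(in);
    fclose(g_out);
    return 0;
}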

Are there any better ways to deal with this?
If you must use XML, you could use an XML database such as BDB XML (which has C++ APIs). It supports XQuery, transactions, etc.
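A rough sketch of what the query side might look like with BDB XML's C++ API, modeled on its getting-started examples (the container name and the query are illustrative):

// Query every <baz> value from a BDB XML container using XQuery.
#include <dbxml/DbXml.hpp>
#include <iostream>

int main() {
    using namespace DbXml;
    XmlManager mgr;
    XmlContainer container = mgr.openContainer("appdata.dbxml");
    XmlQueryContext ctx = mgr.createQueryContext();
    XmlResults results =
        mgr.query("collection('appdata.dbxml')//foo/baz/string()", ctx);
    XmlValue v;
    while (results.next(v))
        std::cout << v.asString() << std::endl;
    return 0;
}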
Other options include TinyXML, which I've used with success in the past. It is quick and easy to use; not necessarily the fastest on a file of that size, but it will get the job done.

What are your actual memory constraints? 5MB is large but not enormous by current RAM standards.
I would use DOM with XPath if you can; it will be a lot less development work than SAX or other stream-based parsing. My problem with SAX is that if you are really using this as an in-memory DB, that implies random access on demand, and SAX is not well suited for that - you will have to parse and reserialize over and over, whereas once you have the DOM you can at least play with it as you like.
I echo the comments about other ways to store this in-RAM database info, too. Plenty of alternatives are better suited to this than XML. Maybe you could implement a tactical solution using DOM/XPath and investigate rip-and-replace as a longer-term project.
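To make the tactical DOM/XPath option concrete, here is a minimal sketch using pugixml (a library recommended for a later question on this page); the file name and the values are placeholders:

// Load the whole document, edit it via XPath queries, and save it back.
#include <pugixml.hpp>

int main() {
    pugi::xml_document doc;
    if (!doc.load_file("database.xml")) return 1;

    // Set every <baz> under <foo> to a given value.
    for (pugi::xpath_node hit : doc.select_nodes("//foo/baz"))
        hit.node().text().set("new value");

    // Remove whole <foo> chunks whose <baz> holds a certain value.
    for (pugi::xpath_node hit : doc.select_nodes("//foo[baz='obsolete']"))
        hit.node().parent().remove_child(hit.node());

    doc.save_file("database.xml");
    return 0;
}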

Related

Sharing data structures between Perl and C++

I have a Perl script which generates a very large data structure (which starts life as an array of array references). This is then written to a text file using a weird home-brew serialisation scheme.
The data from the text file is stored as the value in a key-value store db.
A C++ program then retrieves the data and deserializes it (into a hashmap, although I can potentially be flexible about how this data is structured).
What I'm interested in is finding out whether there are any good ways of sharing a data structure between Perl and C++ (something like Storable, but Storable is meant for Perl-to-Perl, not Perl-to-C++). The current method is a headache to maintain and may not have the best performance.
The most important factors are speed of deserialisation and the size of the serialized structure, in that order. Does anyone know of something that might do the trick?
Storable is one way to dump and load Perl data structures. I wouldn't actually recommend it for general usage, though - it's handy in that it's part of core and easy to use.
But for multi-platform (and language) portability, it's far better to use a standard data representation. Which you choose is probably a matter of what sort of data you're holding in your structure, but core contenders are:
JSON - good for arrays and hashes (key-value).
YAML - Excellent for 'config file' style data (but extends in ways similar to JSON)
And if you must, XML - but bear in mind that XML is designed for documents-with-metadata, and so IMO isn't suitable for most of the applications it's used for.
As standards, they've got documented formatting and parsers are widely available. And implementing your own isn't too hard, if that's the route you want to go. Just make sure you follow the spec and you're good.
Note that because XML and JSON (and I think YAML?) are recursive, you can parse them as a stream rather than as a standalone object. (Trap, process and discard as you hit 'close brackets' in JSON, or 'close tags' in XML.)
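For the JSON route, the C++ side can stay very small. A sketch, assuming the Perl script dumps the structure with something like JSON::XS, and using nlohmann/json as a stand-in for whichever C++ JSON library you prefer (the file name and target container are illustrative):

// Deserialize a Perl-produced JSON dump into C++ containers (C++17).
#include <nlohmann/json.hpp>
#include <fstream>
#include <string>
#include <unordered_map>

int main() {
    std::ifstream in("dump.json");          // written by the Perl side
    nlohmann::json j = nlohmann::json::parse(in);

    // A hash of strings maps naturally onto an unordered_map.
    std::unordered_map<std::string, std::string> map;
    for (auto& [key, value] : j.items())
        map[key] = value.get<std::string>();
    return 0;
}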
This is an easy job. I like Perl, and I also like C/C++. To make the best of both, I wrote a GitHub project to solve this issue.
Please see:
https://github.com/tlqtangok/perlcpp
A short example is here:
P_eval("$a=2;$a=$a**10;");
Int("a"); // a = 1024

XML library optimized for big XML with memory constraints

I need to handle big XML files, but I only want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints. We must never use more than, say, 300 MB of RAM.
Is there a library that allows me not to keep the whole DOM in memory, but instead to parse the XML on the go while I traverse the DOM?
I know you can do that with a callback-based approach, but I don't want that. I want to have my cake and eat it too: use the DOM API, but parse each element lazily, so that existing code that uses the DOM API won't have to change.
There are two possible approaches I thought of for this problem:
Parse the XML lazily: each call to getChildren() parses the next bit of XML.
Parse the entire XML tree, but cache whatever you're not using right now on disk.
Either of these approaches would be acceptable; is there an existing solution?
I'm looking for a native solution, but I'd also be interested in hearing about libraries in other languages.
It sounds like what you want is something similar to the Streaming API for XML (StAX).
While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
There are a number of implementations linked on the Wikipedia page for StAX, most of which are for Java, but there are a couple for C++ too - Ambiera's irrXML and Llamagraphics' LlamaXML.
edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this XML.com introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.
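To give a feel for the pull style, here is a rough sketch using Ambiera's irrXML, modeled on its documented usage; the "record" element and "id" attribute are made-up placeholders:

// Walk a document with irrXML's pull interface: you ask for the next node
// instead of receiving callbacks.
#include <irrXML.h>
#include <cstdio>
#include <cstring>

int main() {
    using namespace irr::io;
    IrrXMLReader* xml = createIrrXMLReader("data.xml");
    while (xml && xml->read()) {
        switch (xml->getNodeType()) {
        case EXN_ELEMENT:
            if (!strcmp(xml->getNodeName(), "record")) {
                const char* id = xml->getAttributeValue("id");
                printf("record id=%s\n", id ? id : "(none)");
            }
            break;
        case EXN_TEXT:
            printf("text: %s\n", xml->getNodeData());
            break;
        default:
            break;
        }
    }
    delete xml;
    return 0;
}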
I want to use the DOM API, but to parse each element lazily, so that existing code that use the DOM API won't have to change.
You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.
XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.
I suppose you could do something where you build a table of elements, with file offsets pointing to where that element is in the file. But at that point, you've already read and parsed the file more or less. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.
How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion: see http://www.saxonica.com/html/documentation/sourcedocs/streaming.html. Also, as already mentioned, there is STX (though implementations are not especially mature).
Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
Google for "streaming XML transformation"

BizTalk: XSLT versus mapping tool

We're doing a mapping process from an XML file generated by a legacy system to EDI 834/837 files. We have BizTalk 2010 and are using the Microsoft built in EDI schemas.
The EDI files are fairly complex, and the XML file we are getting is also complex, with a lot of pieces bolted on. I started going through the mapping tool, but it seemed like there was a lot of repetition that I could eliminate by running the XML file through an XSLT.
I found the following link, but I'm not happy with just one source. http://blog.eliasen.dk/2009/07/08/CustomXSLTScriptingFunctoidOrBuiltinFunctoidsAQuestionAboutReligion.aspx
So, are there any other advantages of using the mapping tool over just building a custom XSLT?
My experience with BizTalk maps is that things that are very simple to do with XSLT can be very complex with maps.
For good counter-examples of BizTalk maps, look at the book "Pro Mapping in BizTalk Server 2009". The book has some examples of very complex things you can achieve with BizTalk maps, but the downside to it is that in fact they have hidden all the complexity in scripting functoids. Therefore, the maps are not visual at all anymore (they don't even have links between nodes to provide at least hints to deduce what the map is doing).
XSLT can be more visual than a map, since you can see the resulting XML in the XSLT (keep in mind that "text" does not imply "not visual" - if you are transforming between text formats, then a natural way to visualize the transformation is by looking at text)
BizTalk maps can be used for very simple mappings, where you are essentially copying a set of properties from one structure to another structure with the same properties. However, as soon as you have to map a structure to another different structure, you quickly get something that's hard to write AND hard to read/understand.
Not really, I prefer XSLT too. It's easier to document (using comments in the source) and therefore to maintain. However, keep in mind that in BizTalk 2006 R2 you could not import external XSLTs, which reduces your options for reuse. I have no idea whether this has changed in subsequent versions of BizTalk; that's for you to find out, and perhaps let us all know...
Not really an answer, more a sharing of experience.
In my team we've had discussions on this issue. The argument for maps was that they are understood by most colleagues (maps are touched on in every basic BizTalk training), while XSLT is not.
I've personally worked with XSLT for a long time, since before I started working with BizTalk, and I find the mapper tool very... unintuitive. Every connection I make raises more questions than it gives me comfort in knowing what the result is. What happens when the source node is nil, not present, or repeating? What happens when the target node is defined as minOccurs=2? What does the table mapping functoid do exactly? What does the table value extract functoid do when a value is not found? How do I create a node with an autonumbering sequence, and how do I relate other created nodes to those nodes using the generated number?
Working with XSLT gives me the control back; I know exactly what happens.
XSLT maps have the added value of being text-based, which works well with branching and merging in source control, and allows us to add comments in the sources. Ever tried to merge changes to a map from two different branches?
The end result is that we now prefer XSLT for mapping, but not every developer is fluent in XSLT. That requires some training.
One last tip: invest in unit-test tooling for your maps. Find an open-source toolkit, or write some plumbing to test your maps yourself. Most BizTalk artifacts are perfectly testable, even when it doesn't seem that way, with the possible exception of orchestrations (which you should use only as a last resort anyway).
IMO:
Benefits of XSLT
You get better DRY by reusing mapping functionality using XSLT apply + call templates and custom script functions (e.g. C# script) in the same map. Unfortunately, AFAIK <xsl:include> doesn't work, so you will need to copy-paste to get reuse across multiple map XSLT files.
XSLT native call templates tend to be more performant than C# script (which is how most of the functoids are implemented anyhow)
You can use the XSLT debugger in Visual Studio.
And, to emphasize ckarras' point: for complex maps, XSLT is actually easier to understand than a visual spider web.
Benefits of Visual Map
Productivity for trivial maps, e.g. where all elements have exactly the same names and types and can be mapped at the root level, or where you need a dummy map with hard-coded output element values.
And I guess the initial hurdle for XSLT may be quite high.
As someone with experience in both BizTalk and another GUI-based mapping tool (BridgeGate), I can say that for the non-programmer these applications' mapping interfaces contain solutions to most problems. When they fall short, they offer a back door to a more code-based solution in the form of a scripting functoid. So while XSLT is certainly an alternative, I find that those who prefer it are often those more comfortable writing code.
My experience specifically with 837P and 837I files was with the other mapping tool (BridgeGate), and it WAS arduous--but that was mainly the fault of the complexity of the file. What I CAN say, and what is not being mentioned, is that making later changes to the process to accommodate client change requests WAS much easier in the GUI-based maps; I can only imagine what it would be like to dive into an XSLT big enough to handle 837 transformations and touch every node involved in a change request. You know how big an 837 is, and how complex the looping can be. Keep that in mind when making your choice.
I don't envy your task, but know the satisfaction when you complete it will make it all worthwhile. Good luck!

A lightweight XML parser efficient for large files?

I need to parse potentially huge XML files, so I guess this rules out DOM parsers.
Is there any good lightweight SAX parser for C++, comparable to TinyXML in footprint?
The structure of the XML is very simple; no advanced things like namespaces or DTDs are needed. Just elements, attributes and CDATA.
I know about Xerces, but its sheer size of over 50 MB gives me shivers.
Thanks!
If you are using C, then you can use libxml from the GNOME project. You can choose between DOM and SAX interfaces to your document, plus lots of additional features that have been developed over the years. If you really want C++, then you can use libxml++, which is a C++ OO wrapper around libxml.
The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.
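For the huge-file case specifically, libxml's xmlReader interface keeps only the current node in memory. A minimal sketch (the file name is a placeholder; compile with xml2-config --cflags --libs):

// Stream through a huge file one node at a time with libxml's xmlReader.
#include <libxml/xmlreader.h>
#include <cstdio>

int main() {
    xmlTextReaderPtr reader = xmlReaderForFile("huge.xml", NULL, 0);
    if (!reader) return 1;
    while (xmlTextReaderRead(reader) == 1) {
        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT)
            printf("element: %s (depth %d)\n",
                   (const char*)xmlTextReaderConstName(reader),
                   xmlTextReaderDepth(reader));
    }
    xmlFreeTextReader(reader);
    xmlCleanupParser();
    return 0;
}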
I like Expat
http://expat.sourceforge.net/
It is C based but there are several C++ wrappers around to help.
RapidXML is quite a fast parser for XML written in C++.
http://sourceforge.net/projects/wsdlpull - this is a straight C++ port of the Java XmlPull API (http://www.xmlpull.org/).
I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support), but I have found it to be very fast with very little overhead. I had to make my own string and vector classes, and even with those it compiles to about 60 KB on Windows.
I think that pull parsing is a lot more intuitive than something like SAX. The code much more closely mirrors the XML document, making it easy to correlate the two.
The one downside is that it is forward-only, meaning that you need to parse the elements as they come. We have a fairly messed-up design for reading our config files: I need to parse a whole subtree, make some checks, set some defaults, then parse it again. With this parser the only real way to handle something like that is to make a copy of the state, parse with that, then continue on with the original. It still ends up being a big win in terms of resources versus our old DOM parser.
If your XML structure is very simple you can consider building a simple lexer/scanner based on lex/yacc (flex/bison) . The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.
See also the SAX2 interface in libxml
firstobject's CMarkup is a C++ class that works as a lightweight huge-file pull parser (I recommend a pull parser rather than SAX), and as a huge XML file writer too. It adds about 250 KB to your executable. When used in-memory it has 1/3 the footprint of TinyXML, by one user's report. When used on a huge file it only holds a small buffer (like 16 KB) in memory. CMarkup is currently a commercial product, so it is supported, documented, and designed to be easy to add to your project with a single cpp and h file.
The easiest way to try it out is with a script in the free firstobject XML editor such as this:
ParseHugeXmlFile()
{
    CMarkup xml;
    xml.Open( "HugeFile.xml", MDF_READFILE );
    while ( xml.FindElem("//record") )
    {
        // process record...
        str sRecordId = xml.GetAttrib( "id" );
        xml.IntoElem();
        xml.FindElem( "description" );
        str sDescription = xml.GetData();
    }
    xml.Close();
}
From the File menu, select New Program, paste this in and modify it for your elements and attributes, press F9 to run it or F10 to step through it line by line.
You can try https://github.com/thinlizzy/die-xml. It seems to be very small and easy to use.
It is a recently released, open-source C++0x XML SAX parser, and the author welcomes feedback.
It parses an input stream and generates events on callbacks compatible with std::function.
The stack machine uses finite automata as a backend, and some events (start tags and text nodes) use iterators in order to minimize buffering, making it pretty lightweight.
I'd look at tools that generate a DTD/Schema-specific parser if you want small and fast. These are very good for huge documents.
I highly recommend pugixml
pugixml is a light-weight C++ XML processing library.
"pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings."
I have tested a few XML parsers, including a few expensive ones, before choosing and using pugixml in a commercial product.
pugixml was not only the fastest parser but also had the most mature and friendly API. I highly recommend it. It is a very stable product! I started using it at version 0.8; it is now at 1.7.
The great bonus in this parser is the XPath 1.0 implementation! For any more complex tree queries, XPath is a godsend!
The DOM-like interface with rich traversal/modification capabilities is extremely useful for tackling real-life "heavy" XML files.
It is a small, fast parser. It is a good choice even for an iOS or Android app, if you do not mind linking C++ code.
Benchmarks can tell a lot. See: http://pugixml.org/benchmark.html
A few examples (for x86):
pugixml is more than 38 times faster than TinyXML,
4.1 times faster than CMarkup,
and 2.7 times faster than Expat or libxml.
For x64, pugixml is the fastest parser that I know of.
Check also the usage of the memory by your XML parser. Some parsers just gobble precious memory!

Memory-efficient XSLT Processor

I need a tool to execute XSLTs against very large XML files. To be clear, I don't need anything to design, edit, or debug the XSLTs, just execute them. The transforms that I am using are already well optimized, but the large files are causing the tool I have tried (Saxon v9.1) to run out of memory.
I found a good solution: Apache's Xalan C++. It provides a pluggable memory manager, allowing me to tune allocation based on the input and transform.
In multiple cases it is consuming ~60% less memory (I'm looking at private bytes) than the others I have tried.
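For reference, the basic Xalan-C++ flow looks roughly like the sketch below, modeled on the SimpleTransform sample that ships with the library; the pluggable MemoryManager tuning mentioned above is omitted, and the file names are placeholders:

// Minimal Xalan-C++ transform: initialize, transform file-to-file, terminate.
#include <xercesc/util/PlatformUtils.hpp>
#include <xalanc/XalanTransformer/XalanTransformer.hpp>

int main() {
    XALAN_USING_XERCES(XMLPlatformUtils)
    XALAN_USING_XALAN(XalanTransformer)

    XMLPlatformUtils::Initialize();
    XalanTransformer::initialize();
    {
        XalanTransformer transformer;
        // transform(input, stylesheet, output) returns 0 on success.
        if (transformer.transform("big-input.xml", "transform.xsl",
                                  "output.xml") != 0) {
            // transformer.getLastError() carries the error message.
        }
    }
    XalanTransformer::terminate();
    XMLPlatformUtils::Terminate();
    return 0;
}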
You may want to look into STX for streaming-based XSLT-like transformations. Alternatively, I believe StAX can integrate with XSLT nicely through the Transformer interface.
It sounds like you're sorted - but often, another potential approach is to split the data first. Obviously this only works with some transformations (i.e. where different chunks of data can be treated in isolation from the whole) - but then you can use a simple parser (rather than a DOM) to do the splitting into manageable pieces, then process each chunk separately and reassemble.
Since I'm a .NET bod, things like XmlReader can do the chunking without a DOM; I'm sure there are equivalents for every language.
Again - just for completeness.
[edit re question]
I'm not aware of any specific name; maybe Divide and Conquer.
For an example: if your data is actually a flat list of like objects, then you could simply split the first-level children - i.e. rather than having 2M rows, you split it into 10 lots of 200K rows, or 100 lots of 20K rows. I've done this lots of times before when working with bulk data (for example, uploading in chunks of data [all valid] and re-assembling at the server, so that each individual upload is small enough to be robust).
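Transposing the splitting idea into C++ (purely as an illustration; any language with a streaming reader will do), here is a sketch using libxml's xmlReader; the <rows>/<record> names, the chunk size and the file names are all made up:

// Split a flat <rows><record>...</record>...</rows> file into chunks of
// CHUNK records, so each chunk can be transformed independently and the
// results reassembled afterwards.
#include <libxml/xmlreader.h>
#include <libxml/tree.h>
#include <cstdio>
#include <cstring>

int main() {
    const int CHUNK = 20000;                 // records per output file
    xmlTextReaderPtr reader = xmlReaderForFile("rows.xml", NULL, 0);
    FILE* out = NULL;
    int count = 0, fileNo = 0;
    int ret = xmlTextReaderRead(reader);
    while (ret == 1) {
        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT &&
            !strcmp((const char*)xmlTextReaderConstName(reader), "record")) {
            if (count % CHUNK == 0) {         // rotate to a new chunk file
                if (out) { fputs("</rows>", out); fclose(out); }
                char name[64];
                snprintf(name, sizeof name, "chunk-%03d.xml", fileNo++);
                out = fopen(name, "wb");
                fputs("<rows>", out);
            }
            xmlNodePtr node = xmlTextReaderExpand(reader);  // materialize subtree
            xmlBufferPtr buf = xmlBufferCreate();
            xmlNodeDump(buf, node->doc, node, 0, 0);        // serialize record
            fputs((const char*)xmlBufferContent(buf), out);
            xmlBufferFree(buf);
            ++count;
            ret = xmlTextReaderNext(reader);  // skip past the whole subtree
        } else {
            ret = xmlTextReaderRead(reader);
        }
    }
    if (out) { fputs("</rows>", out); fclose(out); }
    xmlFreeTextReader(reader);
    return 0;
}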
For what it's worth, I suspect that for Java, Saxon is as good as it gets, if you need to use XSLT. It is quite efficient (in both CPU and memory) for larger documents, but XSLT itself essentially forces a full in-memory tree of the contents to be created and retained, except in limited cases. Saxon-SA (the for-fee version) supposedly has extensions to allow taking advantage of such "streaming" cases, so that might be worth checking out.
But the advice to split up the contents is the best: if you are dealing with independent records, just split the input using other techniques (like, use StAX! :-) )
I have found that a custom tool built to run the XSLT using earlier versions of MSXML makes it very fast, but it also consumes incredible amounts of memory and will not actually complete if the input is too large. You also lose out on some advanced XSLT functionality, as the earlier versions of MSXML don't support the full XPath feature set.
It is worth a try if your other options take too long.
That's an interesting question. XSLT could potentially be optimized for space, but I expect all but the most obscure implementations around start by parsing the source document into a DOM, which is bound to use a low multiple of the document size in memory.
Unless the stylesheet is specially designed to support a single-pass transformation, reasonable time performance would probably require parsing the source document into a disk-based hierarchical database.
I do not have an answer, though.
It appears that Saxon 9.2 may provide an answer to your problem. If your document can be transformed without using predicates (i.e. it does not reference any siblings of the current node), you may be able to use streaming XSLT.
See this link
I have not tried this myself, I am just reading about it. But I hope it works.
Are you using the Java version of Saxon, or the .NET port? You can assign more memory to the Java VM running Saxon if you are running out of memory (using the -Xmx command-line parameter).
I've also found that the .Net version of Saxon runs out of memory less easily than the Java version.
For .NET you can use the solution suggested in this Microsoft Knowledge Base article:
http://support.microsoft.com/kb/307494
// XPathDocument is a compact, read-only store designed for XSLT input
XPathDocument srcDoc = new XPathDocument(srcFile);
XslCompiledTransform myXslTransform = new XslCompiledTransform();
myXslTransform.Load(xslFile);
using (XmlWriter destDoc = XmlWriter.Create(destFile))
{
    myXslTransform.Transform(srcDoc, destDoc);
}
Take a look at Xselerator