Converting from XML to a C++ Object - c++

I'm working on a C++ project and would like some input from developers with similar experience.
The task is to connect to a web service that returns its results as XML. My part starts once the XML arrives: I need to parse it and load the data into a C++ object.
My questions are as follows.
a) One way is to handcraft the whole thing, but I need to do this for hundreds of web services. I am aware there are tools for C# and Java that make this simpler.
Is there a tool/utility for C++ too?
Any suggestions would be helpful.

In the past, I've used TinyXML for my XML parsing needs. My parsing code operated under the assumption that all XML input conforms to a particular XSD schema I wrote. It worked fairly well but the ripple effects were annoying - if I wanted to change the XSD, I had to update all my XML test files as well as my parsing code. While it's not so bad in the case of parsing one schema, I'd hate to have to do it for hundreds of them.
I'm not sure what the common solution is, but CodeSynthesis XSD sounds pretty promising. I haven't used it, but it appears that it generates a data layer, a parser and serialisation code for you. Could save you a lot of time.

If you're asking if there's a way to dynamically create an object representation of an XML data stream (such that you can access it like topLevel.subObject.value), it's not possible. C++ is a statically-typed language, which means all object types need to be defined at compile time. The best you could do is something like: xmlData.getSubObject("objectName").getValue().
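To make the run-time lookup style concrete, here is a minimal sketch. XmlNode, addSubObject and the choice to throw on an unknown name are all hypothetical, made up for illustration and not taken from any particular library; in practice a parser would build the tree from the XML instead of you calling addSubObject by hand.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical dynamic node: children are found by name at run time,
// so the compiler cannot check that "objectName" actually exists.
class XmlNode {
public:
    explicit XmlNode(std::string value = "") : value_(std::move(value)) {}

    // Adds (or replaces) a named child and returns a reference to it.
    XmlNode& addSubObject(const std::string& name, const std::string& value = "") {
        assert(!name.empty());  // element names are never empty
        std::unique_ptr<XmlNode>& slot = children_[name];
        slot.reset(new XmlNode(value));
        return *slot;
    }

    // Run-time lookup; an unknown name is reported with an exception.
    const XmlNode& getSubObject(const std::string& name) const {
        auto it = children_.find(name);
        if (it == children_.end())
            throw std::out_of_range("no such sub-object: " + name);
        return *it->second;
    }

    const std::string& getValue() const { return value_; }

private:
    std::string value_;
    std::map<std::string, std::unique_ptr<XmlNode>> children_;
};
```

With this, root.getSubObject("subObject").getSubObject("value").getValue() is the closest equivalent of topLevel.subObject.value, and a misspelled name only shows up when the code runs.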
As for toolsets for parsing into something usable dynamically (as per my later example), there are several. For Windows, for example, you could use the "built-in" MSXML objects. There's nothing in the base C++ libraries to do so, however, as far as I am aware.
Hope that helps.

Related

XML library optimized for big XML with memory constraints

I need to handle big XML files, but I only want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints. We must never use more than, say, 300 MB of RAM.
Is there a library that allows me not to keep the whole DOM in memory, and to parse the XML on the go while I traverse the DOM?
I know you can do that with a callback-based approach, but I don't want that. I want to have my cake and eat it too. I want to use the DOM API, but to parse each element lazily, so that existing code that uses the DOM API won't have to change.
There are two possible approaches I thought of for this problem:
Parse the XML lazily: each call to getChildren() will parse the next bit of XML.
Parse the entire XML tree, but cache whatever you're not using right now on the disk.
Both approaches are acceptable. Is there an existing solution?
I'm looking for a native solution, but I'd be interested in hearing about libraries in other languages.
It sounds like what you want is something similar to the Streaming API for XML (StAX).
While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
There are a number of implementations linked on the Wikipedia page for StAX, most of which are for Java, but there are a couple for C++ too - Ambiera irrXML and Llamagraphics LlamaXML.
edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this XML.com introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.
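To show what the pull style looks like in C++, here is a rough sketch. PullReader, Event and collectText are hypothetical names and this is not the API of irrXML or LlamaXML; the reader is deliberately tiny and ignores attributes, comments, CDATA and entities. The point is only that the caller drives the parse and that nothing but the current token is held in memory.

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this out
#include <string>
#include <vector>

enum class Event { StartElement, EndElement, Text, EndOfDocument };

// Hypothetical minimal pull reader: the caller asks for the next event.
class PullReader {
public:
    explicit PullReader(std::istream& in) : in_(in) {}

    Event next() {
        const int eof = std::char_traits<char>::eof();
        if (in_.peek() == eof) return Event::EndOfDocument;
        if (in_.peek() == '<') {
            in_.get();                                   // consume '<'
            const bool closing = (in_.peek() == '/');
            if (closing) in_.get();                      // consume '/'
            std::getline(in_, token_, '>');              // element name
            return closing ? Event::EndElement : Event::StartElement;
        }
        token_.clear();                                  // character data up to the next tag
        while (in_.peek() != eof && in_.peek() != '<')
            token_.push_back(static_cast<char>(in_.get()));
        return Event::Text;
    }

    // Element name for start/end events, character data for text events.
    const std::string& token() const { return token_; }

private:
    std::istream& in_;
    std::string token_;
};

// Example consumer: collects the text of every element called `name`.
std::vector<std::string> collectText(std::istream& in, const std::string& name) {
    assert(!name.empty());
    PullReader reader(in);
    std::vector<std::string> out;
    bool inside = false;
    for (Event e = reader.next(); e != Event::EndOfDocument; e = reader.next()) {
        if (e == Event::StartElement) inside = (reader.token() == name);
        else if (e == Event::EndElement) inside = false;
        else if (inside) out.push_back(reader.token());
    }
    return out;
}
```

Calling collectText on a std::ifstream (or a std::istringstream for a quick test) walks the document front to back once, much like repeated getChildren() calls would, without ever building a tree.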
I want to use the DOM API, but to parse each element lazily, so that existing code that uses the DOM API won't have to change.
You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.
XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.
I suppose you could do something where you build a table of elements, with file offsets pointing to where that element is in the file. But at that point, you've already read and parsed the file more or less. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
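For what it's worth, the offset table could be sketched roughly as below. indexElements is a hypothetical helper, and it assumes plain tags with no comments or CDATA sections containing '<'. Note that building the table still reads the whole input once, which is exactly the cost described above.

```cpp
#include <cassert>
#include <cctype>
#include <ios>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this out
#include <string>
#include <vector>

// Records the stream offset of the '<' of every start tag called `name`.
// Open files in binary mode so that the offsets match what seekg expects.
std::vector<std::streamoff> indexElements(std::istream& in, const std::string& name) {
    assert(!name.empty());
    std::vector<std::streamoff> offsets;
    std::streamoff pos = 0;  // offset of the next character to be read
    char c;
    while (in.get(c)) {
        const std::streamoff start = pos++;
        if (c != '<') continue;
        std::string tag;     // stays empty for closing tags such as </record>
        while (in.get(c)) {
            ++pos;
            if (c == '>' || c == '/' || std::isspace(static_cast<unsigned char>(c))) break;
            tag.push_back(c);
        }
        if (tag == name) offsets.push_back(start);
    }
    return offsets;
}
```

To revisit an element later, call in.clear() followed by in.seekg(offsets[i]) and parse from there. Only the offsets stay in memory, so this pays off mainly when the elements themselves are large.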
Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.
How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion: see http://www.saxonica.com/html/documentation/sourcedocs/streaming.html. Also, as already mentioned, there is STX (though implementations are not especially mature).
Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
Google for "streaming XML transformation"

BizTalk: XSLT versus mapping tool

We're doing a mapping process from an XML file generated by a legacy system to EDI 834/837 files. We have BizTalk 2010 and are using the Microsoft built-in EDI schemas.
The EDI files are fairly complex, and the XML file we are getting is also complex, with a lot of pieces bolted on. I started going through the mapping tool, but it seemed like there was a lot of repetition that I could eliminate by running the XML file through an XSLT.
I found the following link, but I'm not happy with just one source. http://blog.eliasen.dk/2009/07/08/CustomXSLTScriptingFunctoidOrBuiltinFunctoidsAQuestionAboutReligion.aspx
So, are there any other advantages to using the mapping tool over just building a custom XSLT?
My experience with BizTalk maps is that things that are very simple to do with XSLT can be very complex with maps.
For good counter-examples of BizTalk maps, look at the book "Pro Mapping in BizTalk Server 2009". The book has some examples of very complex things you can achieve with BizTalk maps, but the downside to it is that in fact they have hidden all the complexity in scripting functoids. Therefore, the maps are not visual at all anymore (they don't even have links between nodes to provide at least hints to deduce what the map is doing).
XSLT can be more visual than a map, since you can see the resulting XML in the XSLT (keep in mind that "text" does not imply "not visual" - if you are transforming between text formats, then a natural way to visualize the transformation is by looking at text)
BizTalk maps can be used for very simple mappings, where you are essentially copying a set of properties from one structure to another structure with the same properties. However, as soon as you have to map a structure to another different structure, you quickly get something that's hard to write AND hard to read/understand.
Not really, I prefer XSLT too. It's easier to document (using comments in the source) and therefore to maintain. However, keep in mind that in BizTalk 2006 R2 you could not import external XSLTs, which reduces your options for reuse. I have no idea if this has changed in subsequent versions of BizTalk, that's for you to find out and perhaps let us all know...
Not really an answer, more a sharing of experience:
In my team we've had a discussion on this issue. The argument for maps was that they are understood by most colleagues (as they are covered in every basic BizTalk training), while XSLT is not.
I've personally worked with XSLT for a long time, before I started working with BizTalk, and find the mapper tool very... unintuitive. Every connection I make raises more questions than it gives me comfort in knowing what the result is. What happens when the source node is nil, not present, or repeating? What happens when the target node is defined as minOccurs=2? What does the table mapping functoid do exactly? What does the table value extract functoid do when a value is not found? How do I create a node with an autonumbering sequence, and how do I relate other created nodes to those nodes by using the generated number?
Working with XSLT gives me the control back; I know exactly what happens.
XSLT maps have the added value of being text-based, which works well with branching and merging in source control, and allows us to add comments in the sources. Ever tried to merge changes to a map from two different branches?
End result is that we now prefer XSLT for mapping, but not every developer is fluent in XSLT. That requires some training.
One last tip: invest in unit test tooling for your maps. Find an open source toolkit, or write some plumbing to test your maps yourself. Most BizTalk artifacts are perfectly testable, even when it doesn't seem that way, with possible exception for orchestrations (which you should use as a last resort only anyway).
IMO:
Benefits of XSLT
You get better DRY by reusing mapping functionality using XSLT apply + call templates and custom script functions (e.g. C# script) in the same map. Unfortunately AFAIK <xsl:include> doesn't work, so you will need to copy-paste to get reuse across multiple map XSLT files.
XSLT native call templates tend to be more performant than C# script (which is how most of the functoids are implemented anyhow)
You can use the XSLT debugger in Visual Studio.
And to emphasize ckarras' point: for complex maps, XSLT is actually easier to understand than a visual spider web.
Benefits of Visual Map
Productivity for trivial maps, e.g. where all elements are exactly the same name and type and can be mapped at the root level, or if you need a dummy map with hard coded output element values.
And I guess the hurdle rate for XSLT may be quite high.
As someone with experience in both BizTalk and another GUI-based mapping tool (BridgeGate), I can say that for the non-programmer these applications solve most problems through their mapping interface. When they fall short, they offer a back door to a more code-based solution in the form of a scripting functoid. So while XSLT is certainly an alternative, I find that those who prefer it are usually more comfortable writing code than those who do not.
My experience specifically with 837P and 837I files was with the prior mapping tool (BridgeGate), and it WAS arduous--but that was mainly the fault of the complexity of the file. What I CAN say, and what is not being mentioned, is that later changes to the process to accommodate client change requests WERE much easier in the GUI-based maps; I can only imagine what it would have been like to dive into an XSLT big enough to handle 837 transformations and make changes that touch every node involved in a change request. You know how big an 837 is, and how complex the looping can be. Keep that in mind when making your choice.
I don't envy your task, but know the satisfaction when you complete it will make it all worthwhile. Good luck!

XML Representation of C++ Objects

I'm trying to create a message validation program and would like to create easily modifiable rules that apply to certain message types. Due to the risk of the rules changing I've decided to define these validation rules external to the object code.
I've created a basic interface that defines a rule and am wondering what the best way to store this simple data would be. I was leaning towards XML but it seems like it might be too heavy.
Each rule would only need a very small set of data (i.e. type of rule, value, applicable mask, etc).
Does anyone know of a good resource that I could look at that would perform a similar functionality. I'd rather not dig too deep into XML on a problem that seems to barely need a subset of the functionality I see in most of the examples I bump into.
If I can find a concise example to examine I would be able to decide on whether or not to just go with a flat file.
Thanks in advance for your input!
Personally, for small, easily modifiable XML, I find TinyXML to be an excellent library. You can make each class understand its own format, so your object hierarchy is represented directly in the XML.
However, if you don't think you need XML, you might want to go with a lighter storage format like YAML. I find it is much easier to understand the underlying data, modify it and extend functionality.
(Also, boost::serialization has an XML archive, but it isn't what I'd call easily modifiable)
The simplest is to use a flat file designed to be easy to parse using the C++ >> operator. Just simple tokens separated by whitespace.
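As a rough sketch of that idea, assume one rule per line in the form "type value mask", for example "range 100 0xFF". The Rule struct, its field names and this line format are assumptions based on the fields mentioned in the question, not an established format.

```cpp
#include <ios>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this out
#include <string>
#include <vector>

// Hypothetical rule record holding the fields the question mentions.
struct Rule {
    std::string type;
    int value = 0;
    unsigned mask = 0;
};

// Reads one whitespace-separated rule; the mask is written in hex (e.g. 0xFF).
std::istream& operator>>(std::istream& in, Rule& r) {
    return in >> r.type >> r.value >> std::hex >> r.mask >> std::dec;
}

// Reads rules until the stream ends or a malformed rule is encountered.
std::vector<Rule> loadRules(std::istream& in) {
    std::vector<Rule> rules;
    Rule r;
    while (in >> r) rules.push_back(r);
    return rules;
}
```

Pass loadRules a std::ifstream opened on the rules file. Adding a field later means touching only Rule and its operator>>, which is about as light as external rule storage gets.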
Well, if you want your rules to be human-readable, XML is the way to go, and you can interface it nicely with C++ using Xerces. If you want performance and/or small size, you could save the data in binary form using simple structs.
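The binary variant might look something like the sketch below. RuleRecord, writeRule and readRule are hypothetical names, and the approach assumes the file is written and read on the same platform, since struct padding and byte order are not portable.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>  // std::stringstream, handy for trying this out
#include <type_traits>

// Hypothetical fixed-size rule record with fixed-width fields.
struct RuleRecord {
    std::uint32_t type;
    std::int32_t value;
    std::uint32_t mask;
};

// Raw byte copies are only valid for trivially copyable types.
static_assert(std::is_trivially_copyable<RuleRecord>::value,
              "RuleRecord must be trivially copyable");

// Writes the record's bytes as they are laid out in memory.
void writeRule(std::ostream& out, const RuleRecord& r) {
    out.write(reinterpret_cast<const char*>(&r), sizeof r);
}

// Returns false when no complete record could be read.
bool readRule(std::istream& in, RuleRecord& r) {
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&r), sizeof r));
}
```

Open the files with std::ios::binary. The trade-off against XML or a flat text file is that the rules can no longer be edited by hand.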
Another way to implement this would be to define your rules in XML Schema and then have an XML Data Binding tool generate the corresponding C++ object model along with the XML parsing and serialization code. One such tool (that I happen to be working on) is CodeSynthesis XSD:
http://www.codesynthesis.com/products/xsd/
For a two-minute overview of the idea, see the "Hello World" example in the C++/Tree mapping documentation.

A lightweight XML parser efficient for large files?

I need to parse potentially huge XML files, so I guess this rules out DOM parsers.
Is there any good lightweight SAX parser for C++, comparable with TinyXML in footprint?
The structure of the XML is very simple; no advanced things like namespaces and DTDs are needed. Just elements, attributes and CDATA.
I know about Xerces, but its sheer size of over 50 MB gives me shivers.
Thanks!
If you are using C, then you can use LibXML from the Gnome project. You can choose from DOM and SAX interfaces to your document, plus lots of additional features that have been developed over years. If you really want C++, then you can use libxml++, which is a C++ OO wrapper around LibXML.
The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.
I like ExPat
http://expat.sourceforge.net/
It is C based but there are several C++ wrappers around to help.
RapidXML is quite a fast parser for XML written in C++.
http://sourceforge.net/projects/wsdlpull is a straight C++ port of the Java XmlPull API (http://www.xmlpull.org/).
I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support) but I have found it to be very fast with very little overhead. I had to make my own string and vector classes, and even with those it compiles to about 60 KB on Windows.
I think that pull parsing is a lot more intuitive than something like SAX. The code much more closely mirrors the XML document, making it easy to correlate the two.
The one downside is that it is forward only, meaning that you need to parse the elements as they come. We have a fairly messed up design for reading our config files, and I need to parse a whole subtree, make some checks, then set some defaults, then parse again. With this parser the only real way to handle something like that is to make a copy of the state, parse with that, then continue on with the original. It still ends up being a big win in terms of resources vs our old DOM parser.
If your XML structure is very simple you can consider building a simple lexer/scanner based on lex/yacc (flex/bison). The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.
See also the SAX2 interface in libxml
firstobject's CMarkup is a C++ class that works as a lightweight pull parser for huge files (I recommend a pull parser rather than SAX), and as a huge XML file writer too. It adds up to about 250 KB to your executable. When used in-memory it has 1/3 the footprint of TinyXML by one user's report. When used on a huge file it only holds a small buffer (like 16 KB) in memory. CMarkup is currently a commercial product so it is supported, documented, and designed to be easy to add to your project with a single cpp and h file.
The easiest way to try it out is with a script in the free firstobject XML editor such as this:
ParseHugeXmlFile()
{
  CMarkup xml;
  xml.Open( "HugeFile.xml", MDF_READFILE );
  while ( xml.FindElem("//record") )
  {
    // process record...
    str sRecordId = xml.GetAttrib( "id" );
    xml.IntoElem();
    xml.FindElem( "description" );
    str sDescription = xml.GetData();
  }
  xml.Close();
}
From the File menu, select New Program, paste this in and modify it for your elements and attributes, press F9 to run it or F10 to step through it line by line.
You can try https://github.com/thinlizzy/die-xml . It seems to be very small and easy to use.
It is a recently written open-source C++0x SAX parser for XML, and the author welcomes feedback.
It parses an input stream and generates events through callbacks compatible with std::function.
The stack machine uses finite automata as a backend, and some events (start tag and text nodes) use iterators in order to minimize buffering, which makes it pretty lightweight.
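To illustrate the general shape of a callback-driven (SAX-style) parse with std::function, here is a toy sketch. It is not die-xml's actual interface: SaxHandlers and parseSax are hypothetical names, and the tokenizer ignores attributes, comments, CDATA and entities.

```cpp
#include <functional>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this out
#include <string>

// Hypothetical handler set; any callable (lambda, functor, bound member) fits.
struct SaxHandlers {
    std::function<void(const std::string&)> onStartTag;
    std::function<void(const std::string&)> onEndTag;
    std::function<void(const std::string&)> onText;
};

// Push-style parse: the parser owns the loop and calls back into user code.
void parseSax(std::istream& in, const SaxHandlers& h) {
    const int eof = std::char_traits<char>::eof();
    std::string token;
    while (in.peek() != eof) {
        if (in.peek() == '<') {
            in.get();                                   // consume '<'
            const bool closing = (in.peek() == '/');
            if (closing) in.get();                      // consume '/'
            std::getline(in, token, '>');               // element name
            const auto& callback = closing ? h.onEndTag : h.onStartTag;
            if (callback) callback(token);              // unset handlers are skipped
        } else {
            token.clear();                              // character data up to the next tag
            while (in.peek() != eof && in.peek() != '<')
                token.push_back(static_cast<char>(in.get()));
            if (h.onText) h.onText(token);
        }
    }
}
```

Handlers are typically lambdas that capture whatever state they need. Only one token is buffered at a time, which is where the small memory footprint of this style comes from.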
I'd look at tools that generate a DTD/Schema-specific parser if you want small and fast. These are very good for huge documents.
I highly recommend pugixml
pugixml is a light-weight C++ XML processing library.
"pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings."
I have tested a few XML parsers including a few expensive ones before choosing and using pugixml in a commercial product.
pugixml was not only the fastest parser but also had the most mature and friendly API. I highly recommend it. It is a very stable product! I have used it since version 0.8; it is now at 1.7.
The great bonus of this parser is its XPath 1.0 implementation! For any more complex tree queries, XPath is a godsend!
The DOM-like interface with rich traversal/modification capabilities is extremely useful for tackling real-life "heavy" XML files.
It is a small, fast parser. It is a good choice even for an iOS or Android app if you do not mind linking C++ code.
Benchmarks can tell a lot. See: http://pugixml.org/benchmark.html
A few examples for x86:
pugixml is more than 38 times faster than TinyXML,
4.1 times faster than CMarkup,
2.7 times faster than expat or libxml.
For x64, pugixml is the fastest parser I know of.
Check also the usage of the memory by your XML parser. Some parsers just gobble precious memory!

Using XSLT to process business rules?

A coworker of mine mentioned that one use of XSLT is processing business rules. He mentioned that there were systems that allowed users to write business rules in some kind of text format, and then the program uses XSLT to process the text and apply the rules at run-time in the application.
Can someone shed some light on this subject for me?
Thanks!
Ouch. I wouldn't recommend that.
As the first responder said, XSL-T is for transforming XML. It's not a rules engine. I think it sounds like a misuse of the technology.
XSL-T transforms are not intuitive to write. If one of your goals for business rules is allowing business folks to update and maintain the rules, I can't imagine a more obtuse and difficult technology for doing so than XSL-T.
I suppose your colleague was referring to BPEL, the Business Process Execution Language. BPEL is an XML-based executable language for describing business processes.
Being an XML format, business rules may be generated or transformed using XSLT. However, I'm not familiar with BPEL so I don't know any system doing something like that.
Yes. The somewhat-like text format is called Excel, and users tend to do all kinds of complex things with it. The programmer then spends an awful lot of time trying to process it with every shiny new technology he can find, including XSLT, and finally decides to hand-code around all the inconsistencies. It is not fully automated, as no sane user trusts the programmer to get it right first time.
XSLT stands for XSL Transformations. It is used to change an XML document from one form to another.
As for systems, Microsoft BizTalk uses XSLT in mapping operations that map one XML document into another. Within the XSLT the user can make use of .NET code to do more complex processing.
I'm sure someone else will have a much nicer explanation but you can easily find out more by Googling XSLT tutorials. It's a huge topic.
It should be possible: write your rules in XML, the case data should also be in XML, and then a generic XSLT could be written that compares the case data against the rules and executes the relevant rules in the correct sequence.
The business users don't need to know XSLT, they just need to know how to write the rules.