Good approaches for processing XML in C++

I work on a multithreaded message processing application written in C++. The application receives XML messages, performs some action, and may publish an XML message out to another service if required.
Currently, the app works by extracting data while parsing the message and performing some action on that message in the middle of parsing. This seems like poor practice to me. I have the opportunity to create an alternative, and I'm considering approaches I can use.
One method I've thought of is to deserialize the XML data into a data object and, once that is finished, extract and process data as needed. The disadvantage would be that I'd have to build a new class for each different XML message I process (probably around 30), but that approach seems cleaner than what I have now.
Is there a better way than this? I should also mention the caveat that any code libraries developed outside the U.S. are unlikely to be approved.

Currently, the app works
Then what exactly are you fixing?
Don't fix what isn't broken.

There are typically two approaches to XML parsing: DOM and SAX. DOM builds up a document object model (like what you are proposing), whereas SAX invokes callbacks as parts of the document are visited during parsing. The free, well-known libxml2 library supports both parsing methods.
Typically, the SAX approach (i.e., using callbacks that get executed as the document is visited), uses less memory and can result in lower end-user latency, because you can start processing immediately, instead of having to wait for the entire document to have been parsed and built up.
The fact that your program is multithreaded is a red herring. As long as you always pass an object to each of your callbacks, and that object is not shared between threads, you can safely do this with multiple different objects in multiple different threads. Using a standard library such as libxml2 to do your parsing is also sensible from a reuse perspective.
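As a rough illustration of the SAX route with libxml2, a per-thread context object can carry the state for one message. This is only a sketch: MessageData, the "payload" element, and the overall shape are made up for illustration.

// Hedged sketch: libxml2 SAX callbacks writing into a per-thread object.
#include <libxml/parser.h>
#include <string>

struct MessageData {                     // one instance per thread, never shared
    std::string currentElement;
    std::string payload;
};

static void onStartElement(void* ctx, const xmlChar* name, const xmlChar**) {
    static_cast<MessageData*>(ctx)->currentElement =
        reinterpret_cast<const char*>(name);
}

static void onCharacters(void* ctx, const xmlChar* ch, int len) {
    MessageData* data = static_cast<MessageData*>(ctx);
    if (data->currentElement == "payload")            // hypothetical element
        data->payload.append(reinterpret_cast<const char*>(ch), len);
}

int parseMessage(const char* buf, int size, MessageData& out) {
    xmlSAXHandler handler = {};          // zero out the callbacks we don't use
    handler.startElement = onStartElement;
    handler.characters   = onCharacters;
    return xmlSAXUserParseMemory(&handler, &out, buf, size);
}

Each thread owns its MessageData, so no locking is needed during parsing itself.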

There were probably some design decisions that were made which led to this approach (for example, that it's faster to process using a SAX-like model than a DOM-like model: with the latter you need to parse the entire message, while with the former you can make decisions as you are called back with data).
I'd try to understand those decisions first before making any changes. Secondly, aside from keeping you busy, is there a real business need for this change? If not, move on and do something else...

Why perform transformations in middleware?

A remote system sends a message via middleware (MQ) to my application.
In the middleware, a transformation (using XSLT) is applied to this message. It is just reformatted; there is no enrichment or validation. My system is the only consumer of this transformed message, and the XSLT is maintained by my team.
The original author of all of this has long gone, and I am wondering why he thought it was a good idea to do the transformation in middleware rather than in my app. I can't see the value in moving this to middleware; it makes the system less visible and harder to maintain.
Also, I would have thought that the XSLT would be maintained by the message producer, not the consumer.
Are there any guidelines for this sort of architecture? Has he done the right thing here?
It is a bad idea to modify a message body in the middleware. This negatively affects maintainability and performance.
The only reason for doing this is to connect two incompatible endpoints without modifying them; this requires transforming the source content into a form the destination endpoint can understand.
The motivation to delegate transformation to the middleware could also be political (the endpoints are maintained by different teams, management is reluctant to touch the endpoint code, etc.).
If you are trying to create an application architecture where there is a need to serve data to different users in different formats, and perhaps receive data in different formats (think weather reports, or sports news), then creating a hub capable of doing the transformations between many different formats makes excellent sense. (Whether you call that "middleware" is up to you.) Perhaps your predecessor had this kind of architecture in mind, but it never grew big or complex enough to justify the design.
From an architectural point of view, it's a good idea to provide consumers with messages or content in a human-readable format, e.g. XML, unless there is a significant performance gain in using a binary format.
In the human-readable case, one simply has to look at the message to verify that it is correct. In the binary case, one would have to develop a utility to transform the binary message into a human-readable form. Different implementers of such a utility may not always interpret the binary form as intended, and it may turn into a finger-pointing exercise as to who or what is correct.
Also, if one is looking at what's in the queue, it is easier to make sense of it if the messages are in a human-readable format.
It doesn't hurt to start with a human-readable format and get the app working first. Then profile the app and see whether, in the big picture, the transformation routines are a significant source of delay. If yes, then go to a binary format.
It would have been preferable to have the original message producer provide messages in the transformed format directly, but they must have had good reasons for doing what they did when they did it, e.g. potentially other consumers, XSLT not existing back then, resource constraints, etc.
Read about the adaptor design pattern and you will understand the intent of the current system architecture.
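To make the intent concrete, here is a hedged sketch of the adapter idea in this thread's terms: something sits between the producer's format and the format the consumer expects, and only reformats. Every name below is hypothetical.

#include <string>

struct ProducerMessage { std::string xml; };    // as sent by the remote system
struct ConsumerMessage { std::string xml; };    // as my application expects it

class MessageAdapter {                          // plays the role of the XSLT step
public:
    ConsumerMessage adapt(const ProducerMessage& in) const {
        ConsumerMessage out;
        out.xml = reformat(in.xml);             // reformat only: no enrichment,
        return out;                             // no validation
    }
private:
    std::string reformat(const std::string& xml) const;  // hypothetical helper
};

Whether the adapter lives in the middleware or inside the consuming app is exactly the trade-off the question is about.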

XML library optimized for big XML with memory constraints

I need to handle big XML files, but I want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints; we must never use more than, say, 300 MB of RAM.
Is there a library that allows me not to keep all the DOM in memory, and parse the XML on the go, while I traverse the DOM?
I know you can do that with a callback-based approach, but I don't want that. I want to have my cake and eat it too: I want to use the DOM API, but to parse each element lazily, so that existing code that uses the DOM API won't have to change.
There are two possible approaches I thought of for this problem:
1. Parse the XML lazily; each call to getChildren() will parse the next bit of XML.
2. Parse the entire XML tree, but cache whatever you're not using right now on disk.
Either of these approaches is acceptable; is there an existing solution?
I'm looking for a native solution, but I'll be interested with hearing about libraries in other languages.
It sounds like what you want is something similar to the Streaming API for XML (StAX).
While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
There are a number of implementations linked on the Wikipedia page for StAX, most of which are for Java, but there are a couple for C++ too: Ambiera's irrXML and Llamagraphics' LlamaXML.
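For a flavor of the pull style, a minimal irrXML loop looks roughly like this; the file name and the "record" element are invented for the example.

// Hedged sketch of a pull-parsing loop with Ambiera's irrXML.
#include <irrXML.h>
#include <cstring>
#include <iostream>
using namespace irr::io;

int main() {
    IrrXMLReader* xml = createIrrXMLReader("big.xml");
    while (xml && xml->read()) {                 // pulls one node at a time
        if (xml->getNodeType() == EXN_ELEMENT &&
            std::strcmp(xml->getNodeName(), "record") == 0) {
            // getAttributeValue returns 0 when the attribute is absent
            if (const char* id = xml->getAttributeValue("id"))
                std::cout << "record " << id << '\n';
        }
    }
    delete xml;                                  // the caller owns the reader
    return 0;
}

Only the current node is held in memory, which is what makes the approach attractive under a hard RAM budget.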
edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this XML.com introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.
I want to use the DOM API, but to parse each element lazily, so that existing code that uses the DOM API won't have to change.
You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.
XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.
I suppose you could do something where you build a table of elements, with file offsets pointing to where that element is in the file. But at that point, you've already read and parsed the file more or less. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.
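To show what that rewrite might look like, here is a minimal loop over libxml2's xmlReader interface; it visits the document front to back in roughly constant memory. The file name and what is done per element are made up.

// Hedged sketch using libxml2's xmlTextReader (pull) API.
#include <libxml/xmlreader.h>
#include <cstdio>

int streamFile(const char* path) {
    xmlTextReaderPtr reader = xmlReaderForFile(path, nullptr, 0);
    if (reader == nullptr) return -1;
    int ret;
    while ((ret = xmlTextReaderRead(reader)) == 1) {
        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT)
            std::printf("element: %s\n",
                        (const char*) xmlTextReaderConstName(reader));
    }
    xmlFreeTextReader(reader);
    return ret;                 // 0 on success, -1 on a parse error
}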
How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion: see http://www.saxonica.com/html/documentation/sourcedocs/streaming.html. Also, as already mentioned, there is STX (though implementations are not especially mature).
Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
Google for "streaming XML transformation"

How to design a C++ API

I'm fairly new to advanced C++ programming techniques such as templates, but I am developing a simple API for a project I'm working on. The function or method that you call can take a long time to complete; essentially it's transferring a file over the network. It looks a bit like this:

class Client
{
    int WriteFile();
    int ReadFile();
};

But I want to have a couple of options here:

1. Call WriteFile and have it block.
2. Call WriteFileAsync and not have it block.
3. In the async version, be flexible about how I know the task is done.
4. Be able to poll the client to find out where it's up to with my current read or write operation.

I'm at a bit of a loss as to how to design this nicely the C++ way. It's a requirement to avoid using Boost, but I could use a Boost-like approach, although I looked through some of the headers and got very much confused; anything beyond basic template programming I find confusing. What I'm after is a nice way of being notified of event completion and to be able to wait for an event to complete.
My advice would be to look at the docs and tutorial for boost::asio (which you can use as part of Boost or as part of the independent Asio project, though I guess the requirement is really no external libs, not just no Boost).
Usually blocking calls are simple to define, while non-blocking operations require some callback mechanism to notify the user of the result whenever the operation completes. Again, take a look at the tutorials and docs to get an idea of a clean interface; that will be much easier to browse than the headers.
EDIT: ASIO has support for different protocols, so it might be more complex than what you need; read one of the examples to get an idea of how to use callback mechanisms.
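As a very rough sketch of the callback shape (not Asio itself, since external libs are out), the async variant can take a completion handler and run the blocking call on another thread. The signatures here are hypothetical.

#include <functional>
#include <thread>

class Client {
public:
    int WriteFile() {
        // ... blocking transfer of the file over the network ...
        return 0;                       // stub result code
    }
    // Non-blocking: invokes the user's completion handler with the result.
    // Detached for brevity; a real API would manage the thread's lifetime.
    void WriteFileAsync(std::function<void(int)> onDone) {
        std::thread([this, onDone] { onDone(WriteFile()); }).detach();
    }
};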
Regarding the use of asynchronous calls, I would suggest reading about the design of futures for C++0x.
Basically, the idea is to hand a proxy to the user, instead of a plain type. This proxy is aware of the threading and can be used to:
poll about the completion
get the result
You can also add clever mechanisms like trying to get the result for a fixed duration, or up to a fixed point in time, and abandoning (for the moment) if the task hasn't completed in time (for example, to do something else and try again later, or to simply move on and forget about it).
The new threading API of C++0x has been very cleverly designed (taking mainly after Boost.Threads) and would give you much insight as to how to design for multi-threading.
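A hedged sketch of the proxy idea using std::future (standardized in C++11, the successor to C++0x). The Client API is hypothetical, and std::async merely stands in for whatever thread mechanism the implementation would really use.

#include <chrono>
#include <future>

class Client {
public:
    int WriteFile() {
        // ... blocking transfer over the network ...
        return 0;                        // stub result code
    }
    std::future<int> WriteFileAsync() {  // the "proxy" handed to the user
        return std::async(std::launch::async, [this] { return WriteFile(); });
    }
};

int example(Client& c) {
    std::future<int> result = c.WriteFileAsync();
    // Poll with a timeout instead of blocking outright.
    while (result.wait_for(std::chrono::milliseconds(50)) !=
           std::future_status::ready) {
        // ... do other work and try again later ...
    }
    return result.get();                 // fetch the result
}

This gives both behaviors from one API: call get() immediately for the blocking version, or wait_for() to poll.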

How can I decrease complexity in library without increasing complexity elsewhere?

I am tasked with maintaining and updating a library which allows a computer to send commands to a hardware device and then receive its response. Currently the code is set up in such a way that every single possible command the device can receive is sent via its own function. Code repetition is everywhere; a DRY advocate's worst nightmare.
Obviously there is much opportunity for improvement. The problem is each command has a different payload. Currently the data that is to be the payload is passed to each command function in the form of arguments. It's difficult to consolidate functionality without pushing the complexity up to the layer that calls the library.
When a response is received from the device, its data is put into an object of a class solely responsible for holding this data; these classes do nothing else. There are hundreds of classes which do this. These objects are then used by the app layer to access the returned data.
My objectives:
Thoroughly reduce code repetition
Maintain a similar level of complexity at the application layer
Make it easier to add new commands
My idea:
Have one function to send a command and one to receive (the receiving function is automatically called when a response from the device is detected). Have a struct holding all command/response data, which will be passed to the sending function and returned by the receiving function. Since each command has a corresponding enum value, have a switch statement which sets up any command-specific data for sending.
Is my idea the best way to do it? Is there a design pattern I could use here? I've looked and looked but nothing seems to fit my needs.
Thanks in advance! (Please let me know if clarification is necessary)
This reminds me of the REST vs. SOA debate, albeit on a smaller physical scale.
If I understand you correctly, right now you have calls like
device->DoThing();
device->DoOtherThing();
and then sometimes I get a callback like
callback->DoneThing(ThingResult&);
callback->DoneOtherThing(OtherThingResult&)
I suggest that the user is the key component here. Do the current library users like the interface at the level it is designed? Is the interface consistent, even if it is large?
You seem to want to propose
device->Do(ThingAndOtherThingParameters&)
callback->Done(ThingAndOtherThingResult&)
so to have a single entry point with more complex data.
The downside from a library user's perspective may be that now I have to use a manual switch() or other type statement to tell what really happened. While the dispatching to the appropriate result callback used to be done for me, now you have made it a burden upon the library user.
Unless this bought me, as a user, some level of flexibility that I actually wanted, I would consider it a step backwards.
For your part as an implementor, one suggestion would be to go to the generic form internally, and then offer both interfaces externally. Perhaps the old specific interface could even be auto-generated somehow.
Good Luck.
Well, your question implies that there is a balance between the library's complexity and the client's. When those are the only two choices, one almost always goes with making the client's life easier. However, those are rarely really the only two choices.
Now in the text you talk about a command processing architecture where each command has a different set of data associated with it. In the olden days, this would typically be implemented with a big honking case statement in a loop, where each case called a different routine with different parameters and perhaps some setup code. Grisly. McCabe complexity analysers hate this.
These days what you can do with an OO language is use dynamic dispatch. Create a base abstract "command" class with a standard "handle()" method, and have each different command inherit from it to add their own members (to represent the different "arguments" to the different commands). Then you create a big honking array of these at startup, usually indexed by the command ID. For languages like C++ or Ada it has to be an array of pointers to "command" objects, for the dynamic dispatch to work. Then you can just call the appropriate command object for the command ID you read from the client. The big honking case statement is now handled implicitly by the dynamic dispatch.
Where you can get the big savings in this scenario is in subclassing. Do you have several commands that use the exact same parameters? Make a subclass for them, and then derive all of those commands from that subclass. Do you have several commands that have to perform the same operation on one of the parameters? Make a subclass for them with that one method implemented for that operation, and then derive all those commands from that subclass.
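A hedged sketch of that dispatch table; the command IDs, subclasses, and payload members are all invented for illustration.

#include <array>
#include <memory>

struct Command {                        // abstract base with a standard handle()
    virtual ~Command() = default;
    virtual void handle() = 0;
};

// Shared subclass for commands that carry the same parameter shape.
struct AddressedCommand : Command {
    int deviceAddress = 0;              // a common "argument" hoisted up
};

struct ResetCommand : AddressedCommand {
    void handle() override { /* build and send the reset payload */ }
};

struct StatusCommand : AddressedCommand {
    void handle() override { /* build and send the status request */ }
};

// The big honking array of pointers, indexed by command ID.
std::array<std::unique_ptr<Command>, 256> table;

void registerCommands() {
    table[0x01].reset(new ResetCommand);
    table[0x02].reset(new StatusCommand);
}

void dispatch(int commandId) {
    if (table[commandId])               // dynamic dispatch replaces the
        table[commandId]->handle();     // big case statement
}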
Your first objective should be to produce a library that decouples the higher software layers from the hardware. Users of your library shouldn't care that you have a hardware device that can execute a number of functions with different payloads. They should only care what the device does at a higher level. In this sense, it is in my opinion a good thing that every command is mapped to its own function.
My plan will be:
Identify the objects the higher data layers need to get the job done. Model the objects in C++ classes from their perspective, not from the perspective of the hardware
Define the interface of the library using the above objects
Start the implementation of the library. Perhaps an intermediate layer that maps software objects to hardware objects is necessary
There are many things you can do to reduce code repetition. You can use polymorphism: define a class with the base functionality and extend it. You can also use utility classes that implement functions needed by many commands.

What's a pattern for getting two "deep" parts of a multi-threaded program talking to each other?

I have this general problem in design, refactoring or "triage":
I have an existing multi-threaded C++ application which searches for data using a number of plugin libraries. With the current search interface, a given plugin receives a search string and a pointer to a QList object. Running on a different thread, the plugin goes out and searches various data sources (locally and on the web) and adds the objects of interest to the list. When the plugin returns, the main program, still on the separate thread, adds this data to the local data store (with further processing), guarding this insertion point using a mutex. Thus each plugin can return data asynchronously.
The Qt-based plugin library is based on message passing. There are a fair number of plugins which are already written and tested for the application, and they work fairly well.
I would like to write some more plugins and leverage the existing application.
The problem is that the new plugins will need more information from the application. They will need intermittent access to the local data store itself as they search. To get this, they would need direct or indirect access to both the hash array storing the data and the mutex which guards multiple access to the store. I assume the access would be encapsulated by adding an extra method on a "catalog" object.
I can see three ways to write these new plugins.
1. When loading a plugin, pass it a pointer to my "catalog" at the start. This becomes an extra, "invisible" interface for the new plugins. This seems quick, easy, and completely wrong according to OO, but I can't see what the future problems would be.
2. Add a method/message to the existing interface so I have a second function which could be called for the new plugin libraries; the message would pass a pointer to the catalog to the plugins. This would be easy for the plugins, but it would complicate my main code and seems generally bad.
3. Redesign the plugin interface. This seems "best" according to OO, and could have other added benefits, but would require all sorts of rewriting.
So, my questions are
A. Can anyone tell me the concrete dangers of option 1?
B. Is there a known pattern that fits this kind of problem?
Edit1:
A typical function for calling the plugin routines looks like:
void elsewhere(QString* spec)
{
    QList<CatItem> results;
    plugins->getResults(spec, &results);   // plugins is the PluginHandler
    use_list(results);
}
...
void PluginHandler::getResults(QString* spec, QList<CatItem>* results)
{
    if (plugins.count() == 0) return;      // the handler's list of plugins
    foreach(PluginInfo info, plugins) {
        if (info.loaded)
            info.obj->msg(MSG_GET_RESULTS, (void*) spec, (void*) results);
    }
}
It's repeated throughout the code. I'd rather extend it than break it.
Why is it "completely wrong according to OO"? If your plugin needs access to that object, and it doesn't violate any abstraction you want to preserve, it is the correct solution.
To me it seems like you blew your abstractions the moment you decided that your plugin needs access to the list itself. You just blew up your entire application's architecture. Are you sure you need access to the actual list itself? Why? What do you need from it? Can that information be provided in a more sensible way? One which doesn't 1) increase contention over a shared resource (and increase the risk of subtle multithreading bugs like race conditions and deadlocks), and 2) undermine the architecture of the rest of the app (which specifically preserves a separation between the list and its clients, to allow asynchronicity)?
If you think it's bad OO, then it is because of what you're fundamentally trying to do (violate the basic architecture of your application), not how you're doing it.
Well, option 1 is option 3, in the end. You are redesigning your plugin API to receive extra data from the main app.
It's a simple redesign that, as long as the "catalog" is well implemented and hides every implementation detail of your hash and mutex backing store, is not bad, and can serve the purpose well enough IMO.
Now, if the catalog leaks implementation details, then you would be better off using messages to query the store, receiving responses with the needed data.
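For illustration, a minimal sketch of such a catalog facade, reusing the question's Qt types. CatItem stands in for the app's existing item class, and every member name here is hypothetical.

#include <QHash>
#include <QList>
#include <QMutex>
#include <QMutexLocker>
#include <QString>

struct CatItem { QString text; };       // stand-in for the app's item type

class Catalog {
public:
    void insert(const QString& key, const CatItem& item) {
        QMutexLocker lock(&mutex_);     // every access guarded internally
        items_.insert(key, item);
    }
    QList<CatItem> find(const QString& key) const {
        QMutexLocker lock(&mutex_);
        return items_.values(key);      // copies out; no references leak
    }
private:
    mutable QMutex mutex_;              // never exposed to the plugins
    QHash<QString, CatItem> items_;
};

Because the mutex and hash never escape the class, plugins cannot hold the lock or a dangling reference across calls.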
Sorry, I just re-read your question 3 times and I think my answer may have been too simple.
Is your "Catalog" an independent object? If not, you could wrap it as it's own object. The Catalog should be completely safe (including threadsafe)--or better yet immutable.
With this done, it would be perfectly valid OO to pass your catalog to the new plugins. If you are worried about passing them through many layers, you can create a factory for the catalog.
Sorry if I'm still misunderstanding something, but I don't see anything wrong with this approach. If your catalog is an object outside your control, however, such as a database object or collection then you really HAVE to encapsulate it in something you can control with a nice, clean interface.
If your Catalog is used by many pieces across your program, you might look at a factory (which, at it's simplest degrades to a Singleton). Using a factory you should be able to summon your Catalog with a Catalog.getType("Clothes"); or whatever. That way you are giving out the same object to everyone who wants one without passing it around.
(this is very similar to a singleton, by the way, but coding it as a factory reminds you that there will almost certainly be more than one--also remember to allow a Catalog.setType("Clothes", ...); for testing.
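A sketch of that factory, assuming a thread-safe Catalog like the one discussed above; getType and setType simply mirror the hypothetical calls in the answer.

#include <QHash>
#include <QMutex>
#include <QMutexLocker>
#include <QSharedPointer>
#include <QString>

class Catalog { /* thread-safe, or better yet immutable, as above */ };

class CatalogFactory {
public:
    static QSharedPointer<Catalog> getType(const QString& type) {
        QMutexLocker lock(&mutex_);
        if (!catalogs_.contains(type))
            catalogs_.insert(type, QSharedPointer<Catalog>(new Catalog));
        return catalogs_.value(type);   // the same instance for everyone
    }
    static void setType(const QString& type, QSharedPointer<Catalog> c) {
        QMutexLocker lock(&mutex_);     // lets tests inject a stub catalog
        catalogs_.insert(type, c);
    }
private:
    static QMutex mutex_;
    static QHash<QString, QSharedPointer<Catalog> > catalogs_;
};

QMutex CatalogFactory::mutex_;
QHash<QString, QSharedPointer<Catalog> > CatalogFactory::catalogs_;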