I have a perl script which generates a very large data structure (which starts life as an array of array references). This is then written to a text file using some weird home-brew serialisation scheme.
The data from the text file is stored as the value in a key-value store db.
A C++ program then retrieves the data and deserializes it (into a hashmap, although I can potentially be flexible on how this data is structured).
What I'm interested in is finding if there are any good ways of sharing a data structure between perl and c++ (something like Storable, but that is meant for perl->perl not perl->c++). The current method is a headache to maintain, and may not have the best performance.
The most important factors are speed of deserialisation, and the size of the serialized structure in that order. Anyone know of something that might do the trick?

Storable is one way to dump and load perl data structures. It's handy in that it's part of core and easy to use, but I wouldn't actually recommend it for general usage.
But for multi-platform (and language) portability, it's far better to use a standard data representation. Which you choose is probably a matter of what sort of data you're holding in your structure, but core contenders are:
JSON - good for arrays and hashes (key-value).
YAML - Excellent for 'config file' style data (but extends in ways similar to JSON)
And if you must, XML - but bear in mind that XML is designed for documents-with-metadata, and so IMO isn't suitable for most of the applications it's used for.
As standards, they've got documented formatting and parsers are widely available. And implementing your own isn't too hard, if that's the route you want to go. Just make sure you follow the spec and you're good.
Note that because XML and JSON (and I think YAML?) are recursive, you can parse them as a stream rather than as one standalone object: trap, process and discard each element as you hit 'close brackets' in JSON or 'close tags' in XML.
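To make the "trap, process and discard" idea concrete on the C++ side, here is a minimal sketch that walks a JSON array and hands over each top-level element as soon as its closing bracket is reached. It is not a JSON parser: it only tracks nesting depth and string quoting, and it assumes well-formed input whose outermost value is an array. The names `for_each_top_level_element` and `top_level_elements` are made up for this example.

```cpp
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Feeds each top-level element of a JSON array to `on_element` as raw text,
// as soon as that element is complete. Only one element is buffered at a time.
// Assumption: well-formed input whose outermost value is an array.
inline void for_each_top_level_element(
    std::istream& in,
    const std::function<void(const std::string&)>& on_element) {
    std::string current;     // text of the element currently being read
    int depth = 0;           // 1 means "directly inside the outer array"
    bool in_string = false;  // brackets and commas inside strings don't count
    bool escaped = false;    // previous character inside a string was a backslash

    // Hand over the buffered element (trimmed), then discard it.
    auto flush = [&] {
        const auto first = current.find_first_not_of(" \t\r\n");
        if (first != std::string::npos) {
            const auto last = current.find_last_not_of(" \t\r\n");
            on_element(current.substr(first, last - first + 1));
        }
        current.clear();
    };

    char c;
    while (in.get(c)) {
        if (in_string) {
            current += c;
            if (escaped) escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"') in_string = false;
        } else if (c == '"') {
            in_string = true;
            current += c;
        } else if (c == '[' || c == '{') {
            if (depth > 0) current += c;  // the outer '[' belongs to no element
            ++depth;
        } else if (c == ']' || c == '}') {
            --depth;
            if (depth > 0) current += c;
            if (depth <= 1) flush();      // an element (or the whole array) just closed
        } else if (c == ',' && depth == 1) {
            flush();                      // end of a scalar element
        } else if (depth > 0) {
            current += c;
        }
    }
}

// Convenience wrapper for small inputs: collects the elements into a vector.
inline std::vector<std::string> top_level_elements(const std::string& json) {
    std::istringstream in(json);
    std::vector<std::string> out;
    for_each_top_level_element(in, [&](const std::string& e) { out.push_back(e); });
    return out;
}
```

In real code you would pass each element to a proper JSON library instead of keeping it as text; the point is only that memory use is bounded by the largest single element, not by the whole document.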

Design patterns for aggregating heterogeneous tabular data

I'm working on some C++ code that integrates information from several dozen CSV files. They all contain some time-stamped record data I want to extract, but the representation is somewhat different in each file. The differences between representations go beyond different column orderings and column names - for example, what's one row with multiple columns in one file may be multiple rows in a different file.
So I need some custom handling for each file to put together a unified data structure that includes the necessary information from all the files. My question is whether there's a preferred code pattern to keep the complexity manageable and the code elegant? Or if there's a good case study I should examine to see how this sort of complexity has been handled in the past.
(I realize something like this might be easier in a scripting language like perl, but the project is in C++ for now. Also, my question is more regarding whether there's a code pattern to deal with this - so the answer doesn't have to be too language specific.)
There are several phrases in your question that stick out to me: custom handling for each file, representation is somewhat different, complexity manageable. Since you will need a different parsing algorithm for each CSV format and (from what I can tell) want your parsing mechanism to be loosely coupled, I would recommend the strategy pattern.
The strategy pattern will decouple the parsing mechanism from the users of the data contained in the CSV file. The users of the data have no interest in what format the CSV file is in; they are only interested in the information within that file, which makes the strategy pattern an excellent choice. If there are similarities between your parsing mechanisms, you can use the template method and strategy patterns together to reduce duplication and take advantage of inheritance.
By using the strategy pattern, you can then extract strategy creation into a factory method or abstract factory as you see fit, further decoupling clients from the parsing method.
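A minimal sketch of how the strategy and factory pieces could fit together, assuming two made-up layouts: a "long" one with `timestamp,name,value` per row, and a "wide" one with `timestamp,temperature,pressure` per row. The `Record` fields, the class names and `make_parser` are all hypothetical, and the CSV splitting ignores quoting and header rows.

```cpp
#include <istream>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Unified record that every parser produces (hypothetical fields).
struct Record {
    std::string timestamp;
    std::string name;
    double value;
};

// Strategy interface: clients only ever see this.
class CsvParser {
public:
    virtual ~CsvParser() = default;
    virtual std::vector<Record> parse(std::istream& in) const = 0;
};

// Naive comma split: no quoting or escaping support.
inline std::vector<std::string> split_commas(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    return fields;
}

// Layout "long": one measurement per row (timestamp,name,value).
class LongFormatParser : public CsvParser {
public:
    std::vector<Record> parse(std::istream& in) const override {
        std::vector<Record> out;
        std::string line;
        while (std::getline(in, line)) {
            const auto f = split_commas(line);
            if (f.size() == 3) out.push_back({f[0], f[1], std::stod(f[2])});
        }
        return out;
    }
};

// Layout "wide": several measurements per row (timestamp,temperature,pressure).
class WideFormatParser : public CsvParser {
public:
    std::vector<Record> parse(std::istream& in) const override {
        std::vector<Record> out;
        std::string line;
        while (std::getline(in, line)) {
            const auto f = split_commas(line);
            if (f.size() == 3) {
                out.push_back({f[0], "temperature", std::stod(f[1])});
                out.push_back({f[0], "pressure", std::stod(f[2])});
            }
        }
        return out;
    }
};

// Factory method: the only place that knows which concrete strategies exist.
inline std::unique_ptr<CsvParser> make_parser(const std::string& layout) {
    if (layout == "long") return std::make_unique<LongFormatParser>();
    if (layout == "wide") return std::make_unique<WideFormatParser>();
    throw std::invalid_argument("unknown layout: " + layout);
}
```

Client code then calls `make_parser(layout)->parse(stream)` and works with `Record` only, so adding another file layout means adding one class and one line in the factory.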
I am not quite sure what you want to do with the different files. If the idea is to use them like database tables and you have some keys with attached information scattered in multiple files, you might want to have a look at something like MapReduce, where you build up part of the information from each file first and aggregate the information sharing the same key in a second step.
As for data structures, it depends on the layout of your files. I would probably have a dedicated reader for each file type which would store the information in dedicated data structures representing the information in the file. You could attach a key to each information and use a reduce operation to merge all the information fragments using the same key and aggregate them in a proxy structure.
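As a rough sketch of that reduce step: assume each dedicated reader emits small fragments tagged with a key (for example the timestamp), and a second pass merges everything that shares a key. `Fragment`, `Merged` and `reduce_by_key` are names invented for this example, and values are kept as strings to keep it short.

```cpp
#include <map>
#include <string>
#include <vector>

// One piece of information produced by a file-specific reader (hypothetical).
struct Fragment {
    std::string key;    // e.g. the timestamp shared across files
    std::string field;  // which piece of information this is
    std::string value;
};

// key -> (field -> value): the aggregated view over all files.
using Merged = std::map<std::string, std::map<std::string, std::string>>;

// Reduce step: merge all fragments that share a key.
// If two fragments carry the same key and field, the later one wins.
inline Merged reduce_by_key(const std::vector<Fragment>& fragments) {
    Merged merged;
    for (const auto& f : fragments) {
        merged[f.key][f.field] = f.value;
    }
    return merged;
}
```

Whether "later one wins" is the right conflict rule depends on your data; that is the one place where you would plug in something smarter.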
On the other hand, if the idea is to build identical objects from different serialization methods (i.e. the different files are independent but represent the same type of data with a different layout), without knowing in advance which serialization method has been employed, I am afraid the only solution left is to brute-force the deserialization. You can have a set of readers, one for each input type, and try to parse the file with each in turn: if one fails, the next one starts, and so on, until you either find the appropriate reader or discover a new file format. I don't think there is any pattern covering this.
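The brute-force approach could look roughly like this, assuming C++17 (`std::optional`) and two invented line layouts, `timestamp,value` and `value;timestamp`. Each reader returns an empty optional when the input does not match its layout; all names here are hypothetical.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct Sample {
    std::string timestamp;
    double value;
};

// A reader either recognises the input and returns a Sample, or returns nothing.
using Reader = std::function<std::optional<Sample>(const std::string&)>;

// Strict number parsing: the whole string must be a number.
inline std::optional<double> parse_double(const std::string& s) {
    try {
        std::size_t used = 0;
        const double v = std::stod(s, &used);
        if (used != s.size()) return std::nullopt;
        return v;
    } catch (const std::exception&) {
        return std::nullopt;  // not a number, or out of range
    }
}

// Hypothetical layout 1: "timestamp,value"
inline std::optional<Sample> read_comma_layout(const std::string& line) {
    const auto sep = line.find(',');
    if (sep == std::string::npos) return std::nullopt;
    const auto value = parse_double(line.substr(sep + 1));
    if (!value) return std::nullopt;
    return Sample{line.substr(0, sep), *value};
}

// Hypothetical layout 2: "value;timestamp"
inline std::optional<Sample> read_semicolon_layout(const std::string& line) {
    const auto sep = line.find(';');
    if (sep == std::string::npos) return std::nullopt;
    const auto value = parse_double(line.substr(0, sep));
    if (!value) return std::nullopt;
    return Sample{line.substr(sep + 1), *value};
}

// Try every reader in order; the first one that succeeds wins.
// An empty result means "unknown format".
inline std::optional<Sample> read_with_first_match(const std::vector<Reader>& readers,
                                                   const std::string& line) {
    for (const auto& reader : readers) {
        if (auto sample = reader(line)) return sample;
    }
    return std::nullopt;
}
```

The order of the readers matters if two layouts can both accept the same input, so put the strictest reader first.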

Tiny C++ YAML reader/writer

I'm writing an embedded C++ program, and need to add serialization/deserialization. The format should be human readable and writeable, and I would much prefer to use (a subset of) a standard format like YAML. I also prefer YAML to JSON since it is more concise.
While yaml-cpp has the exact functionality I'd like, the source code is almost 300K and would almost double my code size, which seems excessive to me just in order to add human readable serialization/deserialization.
Before I start writing my own reader/writer for a subset of YAML, I'd like to first check whether this already exists? I have not been able to find one, but would much prefer to use existing code rather than rolling my own. Are there any C or C++ YAML readers/writers out there of, say, 50K code or less? I only need functionality for the basic data structures (scalar, array, hash), not any advanced stuff.
With many thanks in advance.
The Oops library does what you are looking for. It is written for serialization using reflection and supports the YAML format as well.

XML vs YAML vs JSON for a 2D RPG [duplicate]

I can't figure out whether or not to use XML, YAML, or JSON for a C++ 2D RPG.
Here are my thoughts:
I need something which is simple to save not just player data, but environment data, such as object (x, y) coordinates; load times; dates; graphics configurations, etc.
I need something flexible, easy to use, and definitely light weight, but powerful to handle the above.
Which is the best choice? I have experience with JSON in JavaScript, but not C++. Are there any good references for parsing JSON in C++ if this is the route to go?
Honestly, if a text file seems like the simplest and most effective solution for something like this (especially if I can just write it to binary), then I'm all ears.
Feel free to provide other suggestions as well.
I would use the simplest thing that satisfies your requirements.
If you don't need hierarchical storage, then flat tabular files are so much easier to deal with than anything else. All you have to do is read lines off disk and split on tab.
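For what it's worth, "read lines and split on tab" really is only a few lines. A minimal sketch (the function names are mine, and it assumes fields never contain tabs or newlines):

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split one line on tab characters, keeping empty fields.
inline std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t tab = line.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(line.substr(start));  // last field on the line
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return fields;
}

// Read a whole tab-separated table, skipping blank lines.
inline std::vector<std::vector<std::string>> read_table(std::istream& in) {
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) rows.push_back(split_tabs(line));
    }
    return rows;
}
```

Pass it a `std::ifstream` for a file on disk, or a `std::istringstream` for testing.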
If you are looking at more of a key/value pair type of storage (as opposed to lists of things), then INI files can be reasonable. This format has a lot of flexibility, though reasoning about it can be less approachable when you start doing more complicated things than it was designed for.
If you need hierarchical storage, it's possible that JSON would be simpler. There are JSON libraries in a wide range of languages, and it sounds like you are already familiar with it.
sqlite may be another option. There be dragons in SQL, but with a nice C++ wrapper around sqlite, it can be manageable. The primary benefit would be ACID, in my opinion.
The YAML spec looks somewhat lengthy, so I'd guess that it has more kitchen sinks. Just skimming the libyaml docs, the API looks somewhat like SAX interfaces that I've used in the past. I have no first-hand experience with it, but I would be reluctant to start using it without a good reason.
XML sucks to deal with, don't opt in to it. There are lots of reasons for this. The most relevant one in my mind is that it's prone to make the code that uses it more complicated than it would be otherwise. In any system I've seen designed with XML, reasoning about the XML is more complicated than the design interests that it's trying to support. There are valid uses for it, though it's rare that another storage system wouldn't have been just as adequate.
Regardless of which one you choose, write as little code as you can managing it. You really want to write the classes your engine will use first. Then worry about serializing them. If you let your serialization influence your class design, you'll probably regret it. :)
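A tiny illustration of what I mean by keeping serialization out of the class design. `Entity`, `save` and `load` are made-up names, and the whitespace-separated text format is just a placeholder (it assumes names contain no spaces); the point is that the struct knows nothing about the file format, so you can swap the format later without touching engine code.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// The class the engine actually wants: no serialization concerns in here.
struct Entity {
    std::string name;
    int x = 0;
    int y = 0;
};

// Serialization lives in free functions next to the class, not inside it.
// Placeholder format: "name x y" on one line.
inline void save(std::ostream& out, const Entity& e) {
    out << e.name << ' ' << e.x << ' ' << e.y << '\n';
}

// Returns false if the stream does not contain a complete entity.
inline bool load(std::istream& in, Entity& e) {
    return static_cast<bool>(in >> e.name >> e.x >> e.y);
}
```

If you later move to JSON or sqlite, only `save` and `load` change.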

XML library optimized for big XML with memory constraints

I need to handle big XML files, but I want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints: we must never use more than, say, 300 MB of RAM.
Is there a library that allows me not to keep all the DOM in memory, and parse the XML on the go, while I traverse the DOM?
I know you can do that with a callback-based approach, but I don't want that. I want to have my cake and eat it too. I want to use the DOM API, but to parse each element lazily, so that existing code that uses the DOM API won't have to change.
There are two possible approaches I thought of for this problem:
Parse the XML lazily: each call to getChildren() will parse the next bit of XML.
Parse the entire XML tree, but cache whatever you're not using right now on the disk.
Either of the two approaches is acceptable. Is there an existing solution?
I'm looking for a native solution, but I'd be interested in hearing about libraries in other languages.
It sounds like what you want is something similar to the Streaming API for XML (StAX).
While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
There are a number of implementations linked on the Wikipedia page for StAX, most of which are for Java, but there are a couple for C++ too - Ambiera irrXML and Llamagraphics LlamaXML.
edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.
I want to use the DOM API, but to parse each element lazily, so that existing code that uses the DOM API won't have to change.
You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.
XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.
I suppose you could do something where you build a table of elements, with file offsets pointing to where that element is in the file. But at that point, you've already read and parsed the file more or less. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.
How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion. Also, as already mentioned, there is STX (though implementations are not especially mature).
Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
Google for "streaming XML transformation"

Good Option for XML Edit/Replace

I have a huge (100k+ lines, 5MB+) XML which acts as a database for my C++ Application. The structure of the XML is quite straight forward, for example, it has chunks of:
<bar prop="true"/>
The nesting of tags is several levels deep and there are many items with multiple properties. What is a good way to find and replace chunks of this kind of a file? For example, assume that the above section is repeated a few dozen times and in each chunk the value of the tag <baz> is different. I'd like to make edits such as:
Setting all the values contained in tag <baz> to a given value.
Remove chunks containing certain values
So far, I've learnt of the following methods for accomplishing this:
Find/Replace: A no-brainer, trivial solution and also my last fall-back. This approach, IMHO is the most time consuming, error prone and painful method. The absolute last resort.
RegExes: Use regular expressions to match blocks of interest and edit them using replacement expressions. Kinda like this blog entry: But I feel this would be error prone and there could be a bunch of missed items if the regex is not exactly right the first time around.
Parser & Save: Whip up a quick program to parse the XML using Xerces or XML DOM Interfaces (or some other XML library), read the XML in, manipulate it as desired and save back to disk. Again, this approach is a slow process, but once it's up and running, it's easy to make modifications, and it's more flexible than RegExes.
Are there any better ways to deal with this?
(EDIT: Thanks for all the "redo it to use a DB" suggestions. I know it's a huge mess, but by "better ways to deal with this" I meant the "find/replace" part.)
If you don't want to put the entire document in memory, I would read it using a SAX parser. As you read it, you append the transformed document to a second (or a temp) file. I think it could be pretty fast and have only a small memory footprint.
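To show the shape of that read-transform-append loop without tying it to a particular library, here is a hand-rolled stand-in for the SAX parser. It is not a real XML parser: it assumes the target element has no attributes and no child elements, and that the file contains no comments or CDATA sections with `<` or `>` in them. The function names are invented for this sketch; in production you would drive the same logic from a real SAX library's callbacks.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies `in` to `out`, replacing the text content of every <element>...</element>
// with `new_text`. Only the current tag is held in memory.
// Assumptions: the element has no attributes or child elements, and the input
// has no comments/CDATA containing '<' or '>'.
inline void set_element_text(std::istream& in, std::ostream& out,
                             const std::string& element, const std::string& new_text) {
    bool inside = false;  // true while between <element> and </element>
    char c;
    while (in.get(c)) {
        if (c != '<') {
            if (!inside) out.put(c);  // old text inside the element is dropped
            continue;
        }
        std::string tag;
        std::getline(in, tag, '>');   // tag text without the angle brackets
        if (tag == element) {
            out << '<' << tag << '>' << new_text;
            inside = true;
        } else if (tag == "/" + element) {
            out << '<' << tag << '>';
            inside = false;
        } else if (!inside) {
            out << '<' << tag << '>';  // every other tag is copied through unchanged
        }
    }
}

// Convenience overload for in-memory strings.
inline std::string set_element_text(const std::string& xml, const std::string& element,
                                    const std::string& new_text) {
    std::istringstream in(xml);
    std::ostringstream out;
    set_element_text(in, out, element, new_text);
    return out.str();
}
```

With two `std::fstream` objects (source and temp file) the memory use stays flat regardless of file size; afterwards you rename the temp file over the original.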
Are there any better ways to deal with this?
If you must use XML, you could use an XML database such as BDB XML (which has C++ APIs). It supports XQuery, transactions, etc.
Other options include TinyXML which I've used with success in the past. Quick and easy to use, not necessarily the fastest on a file of that size, but it will get the job done.
What are your actual memory constraints? 5MB is large but not enormous by current RAM standards.
I would use DOM with XPath if you can; it will be a lot less development work than SAX or other stream-based parsing. My problem with SAX is that if you are really using this as an in-memory DB, that implies random access on demand, and SAX is not well-suited for that - you will have to parse and reserialize over and over, whereas once you have the DOM at least you can play with it as you like.
I echo the comments about how to store in-RAM database info, too. There are plenty of alternatives that are better suited to this than XML. Maybe you could implement a tactical solution using DOM/XPath and investigate rip-and-replace as a longer-term project.