Suppose you have a lot of source code (like 50GB+) in popular languages (Java, C, C++, etc).
The project needs are:
compressing source code to reduce disk use and disk I/O
indexing it in such a way that a particular source file can be extracted from the compressed source without decompressing the whole thing
compression time for the whole codebase is not important
search and retrieval time (and memory use when searching and retrieving) are important
This SO answer contains potential answers: What are the lesser known but useful data structures?
However, this is just a list of potentials - I do not know how those structures actually evaluate against requirements listed above.
Question: what are data structures (and their implementations) that would perform well according to the aforementioned requirements?

The main data-structure used for searching is the inverted list. Fortunately, you don't need to implement it yourself. Lucene is a widely used search tool which works with inverted lists internally.
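To make the idea concrete, here is a toy inverted list in Python. It is only a sketch of the data structure, not how Lucene stores its index; the names build_inverted_index and search are made up for illustration, and the tokenizer is a deliberately crude identifier regex.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased identifier to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[A-Za-z_]\w*", text):
            index[term.lower()].add(doc_id)
    return index


def search(index, *terms):
    """Return the documents that contain every one of the given terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    "a.c": "int parse_header(void);",
    "b.java": "class Parser { int parse; }",
}
index = build_inverted_index(docs)
print(search(index, "int", "class"))
```

A lookup touches only the posting sets of the query terms, which is why search time does not grow with the total size of the code base in the way a linear scan would.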
Using Lucene you can create a document with multiple fields. The idea is that some of these fields will be searchable with standard keyword-type queries.
I've implemented a source code search utility which I'll now describe briefly in the following paragraphs. The entire source code itself is stored as a non-indexable field named "code" (you can modify the source to store a compressed version).
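If you do go the route of storing compressed source, one simple way to keep single-file retrieval cheap is to compress each file on its own and remember where its blob starts. The sketch below is an assumption about how that could look, not what YASOCS does; pack and extract are hypothetical helper names, and zlib is used only because it ships with Python.

```python
import io
import zlib


def pack(files, archive):
    """Append each file as its own zlib blob; return name -> (offset, length)."""
    index = {}
    for name, data in files.items():
        blob = zlib.compress(data, 9)
        index[name] = (archive.tell(), len(blob))
        archive.write(blob)
    return index


def extract(archive, index, name):
    """Seek to one blob and decompress only that blob."""
    offset, length = index[name]
    archive.seek(offset)
    return zlib.decompress(archive.read(length))


files = {
    "a.c": b"int main(void) { return 0; }\n" * 50,
    "B.java": b"class B {}\n" * 50,
}
archive = io.BytesIO()  # stands in for a file opened in binary mode
index = pack(files, archive)
print(extract(archive, index, "B.java")[:11])
```

The trade-off is that compressing files independently cannot exploit redundancy between files, so the archive will usually be larger than one solid compressed stream.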
For the retrieval part, note that the keywords that you are going to use for the search could be names of functions, classes, packages or variables. They could also be words from the comments and so on. In my implementation, I extracted this information using a Java abstract syntax tree (AST). You could do the same for other languages as well by making use of an appropriate parser to construct an AST.
Another possibility is the query-by-example (QBE) paradigm, where you could use a small snippet of code to search for approximately similar snippets in your indexed code-base. This is particularly helpful for detecting source-code reuse and plagiarism, the main purpose for which I developed the tool.
The project page is here. I call it YASOCS (Yet Another SOurce Code Searcher).
The search is very fast since it uses the inverted list. You could also use Luke (an open source Lucene index visualizer) to "see" the index yourself and execute test queries using the interface.


Sharing data structures between perl and cpp

I have a perl script which generates a very large data structure (which starts life as an array of array references). This is then written to a text file using some weird home-brew serialisation scheme.
The data from the text file is stored as the value in a key-value store db.
A C++ program then retrieves the data and deserializes it (into a hashmap, although I can be flexible on how this data is structured).
What I'm interested in is finding if there are any good ways of sharing a data structure between perl and c++ (something like Storable, but that is meant for perl->perl not perl->c++). The current method is a headache to maintain, and may not have the best performance.
The most important factors are speed of deserialisation, and the size of the serialized structure in that order. Anyone know of something that might do the trick?
Storable is one way to dump and load perl data structures. It's handy in that it's part of core and easy to use, but I wouldn't actually recommend it for general usage.
But for multi-platform (and language) portability, it's far better to use a standard data representation. Which you choose is probably a matter of what sort of data you're holding in your structure, but core contenders are:
JSON - good for arrays and hashes (key-value).
YAML - Excellent for 'config file' style data (but extends in ways similar to JSON)
And if you must, XML - but bear in mind that XML is designed for documents-with-metadata, and so IMO isn't suitable for most of the applications it's used for.
As standards, they've got documented formatting and parsers are widely available. And implementing your own isn't too hard, if that's the route you want to go. Just make sure you follow the spec and you're good.
Note that because XML and JSON (and I think YAML?) are recursive, you can parse them as a stream rather than as a standalone object. (Trap, process and discard as you hit 'close brackets' in JSON, or 'close tags' in XML.)
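As a rough illustration of the JSON route, here is a sketch in Python (used here only for brevity; the same layout can be written from perl and read from C++ with any JSON library). It writes one JSON array per line so the reader can process and discard one record at a time. Treating the first element of each inner array as the hashmap key is an assumption about the data, and dump_rows and load_as_map are made-up names.

```python
import io
import json


def dump_rows(rows, out):
    """Write each inner array as one compact JSON document per line."""
    for row in rows:
        out.write(json.dumps(row, separators=(",", ":")) + "\n")


def load_as_map(src):
    """Read line by line; the first element becomes the key (an assumption)."""
    table = {}
    for line in src:
        key, *values = json.loads(line)
        table[key] = values
    return table


rows = [["alice", 1, 2], ["bob", 3]]
buf = io.StringIO()  # stands in for the text file or the key-value store value
dump_rows(rows, buf)
buf.seek(0)
print(load_as_map(buf))
```

Because each line is a complete document, the reader never needs to hold more than one record's worth of parsed JSON in memory.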
I like Perl, and I also like C/C++. To make the best of both, I wrote a GitHub project to solve this issue.
Please see:
A short example is here:
Int("a") ; // a= 1024

Design patterns for aggregating heterogeneous tabular data

I'm working on some C++ code that integrates information from several dozen csv files. They all contain some time-stamped record data I want to extract, but the representation is somewhat different in each file. The differences between representations go beyond different column orderings and column names - for example, what's one row with multiple columns in one file may be multiple rows in a different file.
So I need some custom handling for each file to put together a unified data structure that includes the necessary information from all the files. My question is whether there's a preferred code pattern to keep the complexity manageable and the code elegant? Or if there's a good case study I should examine to see how this sort of complexity has been handled in the past.
(I realize something like this might be easier in a scripting language like perl, but the project is in C++ for now. Also, my question is more regarding whether there's a code pattern to deal with this - so the answer doesn't have to be too language specific.)
There are several phrases in your question that stick out to me: custom handling for each file, representation is somewhat different, complexity manageable. Since you will need a different parsing algorithm for each CSV format and (from what I can tell) want to keep the parsing mechanism loosely coupled, I would recommend the strategy pattern.
The strategy pattern will decouple the parsing mechanism from the users of the data contained in the CSV file. The users of the data have no interest in what format the CSV file is in; they are only interested in the information within that file, which makes the strategy pattern an excellent choice. If there are similarities between your parsing mechanisms, you can use the template method and strategy patterns together to reduce duplication and take advantage of inheritance.
By using the strategy pattern, you can then extract strategy creation into a factory method or abstract factory as you see fit, further decoupling clients from the parsing method.
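A minimal sketch of the strategy-plus-factory idea, written in Python to keep it short (in C++ each strategy would typically be a class implementing a common parser interface). The two file layouts, the column names and the names parse_wide, parse_long, PARSERS and load are all invented for illustration.

```python
import csv
import io
from datetime import datetime


def parse_wide(text):
    """Strategy for a layout with one row per timestamp and one column per metric."""
    for row in csv.DictReader(io.StringIO(text)):
        ts = datetime.fromisoformat(row["timestamp"])
        yield (ts, "temp", float(row["temp"]))
        yield (ts, "humidity", float(row["humidity"]))


def parse_long(text):
    """Strategy for a layout with one row per (timestamp, metric) pair."""
    for row in csv.DictReader(io.StringIO(text)):
        yield (datetime.fromisoformat(row["when"]), row["metric"], float(row["value"]))


# Factory: callers name a layout and never touch a concrete parser.
PARSERS = {"wide": parse_wide, "long": parse_long}


def load(kind, text):
    """Return unified (timestamp, metric, value) records for any known layout."""
    return list(PARSERS[kind](text))


wide = "timestamp,temp,humidity\n2020-01-01T00:00:00,1.5,40\n"
long_ = (
    "when,metric,value\n"
    "2020-01-01T00:00:00,temp,1.5\n"
    "2020-01-01T00:00:00,humidity,40\n"
)
print(load("wide", wide) == load("long", long_))
```

Supporting a new file layout then means writing one more parser and registering it, with no change to the code that consumes the unified records.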
I am not quite sure what you want to do with the different files. If the idea is to use them like database tables and you have some keys with attached information scattered in multiple files, you might want to have a look at something like MapReduce, where you build up part of the information from each file first and aggregate the information sharing the same key in a second step.
As for data structures, it depends on the layout of your files. I would probably have a dedicated reader for each file type which would store the information in dedicated data structures representing the information in the file. You could attach a key to each piece of information and use a reduce operation to merge all the fragments sharing the same key and aggregate them in a proxy structure.
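The reduce step can be quite small. The following Python sketch assumes each reader yields (key, fields) pairs, where fields is a dict holding whatever that file knows about the key; that shape, and the name merge_by_key, are assumptions made for the example.

```python
from collections import defaultdict


def merge_by_key(*fragment_streams):
    """Fold the fragments from every reader into one record per key."""
    merged = defaultdict(dict)
    for stream in fragment_streams:
        for key, fields in stream:
            merged[key].update(fields)  # later fragments overwrite equal field names
    return dict(merged)


from_file_a = [("2020-01-01", {"temp": 1.5})]
from_file_b = [("2020-01-01", {"humidity": 40}), ("2020-01-02", {"humidity": 42})]
print(merge_by_key(from_file_a, from_file_b))
```

If two files can disagree about the same field, the update call is the place to put a real conflict policy instead of "last one wins".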
On the other hand, if the idea is to build identical objects from different serialization methods (i.e. the different files are independent but represent the same type of data with a different layout), without knowing in advance which serialization method has been employed, I am afraid the only solution left is to brute-force the deserialization. You can have a set of readers, one for each input type, and try each in turn: if one fails to parse the file, the next one starts, and so on, until you either find the appropriate reader or discover a new file format. I don't think there is any pattern covering this.
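The brute-force loop might look like the Python sketch below. The two readers and their column names are hypothetical, and "failure" is taken to mean a KeyError or ValueError raised while parsing, which is an assumption about how the readers report a mismatch.

```python
import csv
import io


def read_wide(text):
    """Hypothetical reader for files with 'timestamp' and 'temp' columns."""
    return [(r["timestamp"], float(r["temp"])) for r in csv.DictReader(io.StringIO(text))]


def read_long(text):
    """Hypothetical reader for files with 'when' and 'value' columns."""
    return [(r["when"], float(r["value"])) for r in csv.DictReader(io.StringIO(text))]


def first_matching_reader(text, readers):
    """Try each reader in order; return the name and result of the first that parses."""
    for reader in readers:
        try:
            return reader.__name__, reader(text)
        except (KeyError, ValueError):
            continue  # wrong layout for this reader, try the next one
    raise ValueError("no reader understands this input")


name, rows = first_matching_reader("when,value\n2020-01-01,3.5\n", [read_wide, read_long])
print(name, rows)
```

Note that a file with a header but no data rows would be accepted by the first reader here, so a real version would also want to validate the header explicitly.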

Is there an application to find identical parts in different files?

I have a legacy HTML website I need to add some features to. Just looking at it, I noticed there are many "common" parts in each HTML file - footer, some script blocks, header, etc. I would like to move all these pieces into separate files (and include them using SSI for now) - that will make the project much easier to understand. However, there are some blocks which look similar but are a bit different (different class names, for example). So straightforward cut/paste will not work - I will have to carefully examine each piece I remove. And I do not want to do that - there are too many files. I'm wondering if there is an application which can compare a bunch of files and find identical blocks (not necessarily present in ALL files).
You want a clone detector.
Many clone detectors will find only identical lines of code, or identical sequences of tokens. Those won't work for you. You want a clone detector that understands how to detect parameterized clones.
Some of the token-based detectors will find clones with only very minor variations as parameters; e.g., if it is just a class name which is different, these may work for you. Such detectors often produce unstructured sequences of clones; the following is a clone from the perspective of a token-based detector:
} void foo(
void bar(
To avoid such clones, token detectors generally insist on very long sequences of tokens, which means they can miss modest size but interesting clones.
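To see why short token windows give such results, here is a deliberately naive token-based matcher in Python. It is a sketch of the general technique, not any particular tool: identifiers are replaced by a placeholder so that renamed code still matches, and two inputs "share a clone" if they have a window of k normalized tokens in common. The tokenizer, the keyword list and the function names are assumptions.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"void", "int", "return", "if", "else", "for", "while"}


def normalize(code):
    """Tokenize and replace every non-keyword identifier with the placeholder ID."""
    out = []
    for tok in TOKEN.findall(code):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return out


def shared_windows(a, b, k):
    """Return the set of k-token windows that occur in both inputs."""
    ta, tb = normalize(a), normalize(b)
    wa = {tuple(ta[i:i + k]) for i in range(len(ta) - k + 1)}
    wb = {tuple(tb[i:i + k]) for i in range(len(tb) - k + 1)}
    return wa & wb


a = "int x = 1; } void foo(int a) { return a; }"
b = "y++; } void bar(int b) { return b; }"
print(("}", "void", "ID", "(") in shared_windows(a, b, 4))
```

With k = 4 the window that straddles the end of one function and the start of the next is reported as a match, which is exactly the kind of unstructured clone described above; raising k suppresses it at the cost of missing smaller clones.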
Our CloneDR will find parameterized clones in which the parameter may be a complex structure. It does that by parsing the code of interest and comparing the abstract syntax trees, which represent the essential code minus all the layout and whitespace. Where the trees or sequences of trees are different, it can suggest a parameter representing the entire subtree (e.g., expression, HTML tag group, presence/absence of attributes, etc.). Because it operates on trees, it CANNOT propose the kind of clone above. That in turn means it can find modest size clones that make sense, as well as large ones.
CloneDR operates from precise language descriptions, to produce clones that precisely match language structures. There is a version specifically for HTML.
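The general idea of turning a differing subtree into a parameter can be shown with a few lines of Python. This is only an illustration of tree comparison under simplifying assumptions, not CloneDR's algorithm: trees are non-empty tuples of the form (label, child, ...), leaves are plain strings, and generalize is a made-up name.

```python
def generalize(a, b, params):
    """Return the common shape of two trees; each differing subtree becomes ?N."""
    if a == b:
        return a
    same_shape = (
        isinstance(a, tuple) and isinstance(b, tuple)
        and a[0] == b[0] and len(a) == len(b)
    )
    if same_shape:
        # Same node label and arity: keep the node and compare children pairwise.
        return (a[0],) + tuple(generalize(x, y, params) for x, y in zip(a[1:], b[1:]))
    # Otherwise the whole subtree differs and is recorded as one parameter.
    params.append((a, b))
    return f"?{len(params)}"


footer_1 = ("div", ("attr", "class", "footer"), ("p", "Copyright"))
footer_2 = ("div", ("attr", "class", "site-footer"), ("p", "Copyright"))
params = []
template = generalize(footer_1, footer_2, params)
print(template, params)
```

The resulting template is what you would move into the shared include file, with the recorded parameters telling you what each page has to supply.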
(I'm the architect; you can see my technical paper on CloneDR at the Wikipedia page.)

How should I document a Lua API/object model written in C++ code?

I am working on documenting a new and expanded Lua API for the game Bitfighter. Our Lua object model is a subset of the C++ object model, and the methods exposed to Lua that I need to document are a subset of the methods available in C++. I want to document only the items relevant to Lua, and ignore the rest.
For example, the object BfObject is the root of all the Lua objects, but is itself in the middle of the C++ object tree. BfObject has about 40 C++ methods, of which about 10 are relevant to Lua scripters. I wish to have our documentation show BfObject as the root object, and show only those 10 relevant methods. We would also need to show its children objects in a way that made the inheritance of methods clear.
For the moment we can assume that all the code is written in C++.
One idea would be to somehow mark the objects we want to document in a way that a system such as doxygen would know what to look at and ignore the rest. Another would be to preprocess the C++ code in such a way as to delete all the non-relevant bits, and document what remains with something like doxygen. (I actually got pretty far with this approach using luadoc, but could not find a way to make luadoc show object hierarchy.)
One thing that might prove helpful is that every Lua object class is registered in a consistent manner, along with its parent class.
There are a growing number of games out there that use Lua for scripting, and many of them have decent documentation. Does anyone have a good suggestion on how to produce it?
PS To clarify, I'm happy to use any tool that will do the job -- doxygen and luadoc are just examples that I am somewhat familiar with.
I have found a solution, which, while not ideal, works pretty well. I cobbled together a Perl script which rips through all the Bitfighter source code and produces a second set of "fake" source that contains only the elements I want. I can then run this secondary source through Doxygen and get a result that is 95% of what I'm looking for.
I'm declaring victory.
One advantage of this approach is that I can document the code in a "natural" way, and don't need to worry about marking what's in and what's out. The script is smart enough to figure it out from the code structure.
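The actual script is written in Perl; the following Python sketch only illustrates the general idea of deriving "fake" source from a consistent registration convention. The macro name REGISTER_LUA_SUBCLASS and the function fake_source are assumptions made for the example, not a description of the Bitfighter code.

```python
import re

# Hypothetical registration macro: REGISTER_LUA_SUBCLASS(Class, Parent);
REGISTER = re.compile(r"REGISTER_LUA_SUBCLASS\(\s*(\w+)\s*,\s*(\w+)\s*\)")


def fake_source(cpp_text):
    """Emit one stub class per registered Lua class, keeping only the hierarchy."""
    lines = []
    for cls, parent in REGISTER.findall(cpp_text):
        lines.append(f"class {cls} : public {parent} {{}};")
    return "\n".join(lines)


cpp = "REGISTER_LUA_SUBCLASS(Teleporter, BfObject);\nvoid unrelated();"
print(fake_source(cpp))
```

Feeding stubs like these to Doxygen gives it a class tree that contains only the Lua-visible classes, with nothing from the rest of the C++ code.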
If anyone is interested, the Perl script is available in the Bitfighter source archive. It is only about 80% complete and is missing a few very important items (such as properly displaying function args), but the structure is there, and I am satisfied the process will work. The script will improve with time.
The (very preliminary) results of the process can be seen online. The templates have hardly been modified, so it has a very "stock" look, but it shows that things more-or-less work.
Since some commenters have suggested that it is impossible to generate good documentation with Doxygen, I should note that almost none of our inline docs have been added yet. To get a sense of what they will look like, see the Teleporter class. It's not super good, but I think it does refute the notion that Doxygen always produces useless docs.
My major regret at this point is that my solution is really a one-off and does not address what I think is a growing need in the community. Perhaps at some point we'll standardize on a way of merging C++ and Lua and the task of creating a generalized documentation tool will be more manageable.
PS To see what the markup in the original source files looks like, search them for #luaclass.
Either exclude by namespace (or by class) the parts of your C++ code that are not Lua bindings, using
EXCLUDE_SYMBOLS = myhier_cpp::*
in the doxygen config file, or cherry-pick what to exclude by using
/// @cond
class aaa { /* ... */ };
/// @endcond
in your C++ code.
I personally think that separating by namespace is better since it reflects the separation in code + documentation, which leads to a namespace based scheme for separation of pure c++ from lua bindings.
Separating via exclusion is probably the most targeted approach but that would involve an extra tool to parse the code, mark up relevant lua parts and add the exclusion to the code. (Additionally you could also render special info like graphs separately with this markup and add them via an Image to your documentation, at least that's easy to do with Doxygen.). Since there has to be some kind of indication of lua code, the markup is probably not too difficult to derive.
Another solution is to use LDoc. It also allows you to write C++ comments, which will be parsed by LDoc and included into the documentation.
An advantage is that you can use the same tool to document your lua code as well. A drawback is that the project seems to be unmaintained. It may also not be possible to document complex object hierarchies, like the ones the questioner mentioned.
I forked it myself for some small adjustments regarding c++. Have a look here.

XML library optimized for big XML with memory constraints

I need to handle big XML files, but I want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints: it must never use more than, say, 300 MB of RAM.
Is there a library that allows me not to keep all the DOM in memory, and parse the XML on the go, while I traverse the DOM?
I know you can do that with call-back based approach, but I don't want that. I want to have my cake and eat it too. I want to use the DOM API, but to parse each element lazily, so that existing code that use the DOM API won't have to change.
There are two possible approaches I thought of for this problem:
Parse the XML lazily, so that each call to getChildren() parses the next bit of XML.
Parse the entire XML tree, but cache whatever you're not using right now on the disk.
Both approaches are acceptable. Is there an existing solution?
I'm looking for a native solution, but I'd be interested in hearing about libraries in other languages.
It sounds like what you want is something similar to the Streaming API for XML (StAX).
While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
There are a number of implementations linked on the Wikipedia page for StAX, most of which are for Java, but there are a couple for C++ too: Ambiera irrXML and Llamagraphics LlamaXML.
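Since libraries in other languages were also asked about: Python's standard library has a pull parser, xml.dom.pulldom, that comes fairly close to the "lazy DOM" idea. You pull events one at a time and only build a DOM subtree for the elements you decide to expand. The sketch below assumes a made-up document layout (item elements with a price attribute and a title child); expensive_titles is a hypothetical name.

```python
from xml.dom import pulldom


def expensive_titles(xml_text, threshold):
    """Stream through the document, expanding only items above the threshold."""
    titles = []
    events = pulldom.parseString(xml_text)  # pulldom.parse() reads from a file instead
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            if int(node.getAttribute("price")) > threshold:
                events.expandNode(node)  # build a DOM for this subtree only
                title = node.getElementsByTagName("title")[0]
                titles.append(title.firstChild.data)
    return titles


xml_text = (
    "<sales>"
    '<item price="10"><title>Pen</title></item>'
    '<item price="99"><title>Lamp</title></item>'
    "</sales>"
)
print(expensive_titles(xml_text, 50))
```

Inside an expanded node the ordinary DOM API works, so code that operates on one subtree at a time needs little change; code that navigates freely across the whole document still does not fit this model.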
edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.
"I want to use the DOM API, but to parse each element lazily, so that existing code that use the DOM API won't have to change."
You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.
XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.
I suppose you could do something where you build a table of elements, with file offsets pointing to where that element is in the file. But at that point, you've already read and parsed the file more or less. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.
How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion. Also, as already mentioned, there is STX (though implementations are not especially mature).
Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
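For a sense of what a hand-coded streaming transformation looks like, here is a small SAX filter, sketched in Python rather than C++ to keep it short. It renames one element and passes everything else straight through to a writer, without ever building a tree. The class RenameFilter and the function rename_elements are made up for the example; the in-memory strings stand in for the input and output files.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameFilter(XMLFilterBase):
    """Forward every SAX event unchanged, except for renaming one element."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self.old, self.new = old, new

    def startElement(self, name, attrs):
        super().startElement(self.new if name == self.old else name, attrs)

    def endElement(self, name):
        super().endElement(self.new if name == self.old else name)


def rename_elements(xml_text, old, new):
    """Stream the input through the filter and serialize the events again."""
    out = io.StringIO()
    filt = RenameFilter(xml.sax.make_parser(), old, new)
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.StringIO(xml_text))
    return out.getvalue()


result = rename_elements("<a><b>x</b><c/></a>", "b", "d")
print(result)
```

A rename is about the easiest case there is. Anything that needs to look ahead or reorder content forces you to buffer events yourself, which is where the hard work mentioned above comes in.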
Google for "streaming XML transformation"