I'm working on some C++ code that integrates information from several dozen CSV files. They all contain time-stamped record data I want to extract, but the representation differs somewhat from file to file. The differences go beyond column orderings and column names - for example, what's one row with multiple columns in one file may be multiple rows in a different file.
So I need some custom handling for each file to put together a unified data structure that includes the necessary information from all the files. My question is whether there's a preferred code pattern for keeping the complexity manageable and the code elegant, or a good case study I should examine to see how this sort of complexity has been handled in the past.
(I realize something like this might be easier in a scripting language like Perl, but the project is in C++ for now. Also, my question is more about whether there's a code pattern to deal with this - so the answer doesn't have to be too language-specific.)
There are several phrases in your question that stick out to me: custom handling for each file, representation is somewhat different, complexity manageable. Given that you'll need a different variation of your parsing algorithm for each CSV format, and that (from what I can tell) you want to keep the parsing mechanism loosely coupled, I would recommend the strategy pattern.
The strategy pattern will decouple the parsing mechanism from the users of the data contained in the CSV file. The users of the data have no interest in what format the CSV file is in; they are only interested in the information within that file, which makes the strategy pattern an excellent choice. If there are similarities between your parsing mechanisms, you can use the template method and strategy patterns together to reduce duplication and take advantage of inheritance.
Once you are using the strategy pattern, you can extract strategy creation into a factory method or abstract factory as you see fit, further decoupling clients from the parsing method.
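A minimal sketch of what this might look like, assuming a unified Record type - the names here (Record, ParseStrategy, the two concrete strategies, makeStrategy) are placeholders, not anything from your code base:

#include <istream>
#include <memory>
#include <string>
#include <vector>

// Unified record type the rest of the program works with (placeholder fields).
struct Record {
    std::string timestamp;
    double value = 0.0;
};

// Strategy interface: one implementation per CSV layout.
class ParseStrategy {
public:
    virtual ~ParseStrategy() = default;
    virtual std::vector<Record> parse(std::istream& csv) const = 0;
};

// One row per record in this layout.
class WideFormatStrategy : public ParseStrategy {
public:
    std::vector<Record> parse(std::istream& csv) const override {
        std::vector<Record> out;
        // ... read one row, emit one Record ...
        return out;
    }
};

// Several rows combine into one record in this layout.
class TallFormatStrategy : public ParseStrategy {
public:
    std::vector<Record> parse(std::istream& csv) const override {
        std::vector<Record> out;
        // ... accumulate rows, emit one Record per group ...
        return out;
    }
};

// Factory keyed on the file (declaration only); clients never see the concrete strategies.
std::unique_ptr<ParseStrategy> makeStrategy(const std::string& fileName);

The client code then just asks the factory for a strategy and calls parse, regardless of which of the several dozen layouts it is dealing with.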
I am not quite sure what you want to do with the different files. If the idea is to use them like database tables and you have some keys with attached information scattered across multiple files, you might want to have a look at something like MapReduce, where you build up part of the information from each file first and aggregate the information sharing the same key in a second step.
As for data structures, it depends on the layout of your files. I would probably have a dedicated reader for each file type which would store the information in dedicated data structures representing the information in the file. You could attach a key to each piece of information and use a reduce operation to merge all the information fragments sharing the same key and aggregate them in a proxy structure.
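The reduce step could be as simple as the following sketch - Key, Fragment and Aggregate are placeholders for whatever your per-file readers actually produce:

#include <map>
#include <string>
#include <utility>
#include <vector>

using Key = std::string;                       // e.g. a record ID or timestamp
struct Fragment { /* partial info from one file */ };
struct Aggregate { std::vector<Fragment> parts; };

// "Reduce" step: group the fragments produced by the per-file readers by key.
std::map<Key, Aggregate> mergeByKey(const std::vector<std::pair<Key, Fragment>>& fragments) {
    std::map<Key, Aggregate> result;
    for (const auto& [key, fragment] : fragments)
        result[key].parts.push_back(fragment);
    return result;
}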
On the other hand, if the idea is to build identical objects from different serialization methods (i.e. the different files are independent but represent the same type of data with a different layout), without knowing in advance which serialization method has been employed, I am afraid the only solution left is to brute-force the deserialization. You can have a set of readers, one for each input type, and try to parse the file; if one fails, the next one takes over, and so on, until you either discover a new file format or find the appropriate reader. I don't think there is any pattern covering this.
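If you do go the brute-force route, a sketch might look like this - the Reader interface and Data type are assumptions, and the rewind assumes a seekable stream:

#include <istream>
#include <memory>
#include <optional>
#include <vector>

struct Data { /* the deserialized result */ };

class Reader {
public:
    virtual ~Reader() = default;
    // Returns std::nullopt if this reader does not recognize the format.
    virtual std::optional<Data> tryParse(std::istream& in) const = 0;
};

std::optional<Data> deserialize(std::istream& in,
                                const std::vector<std::unique_ptr<Reader>>& readers) {
    for (const auto& reader : readers) {
        in.clear();
        in.seekg(0);                       // rewind before each attempt
        if (auto data = reader->tryParse(in))
            return data;
    }
    return std::nullopt;                   // no reader recognized the format
}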
I'm trying to parse a file and store the different fields in some variable.
However, I'm not sure of the proper design to do this:
Should I have a class for parsing and a class for storage, or should the parser also be used as storage to hold the fields?
I also have to keep in mind that I might need to extend the parser as new fields may appear in the file.
Edit: It has been mentioned that the question was too broad. I included more details here, but I didn't want to influence the answers with what I already implemented. In my current implementation I faced some issues extending the parser, and someone pointed out that I shouldn't separate data storage and parser (because of the single responsibility principle), but for me it made sense to separate them.
It's a question of necessity. If your problem is likely to stay small, put the parsing and storage together. You'll have to have the parsed data somewhere in memory regardless, so why not right where it comes in?
The main reason you'd want to separate them is if you need to store in different ways, such as database or file, and you need some sort of interface that can deal with extensibility.
But think about that only if it looks likely to come up.
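If you do decide to separate them, the split can be as small as this sketch - the key=value file format and the names here are just for illustration:

#include <istream>
#include <string>
#include <unordered_map>

// Storage: just the fields, no parsing logic.
struct ParsedFields {
    std::unordered_map<std::string, std::string> values;
};

// Parser: fills the storage, knows nothing about how it is used later.
// New fields in the file simply become new entries; the parser does not change.
class FieldParser {
public:
    ParsedFields parse(std::istream& in) const {
        ParsedFields fields;
        std::string line;
        while (std::getline(in, line)) {
            auto pos = line.find('=');
            if (pos != std::string::npos)
                fields.values[line.substr(0, pos)] = line.substr(pos + 1);
        }
        return fields;
    }
};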
I have a Perl script which generates a very large data structure (which starts life as an array of array references). This is then written to a text file using some weird home-brew serialisation scheme.
The data from the text file is stored as the value in a key-value store db.
A C++ program then retrieves the data and deserializes it (into a hashmap, although I can potentially be flexible about how this data is structured).
What I'm interested in is finding out whether there are any good ways of sharing a data structure between Perl and C++ (something like Storable, but Storable is meant for Perl-to-Perl, not Perl-to-C++). The current method is a headache to maintain, and may not have the best performance.
The most important factors are speed of deserialisation, and the size of the serialized structure in that order. Anyone know of something that might do the trick?
Storable is one way to dump and load Perl data structures. I wouldn't actually recommend it for general usage though - it's handy in that it's part of core and easy to use.
But for multi-platform (and language) portability, it's far better to use a standard data representation. Which you choose is probably a matter of what sort of data you're holding in your structure, but core contenders are:
JSON - good for arrays and hashes (key-value).
YAML - Excellent for 'config file' style data (but extends in ways similar to JSON)
And if you must, XML - but bear in mind that XML is designed for documents-with-metadata, and so IMO isn't suitable for most of the applications it's used for.
As standards, they've got documented formatting and parsers are widely available. And implementing your own isn't too hard, if that's the route you want to go. Just make sure you follow the spec and you're good.
Note that because XML and JSON (and I think YAML?) are recursive, you can parse them as a stream, rather than as a standalone object. (Trap, process and discard as you hit 'close brackets' in JSON, or 'close tags' in XML.)
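For example, if you settle on JSON, the C++ side could look like the sketch below. The choice of nlohmann/json and the assumption that the Perl side dumps a JSON object mapping string keys to arrays of strings are mine, not from the question; for very large structures a streaming (SAX-style) parser such as RapidJSON's would fit the note above better:

#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>
#include <nlohmann/json.hpp>

// Load a JSON object of the form { "key": ["a", "b", ...], ... } into a hashmap.
std::unordered_map<std::string, std::vector<std::string>> load(const std::string& path) {
    std::ifstream in(path);
    nlohmann::json j = nlohmann::json::parse(in);

    std::unordered_map<std::string, std::vector<std::string>> result;
    for (auto& [key, value] : j.items())
        result[key] = value.get<std::vector<std::string>>();
    return result;
}

On the Perl side, a module such as JSON::XS can produce the corresponding file.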
Easy job.
I like Perl, and I also like C/C++. To make the best of both, I wrote a GitHub project to solve this issue.
Please see:
https://github.com/tlqtangok/perlcpp
A short example is here:
P_eval("$a=2;$a=$a**10;");
Int("a") ; // a= 1024
Suppose you have a lot of source code (like 50GB+) in popular languages (Java, C, C++, etc).
The project needs are:
compressing source code to reduce disk use and disk I/O
indexing it in such a way that a particular source file can be extracted from the compressed source without decompressing the whole thing
compression time for the whole codebase is not important
search and retrieval time (and memory use when searching and retrieving) are important
This SO answer contains potential answers: What are the lesser known but useful data structures?
However, this is just a list of candidates - I do not know how those structures actually stack up against the requirements listed above.
Question: what are data structures (and their implementations) that would perform well according to the aforementioned requirements?
The main data-structure used for searching is the inverted list. Fortunately, you don't need to implement it yourself. Lucene is a widely used search tool which works with inverted lists internally.
Using Lucene you can create a document with multiple fields. The idea is that some of these fields will be searchable with standard keyword-type queries.
I've implemented a source code search utility which I'll describe briefly in the following paragraphs. The entire source code itself is stored as a non-indexable field named "code" (you can modify the source to store a compressed version).
For the retrieval part, note that the keywords you are going to use for the search could be names of functions, classes, packages or variables. They could also be words from the comments and so on. In my implementation, I extracted this information from a Java abstract syntax tree (AST). You could do the same for other languages as well by making use of an appropriate parser to construct an AST.
Another possibility is the query-by-example (QBE) paradigm, where you could use a small snippet of code to search approximately similar snippets from your indexed code-base. This is particularly helpful for detecting source-code reuse and plagiarism, the main purpose for which I developed the tool.
The project page is here. I call it the YASOCS (Yet Another SOurce Code Searcher).
The search is very fast since it uses the inverted list. You could also use Luke (an open source Lucene index visualizer) to "see" the index yourself and execute test queries using the interface.
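For intuition, the inverted list at the heart of all this is just a map from term to the set of documents containing it. A bare-bones sketch (this shows the idea only, not how Lucene implements it internally):

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using DocId = int;

class InvertedIndex {
public:
    void add(DocId doc, const std::vector<std::string>& terms) {
        for (const auto& t : terms)
            index_[t].insert(doc);
    }

    // Documents containing all query terms (a simple AND query).
    std::unordered_set<DocId> search(const std::vector<std::string>& terms) const {
        std::unordered_set<DocId> result;
        bool first = true;
        for (const auto& t : terms) {
            auto it = index_.find(t);
            if (it == index_.end()) return {};
            if (first) { result = it->second; first = false; continue; }
            std::unordered_set<DocId> narrowed;
            for (DocId d : result)
                if (it->second.count(d)) narrowed.insert(d);
            result = std::move(narrowed);
        }
        return result;
    }

private:
    std::unordered_map<std::string, std::unordered_set<DocId>> index_;
};

The document IDs can then point at the compressed blobs, so only the matching files ever need to be decompressed.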
Problem domain
I'm working on a rather big application, which uses a hierarchical data model. It takes images, extracts images' features and creates analysis objects on top of these. So the basic model is like Object-(1:N)-Image_features-(1:1)-Image. But the same set of images may be used to create multiple analysis objects (with different options).
Then an object and image can have a lot of other connected objects, like the analysis object can be refined with additional data or complex conclusions (solutions) can be based on the analysis object and other data.
Current solution
This is a sketch of the solution. Stacks represent sets of objects, arrows represent pointers (i.e. image features link to their images, but not vice versa). Some parts - images, image features, additional data - may be included in multiple analysis objects (because the user wants to run analyses on different sets of objects, combined differently).
Images, features, additional data and analysis objects are stored in global storage (god-object). Solutions are stored inside analysis objects by means of composition (and contain solution features in turn).
All the entities (images, image features, analysis objects, solutions, additional data) are instances of corresponding classes (like IImage, ...). Almost all the parts are optional (i.e., we may want to discard images after we have a solution).
Current solution drawbacks
Navigating this structure is painful when you need connections like the dotted one in the sketch. If you have to display an image with a couple of solution features on top, you first have to iterate through the analysis objects to find which of them are based on this image, and then iterate through the solutions to display them.
If, to solve drawback 1, you choose to explicitly store the dotted links (i.e. the image class will have pointers to the solution features related to it), you'll put a lot of effort into maintaining the consistency of these pointers and constantly updating the links when something changes.
My idea
I'd like to build a more extensible (2) and flexible (1) data model. The first idea was to use a relational model, separating objects and their relations. And why not use an RDBMS here - SQLite seems an appropriate engine to me. Complex relations would then be accessible with simple (left) JOINs on the database (pseudocode: images JOIN images_to_image_features JOIN image_features JOIN image_features_to_objects JOIN objects JOIN solutions JOIN solution_features), followed by fetching the actual C++ objects for the solution features from global storage by ID.
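To make that concrete, the dotted link from the sketch could be resolved roughly like this with the SQLite C API - all table and column names below are invented for illustration and would have to match your actual schema:

#include <sqlite3.h>
#include <vector>

// Given an image ID, collect the IDs of the solution features that trace back to it.
std::vector<int> solutionFeaturesForImage(sqlite3* db, int imageId) {
    const char* sql =
        "SELECT sf.id "
        "FROM images i "
        "JOIN image_features f       ON f.image_id = i.id "
        "JOIN features_to_objects fo ON fo.feature_id = f.id "
        "JOIN analysis_objects o     ON o.id = fo.object_id "
        "JOIN solutions s            ON s.object_id = o.id "
        "JOIN solution_features sf   ON sf.solution_id = s.id "
        "WHERE i.id = ?;";

    std::vector<int> ids;
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK)
        return ids;
    sqlite3_bind_int(stmt, 1, imageId);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        ids.push_back(sqlite3_column_int(stmt, 0));
    sqlite3_finalize(stmt);
    return ids;  // then fetch the actual C++ objects from global storage by ID
}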
The question
So my primary question is
Is using an RDBMS an appropriate solution for the problems I described, or is it not worth it and are there better ways to organize the information in my app?
If an RDBMS is OK, I'd appreciate any advice on using an RDBMS and the relational approach to store C++ objects' relationships.
You may want to look at Semantic Web technologies, such as RDF, RDFS and OWL, which provide an alternative, extensible way of modeling the world. There are some open-source triple stores available, and some of the mainstream RDBMSs also have triple store capabilities.
In particular, take a look at Manchester University's Protege/OWL tutorial: http://owl.cs.manchester.ac.uk/tutorials/protegeowltutorial/
And if you decide this direction is worth looking at further, I can recommend "SEMANTIC WEB for the WORKING ONTOLOGIST"
Just based on the diagram, I would suggest that an RDBMS solution would indeed work. It has been years since I was a developer on an RDBMS (called RDM, of course!), but I was able to renew my knowledge and gain many valuable insights into data structures and layouts very similar to what you describe by reading the fabulous book "The Art of SQL" by Stephane Faroult. His book will go a long way toward answering your questions.
I've included a link to it on Amazon, to ensure accuracy: http://www.amazon.com/The-Art-SQL-Stephane-Faroult/dp/0596008945
You will not go wrong by reading it, even if in the end it does not solve your problem fully, because the author does such a great job of breaking down a relation in clear terms and presenting elegant solutions. The book is not a manual for SQL, but an in-depth analysis of how to think about data and how it interrelates. Check it out!
Using an RDBMS to track the links between data can be an efficient way to store and think about the analysis you are seeking, and the links are "soft" - that is, they go away when the hard objects they link are deleted. This ensures data integrity, and M. Faroult can answer what to do to ensure that remains true.
I don't recommend RDBMS based on your requirement for an extensible and flexible model.
Whenever you change your data model, you will have to change the DB schema, and that can involve more work than a change in code.
Any problems with DB queries are discovered only at runtime. This can make a lot of difference to the cost of maintenance.
I strongly recommend using standard C++ OO programming with STL.
You can make use of encapsulation to ensure any data change is done properly, with updates to related objects and indexes.
You can use STL to build highly efficient indexes on the data
You can create facades to get you the information easily, rather than having to go to multiple objects/collections. This will be one-time work
You can make unit test cases to ensure correctness (much less complicated compared to unit testing with databases)
You can make use of polymorphism to build different kinds of objects, different types of analysis etc
All very basic points, but I reckon your effort would be best utilized improving the current solution rather than looking for a DB-based solution.
http://www.boost.org/doc/libs/1_51_0/libs/multi_index/doc/index.html
"you'll put very much effort maintaining consistency of these pointers
and constantly updating the links when something changes."
With the help of Boost.MultiIndex you can create almost every kind of index on a "table". I think the quoted problem is not so serious, so the original solution is manageable.
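A sketch of what such an index might look like - the Link record and its fields are placeholders for whatever relation you need to query in both directions:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>

// One row of the "table": which solution feature relates to which image.
struct Link {
    int imageId;
    int solutionFeatureId;
};

namespace bmi = boost::multi_index;

using LinkTable = bmi::multi_index_container<
    Link,
    bmi::indexed_by<
        // Look up all links for a given image...
        bmi::hashed_non_unique<bmi::member<Link, int, &Link::imageId>>,
        // ...or all links for a given solution feature.
        bmi::hashed_non_unique<bmi::member<Link, int, &Link::solutionFeatureId>>
    >
>;

// Usage: auto range = table.get<0>().equal_range(someImageId);

Because both directions are maintained by the same container, the consistency problem quoted above largely goes away.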
If an application has to be able to open (and possibly save) the file format for the last N releases, how should the code be organized sanely so that updating the file format is easy and less error-prone? Assume the file format is in XML, and functions take in objects for export and produce objects for import.
Append a number to the end of each function name, and copy/paste it and increment the number for each new version? That's like maintaining multiple versions of version-controlled functions within source code. Perhaps do some magic at build time?
Firstly, supporting import of old versions is a lot easier than export. This is because later versions usually differ in that they support more features, so saving to an old format may well mean loss of data. Consequently, my experience has only been with supporting import of multiple versions, spanning over a decade.
XML is of course the smart solution. It is designed with this problem in mind. The key point to me is that clean code structure follows from a clean data model. Provided new versions add features and these are represented by support for additional tags, you do not really have to recode handling of existing tags at all.
Now, you could change the semantics of existing tags, which would require recoding them. Solution: don't do this if you can avoid it. When you add an attribute or tag, make sure you define its default value; then old and new data files are handled seamlessly.
So it seems to me that, with care, you should be able to avoid many cases where you really have significantly different code for handling the same fields in different file versions. Where this does occur, I guess there are "special circumstances" (that's life with software). When you design the generic solution you'll have specific use cases in mind, and such special cases may not be handled anyway.
In summary: You'll future proof most efficiently via defining the upgrade path for the data model.
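As a small illustration of the default-value point above: with a reader such as tinyxml2 (the library and the "precision" attribute are just examples I picked), an attribute added in a later format version can quietly fall back to its documented default when reading older files:

#include <tinyxml2.h>

// "precision" was added in a later file version; older files simply omit it.
int readPrecision(const tinyxml2::XMLElement* record) {
    int precision = 2;                                   // documented default
    record->QueryIntAttribute("precision", &precision);  // unchanged if the attribute is absent
    return precision;
}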
A version number is probably required.
But the best thing is to actually make a design for your XML. And make sure that the XML is structured in an intuitive and natural way. Otherwise the current organisation of your code may leak into the structure of the XML, which makes the XML harder to read for future versions of your product.
When saving enumerated values, don't write the numbers but the names of the enumerators. If some element could occur multiple times in principle, but not in your current application, design it as an array in XML anyway. Make sure the numbers you write are in a unit that is logical in the problem domain, and not whatever your application happens to use right now.
In XML written this way, it should not be hard to support legacy versions of your XML.
Edit:
If you make drastic changes, it can be helpful to implement a legacy data object that reads the legacy XML. Then you write a conversion method to convert from the old data model to the new one. This gives you a fresh start, especially if the old data model was badly designed.
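In code, that can be as simple as the sketch below - all of the types and the version cutoff are hypothetical:

#include <string>

// Old on-disk model, kept only for reading legacy files.
struct LegacyDocument {
    // ... fields matching the old XML layout ...
};

// Current in-memory model used by the rest of the application.
struct Document {
    // ... fields matching the new XML layout ...
};

LegacyDocument loadLegacyXml(const std::string& path);   // old reader, frozen
Document       loadCurrentXml(const std::string& path);  // new reader

// The single place where the old model is mapped onto the new one.
Document convert(const LegacyDocument& legacy);

Document load(const std::string& path, int fileVersion) {
    if (fileVersion < 3)                     // "3" is just an example cutoff
        return convert(loadLegacyXml(path));
    return loadCurrentXml(path);
}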