Parsing huge data with C++

In my job, I need to parse different kinds of data files from different data sources. Sometimes I parse them by writing C++ code directly (with the help of Qt and Boost :D), sometimes manually with a helper program.
I must note that the data types are so different from each other that it is hard to create a common interface for all of them. But I want to do this job in a more generic way. I am planning to write a library to convert them, and it should be easy to add new parser utilities in the future. I am also planning to drive the other helper programs from inside my program, not manually.
My question is: what kind of architecture or pattern do you suggest? The basic condition is that the library must be extendable via new classes or DLLs, and it must also be configurable.
By the way, the data can be text, ASCII, or something like CSV (comma-separated values), and most of it is specific to a certain data source.

Not to blow my own trumpet, but my small open-source utility CSVfix has an extensible architecture based on deriving new C++ classes with a very simple interface. I did consider using a plugin architecture with DLLs, but it seemed like overkill for such a simple utility. If interested, you can get the binaries & sources here.

I'd suggest a 3-part model, where the common data format is a string, which should be able to hold any value:
Reader: In this layer the values are read from the source (e.g. a CSV file) using some sort of file-format descriptor. The values are then stored in an intermediate data structure.
Connector/Converter: This layer is responsible for mapping the reader data to the writer fields.
Writer: This layer is responsible for writing a specific data structure to the target (e.g. another file format or a database).
This way you can write different Readers for different input files.
I think the hardest part would be creating the definition of the intermediate storage format/structure so that it is future-proof and flexible.
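A minimal sketch of what those three layers might look like, with a string-keyed record as the common format (all names here are illustrative, not from the answer):

    #include <map>
    #include <string>
    #include <vector>

    // One record: column name -> value, everything kept as a string.
    using Record = std::map<std::string, std::string>;

    struct Reader {                    // e.g. a CsvReader built from a format descriptor
        virtual ~Reader() = default;
        virtual std::vector<Record> read() = 0;
    };

    struct Converter {                 // maps reader fields to writer fields
        virtual ~Converter() = default;
        virtual Record convert(const Record& in) = 0;
    };

    struct Writer {                    // e.g. a DbWriter or an XmlWriter
        virtual ~Writer() = default;
        virtual void write(const std::vector<Record>& records) = 0;
    };

    // The pipeline itself never needs to know the concrete formats.
    void run(Reader& r, Converter& c, Writer& w) {
        std::vector<Record> out;
        for (const auto& rec : r.read())
            out.push_back(c.convert(rec));
        w.write(out);
    }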

One method I used for defining the data structure in my datafile read/write classes is std::map<std::string, std::vector<std::string>, string_compare>, where the key is the variable name and the vector of strings is the data. While this is expensive in memory, it does not lock me into numeric-only data, and it allows for different lengths of data within the same file.
I had the base class implement this generic storage, while the derived classes implemented the reader/writer capability. I then used a factory to get to the desired handler, using another class that determined the file format.
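A rough sketch of that arrangement (the handler names are invented, and the custom string_compare comparator is omitted for brevity):

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Base class owns the generic storage: variable name -> column of string values.
    class DataFile {
    public:
        virtual ~DataFile() = default;
        virtual void read(const std::string& path) = 0;   // implemented per format
        virtual void write(const std::string& path) = 0;
    protected:
        std::map<std::string, std::vector<std::string>> data_;
    };

    class CsvFile : public DataFile {                     // one handler per format
    public:
        void read(const std::string& path) override  { /* fill data_ */ }
        void write(const std::string& path) override { /* dump data_ */ }
    };

    // The factory picks a handler once another class has determined the format.
    std::unique_ptr<DataFile> makeHandler(const std::string& formatName) {
        if (formatName == "csv") return std::make_unique<CsvFile>();
        return nullptr;
    }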

Related

Difference between RWDBTBuffer<T>, RWDBVector<T> and RWDBDecimalVector

I'm writing a Python script to generate C++ classes used for database access, and they use RogueWave types for data transfer. I have a few template classes I'm looking at to outline how the generated classes should look. When implementing a method for transferring several tuples in one operation, columns are wrapped in RWDBTBuffer, RWDBVector and RWDBDecimalVector.
My problem is, I can't see a direct correlation between the data type that is being wrapped (int, long, RWDateTime, RWDecimalPortable) and the container it is being placed in. It seems to me that I can just put everything in a RWDBTBuffer. What is the advantage of using RWDBDecimalVector over RWDBTBuffer for numeric types, and should RWDBVector ever be used?
In terms of the data they store, there isn't any difference.
The main difference is that you can shift an RWDBVector into an RWDBReader and then read the data into it.

Serialization with a custom pattern and random access with Boost

I'm asking here because I have already tried to search, but I have no idea whether these things even exist or what their names are.
Let me start by explaining what I mean by a custom pattern: suppose I need to serialize objects or data of types foo, bar and boo. Usually the library handles this for the user in a very simple way: what comes first goes in first in the serialization process, so if I serialize all the foo first, they are written "at the top" of the file, and all the bar and boo come after the foo.
Now I would like to keep order in my file and organize things based on a custom pattern. Is this possible with Boost? Which part of the library provides this feature?
The second thing, strictly related to the first, is that I would also like to access my serialized binary files without being forced to parse and read all the previous values just to extract the one I'm interested in, kind of like RAM, which works on memory addresses and offers random access without forcing you to walk all the other addresses.
Thanks.
On the first issue: the Boost serialization library is agnostic as to what happens after it turns an object into its serialized form. It does this by using input and output streams. Files are just that: std::ofstream/std::ifstream. For other types of streams, however, the order/pattern that you speak of doesn't make sense. Imagine you're sending serialized objects over the network: the library can't know that it would have to rearrange the order of objects and, in fact, it can't do that once they've been sent. For this reason, it does not support what you're looking for.
What you can do is create a wrapper that either caches the serialized versions of the objects and arranges them in memory before you tell it to write them out to a file, or that knows that, since you're working with files, it can later seekg to the appropriate place in the file and append (this approach would require you to store the locations of the objects you wrote to the file).
As for the second thing, random access file reading: you will have to know exactly where the object is in the file. If you know that the structure of your file won't change, you can seekg on the file stream before handing it to Boost for deserialization. If the file structure will change, however, you still need to know the location of objects in the file. If you don't want to parse the file to find it, you'll have to store it somewhere during serialization. For example, you can maintain a sort of registry of objects at the top of the file. You will still have to parse it, but it should be just a simple [object identifier]-[location in file] sort of thing.
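To illustrate the registry-of-offsets idea, here is a minimal sketch. It uses Boost.Serialization's no_header archive flag (my suggestion, not from the answer) so each object is a self-contained blob that can be deserialized on its own; the file name and struct are invented, and real code would also persist the index itself:

    #include <boost/archive/binary_iarchive.hpp>
    #include <boost/archive/binary_oarchive.hpp>
    #include <fstream>
    #include <map>
    #include <string>

    struct Foo {
        int x;
        template <class Ar> void serialize(Ar& ar, unsigned) { ar & x; }
    };

    int main() {
        std::map<std::string, std::streampos> index;   // the "registry": name -> offset
        {
            std::ofstream ofs("data.bin", std::ios::binary);
            Foo a{1}, b{2};
            index["a"] = ofs.tellp();
            {   // header-less archive: each object becomes a self-contained blob
                boost::archive::binary_oarchive oa(ofs, boost::archive::no_header);
                oa << a;
            }
            index["b"] = ofs.tellp();
            {
                boost::archive::binary_oarchive oa(ofs, boost::archive::no_header);
                oa << b;
            }
        }
        // Random access: seek straight to the recorded offset, skipping "a" entirely.
        std::ifstream ifs("data.bin", std::ios::binary);
        ifs.seekg(index["b"]);
        Foo b;
        boost::archive::binary_iarchive ia(ifs, boost::archive::no_header);
        ia >> b;
        return b.x == 2 ? 0 : 1;
    }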

C++, creating classes in runtime

I have a query. I have a set of flat files (say file1, file2, etc.) containing column names and native data types. (How the values are stored and can be read in C++ is elementary.)
E.g. flat file file1 may have data like
col1_name=id, col1_type=integer, col2_name=Name, col2_type=string and so on.
So for each flat file I need to create a C++ data structure (i.e. 1 flat file = 1 data structure) where each member variable has the same name as the column and a C++ native data type (int, float, string, etc.) matching the column type in the flat file.
From the above example, my flat file file1 should give me the declaration below:
class file1 {
    int id;
    string Name;
};
Is there a way I can write code in C++ such that the binary, once created, will read the flat file and create the data structure based on it (the class name being the same as the flat file name)? All the classes created from these flat files will have common getter and setter member functions.
Do let me know if you have done something similar earlier or have any idea for this.
No, not easily (see the other answers for reasons why not).
I would suggest having a look at Python instead for this kind of problem. Python's type system combined with its ethos of using try/except lends itself more easily to the challenge of parsing data.
If you really must use C++, then you might find a solution using the dynamic properties feature of Qt's QObject class, combined with the QVariant class. Although this would do what you want, I would add a warning that this is getting kind of heavy-weight and may over-complicate your task.
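For illustration, a tiny sketch of that Qt approach, treating each record as a QObject whose columns are dynamic properties (the property names are just the columns from the example file):

    #include <QObject>
    #include <QString>
    #include <QVariant>

    int main() {
        // One QObject per record; each column becomes a dynamic property.
        QObject record;
        record.setProperty("id", 42);                        // stored as QVariant(int)
        record.setProperty("Name", QStringLiteral("Alice")); // stored as QVariant(QString)

        const int id = record.property("id").toInt();
        const QString name = record.property("Name").toString();
        return (id == 42 && name == QLatin1String("Alice")) ? 0 : 1;
    }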
No, not directly. C++ is a compiled language. The code for every class is created by the compiler.
You would need a two-step process. First, write a program that reads those files and translates them into a .cpp file. Second, pass those .cpp files to a compiler.
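A minimal sketch of step one of such a generator, for the file1 format shown in the question (the file names and the type map are assumptions):

    #include <fstream>
    #include <map>
    #include <string>

    // Reads "colN_name=..., colN_type=..." pairs and emits a matching class.
    int main() {
        const std::map<std::string, std::string> cxxType{
            {"integer", "int"}, {"float", "float"}, {"string", "std::string"}};

        std::ifstream in("file1");      // the flat file from the question
        std::ofstream out("file1.h");   // step two: hand this to the compiler
        out << "#include <string>\n\nclass file1 {\npublic:\n";

        std::string field, name;
        while (std::getline(in, field, ',')) {
            const auto eq = field.find('=');
            if (eq == std::string::npos) continue;
            std::string key = field.substr(0, eq);
            std::string value = field.substr(eq + 1);
            key.erase(0, key.find_first_not_of(" \t\r\n"));     // strip ", " leftovers
            value.erase(value.find_last_not_of(" \t\r\n") + 1);
            if (key.find("_name") != std::string::npos)
                name = value;                                   // held until its type arrives
            else if (key.find("_type") != std::string::npos)
                out << "    " << cxxType.at(value) << ' ' << name << ";\n";
        }
        out << "};\n";
    }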
C++ classes are pure compile-time concepts and have no meaning at runtime, so they cannot be created then. However, you could just go with
std::vector<std::string> fields;
and parse as necessary in your accessor functions.
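A small sketch of that idea: keep everything as strings and convert only in the accessors (the class and method names are invented):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Every field is stored as text; conversion happens on access.
    class Row {
    public:
        explicit Row(std::vector<std::string> fields) : fields_(std::move(fields)) {}
        int asInt(std::size_t i) const { return std::stoi(fields_.at(i)); }
        double asDouble(std::size_t i) const { return std::stod(fields_.at(i)); }
        const std::string& asString(std::size_t i) const { return fields_.at(i); }
    private:
        std::vector<std::string> fields_;
    };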
No, but from what I can tell, you need to be able to store the names of multiple columns. What you can do is have a map or unordered_map member variable which you can index with a string (the name of the column) and get some data back (like a column object or something). That way you can do
obj.Columns["Name"]
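A bare-bones sketch of that idea (the Column type here is invented):

    #include <string>
    #include <unordered_map>

    // A hypothetical column object: the declared type name plus the raw text value.
    struct Column {
        std::string type;    // e.g. "integer", "string"
        std::string value;   // converted on demand
    };

    struct Obj {
        std::unordered_map<std::string, Column> Columns;
    };

    // Usage, as in the answer:
    //   Obj obj;
    //   obj.Columns["Name"] = {"string", "Alice"};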
I'm not sure there's a design pattern for this, but if your list of possible type names is finite and known at compile time, can't you declare all those classes in your program before running, and then just instantiate them based on the data in the files?
What you actually want is a field whose exact nature varies at runtime.
There are several approaches, including Boost.Any, but because of the static nature of the C++ type system only two are really recommended, and both require you to know beforehand all the possible data types that may be required.
The first approach is typical:
an Object base type,
Int, String, Date and whatever other derived types you need,
and the use of polymorphism.
The second requires a bit of Boost magic: boost::variant<int, std::string, date>.
Once you have the "variant" part covered, you need to implement visitation to distinguish between the different possible types. Typical visitors for the traditional object-oriented approach or simply boost::static_visitor<> and boost::apply_visitor combinations for the boost approach.
It's fairly straightforward.
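A minimal sketch of the Boost approach (the field types are just examples):

    #include <boost/variant.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    // A field whose exact type is only known at runtime.
    using Field = boost::variant<int, std::string>;   // extend with a date type etc.

    // static_visitor: one operator() overload per possible type.
    struct Print : boost::static_visitor<> {
        void operator()(int v) const { std::cout << "int: " << v << '\n'; }
        void operator()(const std::string& s) const { std::cout << "string: " << s << '\n'; }
    };

    int main() {
        std::vector<Field> row{42, std::string("Alice")};
        Print printer;
        for (const auto& f : row)
            boost::apply_visitor(printer, f);   // dispatches to the right overload
    }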

Suggestion on C++ object serialization techniques

I'm creating a C++ object serialization library. This is more towards self-learning and enhancement, and I don't want to use an off-the-shelf library like Boost or Google protocol buffers.
Please share your experience or comments on good ways to go about it (like creating some encoding with tag-value pairs, etc.).
I would like to start by supporting PODs, followed by support for non-linear data structures.
Thanks
PS: HNY2012
If you need serialization for inter-process communication, then I suggest using an interface definition language (IDL or ASN.1) for defining the interfaces.
That way it will be easier to add support for languages other than C++, and it will also be easier to implement a code/stub generator.
I have been working on something similar for the last few months. I couldn't use Boost because the task was to serialize a bunch of existing classes (a huge existing codebase), and it was inappropriate to have the classes inherit from an interface with a virtual serialize() function (we did not want multiple inheritance).
The approach taken had the following salient features:
Create a helper class for each existing class, designated with the task of serializing that particular class, and make the helper class a friend of the class being serialized. This avoids introduction of inheritance in the class being serialized, and also allows the helper class access to private variables.
Have each of the helper classes (let's call them 'serializers') register themselves into a global map. Each serializer class implements a clone() virtual function ('prototype' pattern), which allows one to retrieve a pointer to a serializer, given the name of the class, from this map. The name is obtained by using compiler-specific RTTI information. The registration into the global map is taken care of by instantiating static pointers and 'new'ing them, since static variables get created before the program starts.
A special stream object was created (derived from std::fstream), that contained template functions to serialize non-pointer, pointer, and STL data types. The stream object could only be opened in read-only or write-only modes (by design), so the same serialize() function could be used to either read from the file or write into the file, depending on the mode in which the stream was opened. Thus, there is no chance of any mismatch in the order of reading versus writing of the class members.
For every object being saved or restored, a unique tag (integer) was created based on the address of the variable and stored in a map. If the same address occurred again, only the tag was saved, not the deep-copied object itself. Thus, each object was deep copied only once into the file.
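A compressed sketch of the first two points above (all names are invented, and the dual-mode stream from the third point is reduced to a plain std::iostream here):

    #include <iostream>
    #include <map>
    #include <string>
    #include <typeinfo>

    class Widget;                               // an existing class we must not modify

    struct Serializer {                         // common interface for all helpers
        virtual ~Serializer() = default;
        virtual Serializer* clone() const = 0;  // the 'prototype' function
        virtual void serialize(std::iostream& s, void* obj) const = 0;
    };

    // Global map: RTTI class name -> prototype serializer.
    std::map<std::string, Serializer*>& registry() {
        static std::map<std::string, Serializer*> r;
        return r;
    }

    class Widget {
        int id_ = 0;
        friend struct WidgetSerializer;         // the only change to Widget itself
    };

    struct WidgetSerializer : Serializer {
        Serializer* clone() const override { return new WidgetSerializer(*this); }
        void serialize(std::iostream& s, void* obj) const override {
            s << static_cast<Widget*>(obj)->id_;   // friendship grants private access
        }
    };

    // Static registration runs before main(), as described in the answer.
    static Serializer* const widgetSerializer =
        registry()[typeid(Widget).name()] = new WidgetSerializer;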
A page on the web captures some of these ideas shared above: http://www.cs.sjsu.edu/~pearce/modules/lectures/cpp/Serialization.htm. Hope that helps.
I wrote an article some years ago. The code and tools may be obsolete, but the concepts remain the same.
Maybe it can help you.

What is the best design pattern to register data "chunks"?

I have a library which can save/load to disk "chunks", which are POD structs with constant size and a unique static CHUNK_ID field. So Load looks something like this:
void Load(int docId, char* ptr, int type, size_t& size)...
If you want to add a new chunk, you just add a struct with a new CHUNK_ID and use the Save and Load functions with it, as sketched below.
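For illustration, a new chunk as described might look like this (PositionChunk and its fields are invented; Load is the signature quoted above):

    #include <cstddef>

    // The library's load function, as quoted in the question.
    void Load(int docId, char* ptr, int type, std::size_t& size);

    // A hypothetical new chunk: fixed-size POD with a unique static id.
    struct PositionChunk {
        static const int CHUNK_ID = 7;
        double x, y, z;
    };

    void loadPosition(int docId, PositionChunk& p) {
        std::size_t size = sizeof(p);
        Load(docId, reinterpret_cast<char*>(&p), PositionChunk::CHUNK_ID, size);
    }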
What I want is to force all "chunks" to have functions like PrintHumanReadable, CompareThisTypeOfChunk, etc. (ideally the program should not compile without such functions). I also want to mark/register/enumerate all chunk structs.
I have a few ideas but all of them have problems.
Create a base class with pure virtual functions PrintHumanReadable and CompareThisTypeOfChunk.
Problem: breaks the POD type and requires rewriting the library.
Implement a factory which creates a chunk struct from its CHUNK_ID. Problem: it still compiles when I add a new chunk without the required functions.
Could you recommend an elegant design solution for my problem?
Implement a simple code generator. You can use something like Mako or Cheetah (both Python libraries). Make a text file containing all the class names, then have the generator build the factory method and a series of methods which aren't really used but which refer to the desired methods in all the classes. This will also make it straightforward to enumerate the classes (again, using generated code).
The proper design pattern for this is called "use Boost.Serialization". It's really the best tool for writing objects to a format and then reading them back later. It can write text, binary, and even XML formats (and others if you write a proper stream for them). It can be non-intrusive, so you don't need to modify the objects to serialize them. And so forth.
Once you're using the proper tool for this job, you can then use whatever class hierarchy or other method you like to ensure that the proper functions for an object exist.
If you can't/won't use Boost.Serialization, then you're pretty much stuck with a runtime solution. And since the solution is runtime rather than compile time, there's no way to ensure at compile time that any particular chunk ID has the requisite functions.