Data structures to implement unknown table schema in c/c++? - c++

Our task is to read information about table schema from a file, implement that table in c/c++ and then successfully run some "select" queries on it. The table schema file may have contents like this,
Tablename- Student
"ID","int(11)","NO","PRIMARY","0","".
Now, my question is what data structures would be appropriate for the task. The problem is that I do not know the number of columns a table might have, neither as to what might the name of those columns be nor any idea about their data types. For example, a table might have just one column of type int, another might have 15 columns of varying data types. Infact, I don't even know the number of tables whose description the schema file might have.
One way I thought of was to have a set number of say, 20 vectors (assuming that the upper limit of the columns in a table is 20), name those vectors 1stvector, 2ndvector and so on, map the name of the columns to the vectors, and then use them accordingly. But it seems the code for it would be a mess with all those if/else statements or switch case statements (for the mapping).
While googling/stack-overflowing, I learned that you can't describe a class at runtime otherwise the problem might have been easier to solve.
Any help is appreciated.
Thanks.

As a C++ data structure, you could try a std::vector< std::vector<boost::any> >. A vector is part of the Standard Library and allows dynamic rescaling of the number of elements. A vector of vectors would imply an arbitrary number of rows with an arbitray number of columns. Boost.Any is not part of the Standard Library but widely available and allows storing arbitrary types.
I am not aware of any good C++ library to do SQL queries on that data structure. You might need to write your own. E.g. the SQL commands select and where would correspond to the STL algorithm std::find_if with an appropriate predicate passed as a function object.

To deal with the lack of knowledge about the data column types you almost have to store the raw input (i.e. strings which suggests std:string) and coerce the interpretation as needed later on.
This also has the advantage that the column names can be stored in the same type.
If you realy want to determine the column type you'll need to speculatively parse each column of input to see what it could be and make decisions on that basis.
Either way if the input could contain a column that has the column separation symbol in it (say a string including a space in otherwise white space separated data) you will have to know the quoting convention of the input and write a parses of some kind to work on the data (sucking whole lines in with getline is your friend here). Your input appears to be comma separated with double quote deliminated strings.

I suggest using std::vector to hold all the table creation statements. After all the creation statements are read in, you can construct your table.
The problem to overcome is the plethora of column types. All the C++ containers like to have a uniform type, such as std::vector<std::string>. You will have different column types.
One solution is to have your data types descend from a single base. That would allow you to have std::vector<Base *> for each row of the table, where the pointers can point to fields of different {child} types.
I'll leave the rest up to the OP to figure out.

Related

Reading a .txt file using a struct data structure without knowing the size of the structure needed

I need to read data from a .txt file into C++ using the struct data structure. I can do this fine; however, if I don't know the initial size of the data structure needed, what would I do then?
For example, you normally need to define your structure like "Points[20]" if you have 20 pieces of data.
So, is there a way to read a .txt file of any size into a struct data structure?
I have had a look on the site but can't find any question similar to mine.
I think this question has some bad assumptions. It would be better with a more specific example and a code attempt to go off of in order to make sure the OP's concerns are addressed.
However, I can answer that:
A data structure can be a struct or a class or aggregates thereof
A data structure can be made with dynamic sizing or dynamic sizing members
It is important to come up with a specification for your files and the data structures that the file will be deserialized into
For example:
I could say that I will create a file that contains
A leading byte defining how many values will follow, representing an unsigned number 0-255
A number of values one byte each, ranging from 0 - 255
Then I could easily create a parse method that stuffed them into an std::vector of char or std::vector of unsigned or some other dynamic sizing container.
I may or may not create a struct that wraps the vector and the Parse method.
There certainly isn't one way of doing things, but the main point here is that you are looking for dynamically sized containers
Without knowing the structure of the text file, only a general answer can be given.
CSV Files
Many text files are organized as columns (fields) and rows (records). The columns are separated by commas, tabs, vertical bars or other separated characters. Too many duplicates on StackOverflow, so search the internet for "C++ read CSV".
HTML & XML
Some text files have data organized like a tree structure, with records of records. Examples are HTML and XML. Use a library to read these.
Online Judge
These text files contain data for a programming problem. Usually they have one row for the number of test cases, then followed by data for each test case. The organization of the data is problem specific.
Other
Text file can have their data in any, arbitrary, format. You'll need to contact the producer of the data for the layout.
Dynamic Data Quantity
In many instances, the quantity of the data records is not known at compile time. You'll need a dynamically sized container, such as std::vector, std::list, std::map, etc.

Handling large amounts of data in C++, need approach

So I have a 1GB file in a CSV format like so, that I converted to a SQLite3 database
column1;column2;column3
1212;abcd;20090909
1543;efgh;20120120
Except that I have 12 columns. Now, I need to read and sort this data and reformat it for output, but when I try to do this it seems I run out of RAM (using vectors). I read it in from SQLite and store each line of the file in a struct which is then pushed back to a deque. Like I said, I run out of memory when the RAM usage approaches 2gb, and the app crashes. I tried using STXXL but apparently it does not support vectors of non-POD types (so it has to be long int, double, char etc), and my vector consists mostly of std::string's, some boost::date's and one double value.
Basically what I need to do is group all "rows" together that has the same value in a specific column, in other words, I need to sort data based on one column and then work with that.
Any approach as to how I can read in everything or at least sort it? I would do it with SQLite3 but that seems time consuming. Perhaps I'm wrong.
Thanks.
In order of desirability:
don't use C++ at all, just use sort if possible
if you're wedded to using a DB to process a not-very-large csv file in what sounds like a not-really-relational way, shift all the heavy lifting into the DB and let it worry about memory management.
if you must do it in C++:
skip the SQLite3 step entirely since you're not using it for anything. Just map the csv file into memory, and build a vector of row pointers. Sort this without moving the data around
if you must parse the rows into structures:
don't store the string columns as std::string - this requires an extra non-contiguous allocation, which will waste memory. Prefer an inline char array if the length is bounded
choose the smallest integer size that will fit your values (eg, uint16_t would fit your sample first column values)
be careful about padding: check the sizeof your struct, and reorder members or pack it if it's much larger than expected
If you want to stick with the SQLite3 approach, I recommend using a list instead of a vector so your operating system doesn't need to find 1GB or more of continuous memory.
If you can skip the SQLite3 step, here is how I would solve the problem:
Write a class (e.g. MyRow) which has a field for every column in your data set.
Read the file into a std::list<MyRow> where every row in your data set becomes an instance of MyRow
Write a predicate which compares the desired column
Use the sort function of the std::list to sort your data.
I hope this helps you.
There is significant overhead for std::string. If your struct contains a std::string for each column, you will waste a lot of space on char * pointers, malloc headers, etc.
Try parsing all the numerical fields immediately when reading the file, and storing them in your struct as ints or whatever you need.
If your file actually contains a lot of numeric fields like your example shows, I would expect it to use less than the file size worth of memory after parsing.
Create a structure for your records.
The record should have "ordering" functions for the fields you need to sort by.
Read the file as objects and store into a container that has random-access capability, such as std::vector or std::array.
For each field you want to sort by:
Create an index table, std::map, using the field value as the key and the record's index as the value.
To process the fields in order, choose your index table and iterate through the index table. Use the value field (a.k.a. index) to fetch the object from the container of objects.
If the records are of fixed length or can be converted to a fixed length, you could write the objects in binary to a file and position the file to different records. Use an index table, like above, except use file positions instead of indices.
Thanks for your answers, but I figured out a very fast and simple approach.
I let SQLite3 do the job for me by giving it this query:
SELECT * FROM my_table ORDER BY key_column ASC
For a 800MB file, that took about 70 seconds to process, and then I recieved all the data in my C++ program, already ordered by the column I wanted them grouped by, and I processed the column one group at a time, and outputted them one at a time in my desired output format, keeping my RAM free from overload. Total time for the operation about 200 seconds, which I'm pretty happy with.
Thank you for your time.

packet table with heterogeneous rows

I'm trying to store some data in HDF5 using the C++ API, with several requirements:
an arbitrary number of entries can be stored,
each entry has the same number of rows (of types int and double),
the number and type of the rows should be determined at runtime.
I think the right way to implement this is as a packet table, but the example I've been able to find stores only one native type per entry. I'd like to store several, similar to a compound datatype but again the example I found isn't sufficient because it stores a struct, which can't be written at runtime. Are there some examples where this is done? Or just some high-level API that I missed?
Are you still looking for an answer? I cannot make out what you are after exactly.
A packet table is a special form of dataset. The number of records in a packet table can be unbounded.
Since you mention that setting a size of struct at compile time for a compound datatype does not work for you, you might try to separate your data and correlate it in some way.
Arrays in isolation can be written to a dataset with their rank and size set at run time.
Data within an HDF file can be linked, either with your own method, or using HDF links. You can link the individual array data records together, along with the matching compound data, if any.
Hope that helps.

Hierarchical filtered lookup in C++

I have been pondering a data structure problem for a while, but can't seem to come up with a good solution. I can not shake off the feeling that the solution is simple and I'm just not seeing it, however, so hopefully you guys can help!
Here is the problem: I have a large collection of objects in memory. Each of them has a number of data fields. Some of the data fields, such as an ID, are unique for each objects, but others, such as a name, can appear in multiple objects.
class Object {
size_t id;
std::string name;
Histogram histogram;
Type type;
...
};
I need to organize these objects in a way that will allow me to quickly (even if the number of objects is relatively large, i.e. millions) filter the collection given a specification of an arbitrary number of object members while all members that are left unspecified count as wildcards. For example, if I specify a given name, I want to retrieve all the objects whose name member equals the given name. However, if I then add a histogram to the query, I would like the query to return only the objects that match in both the name and the histogram fields, and so on. So, for example, I'd like a function
std::set<Object*> retrieve(size_t, std::string, Histogram, Type)
that can both do
retrieve(42, WILDCARD, WILDCARD, WILDCARD)
as well as
retrieve(42, WILDCARD, WILDCARD, Type_foo)
where the second call would return fewer or equally as many objects as the first one. Which data structure allows queries like this and can both be constructed and queried in reasonable time for object counts in the millions?
Thanks for the help!
First you could use Boost Multi-index to implement efficent lookup over differnt members of your Object. This could help to limit the number of elements to consider. As a second step you can simply use a lambda expression to implement a predicate for std::find_if to get first element or use std::copy_if to copy all elements to an target sequence. If you decide to use boost you can use Boost Range with filtering.

C++: Implementing Row Iterator for Table

I have a generic table class implemented in C++ that uses a shared_ptr< ptr_vector< vector<T> > > as its backing, where T is an arbitrary typename; the ptr_vector contains pointers to the vectors corresponding to the columns in the table. I decided to wrap the ptr_vector in a shared_ptr since the tables may contain many millions of rows, and the vectors containing data for each column in a ptr_vector for the same reason. (Please tell me if this can be improved.)
Implementing column-wise operations on this table is trivial, since I have access to the native iterator supplied by the vector. However, I also need the table to support row-wise operations: relatively mundane operations such as adding and removing rows should be supported, as well as the ability to use the STL algorithms with the table. Now, I have run across some design issues that I need some help to address:
It seems that implementing a custom iterator to conduct row-wise operations is necessary to accomplish what is describe above. Would boost::iterator_adaptor be the right way to go about doing this?
When the user adds rows to the table, I do not wish to impose a specific data structure upon the user -- how would I go about doing this? I am thinking of accepting iterators as parameters to the add_row() method.
If you think that I should be implementing this table structure differently, I am open to any suggestions that you may have for me. It was originally designed with the intent to store strings read from tab-delimited files containing hundreds of thousands of row entries.
Thank you very much for your help!
The Boost library has a container called multi_array which provides and n-dimensional dynamic array which can be accessed and manipulated along each dimension. This seems to be very similar to what you are trying to build, perhaps you could use it instead?