packet table with heterogeneous rows - c++

I'm trying to store some data in HDF5 using the C++ API, with several requirements:
an arbitrary number of entries can be stored,
each entry has the same number of rows (of types int and double),
the number and type of the rows should be determined at runtime.
I think the right way to implement this is as a packet table, but the only example I've been able to find stores a single native type per entry. I'd like to store several, similar to a compound datatype; but again, the example I found isn't sufficient, because it stores a struct, whose layout can't be defined at runtime. Are there some examples where this is done? Or just some high-level API that I missed?

Are you still looking for an answer? I cannot make out what you are after exactly.
A packet table is a special form of dataset. The number of records in a packet table can be unbounded.
Since you mention that fixing a struct's layout at compile time for a compound datatype does not work for you, you might try to separate your data and correlate it in some way.
Arrays in isolation can be written to a dataset with their rank and size set at run time.
Data within an HDF file can be linked, either with your own method, or using HDF links. You can link the individual array data records together, along with the matching compound data, if any.
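If building the compound type at run time turns out to be workable after all, here is a minimal sketch using the high-level packet table API (hdf5_hl); the file name, dataset name and column layout are placeholder assumptions, not from your question:

// Sketch: build an HDF5 compound type at runtime and append one
// record to a packet table. Column names/types are illustrative.
#include <hdf5.h>
#include <hdf5_hl.h>
#include <cstring>
#include <vector>

int main() {
    hid_t file = H5Fcreate("runtime.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    // Record layout decided at runtime: one int followed by one double.
    size_t record_size = sizeof(int) + sizeof(double);
    hid_t record_type = H5Tcreate(H5T_COMPOUND, record_size);
    H5Tinsert(record_type, "id",    0,           H5T_NATIVE_INT);
    H5Tinsert(record_type, "value", sizeof(int), H5T_NATIVE_DOUBLE);

    // A packet table grows without a preset record count.
    hid_t table = H5PTcreate_fl(file, "packets", record_type, 512, -1);

    // Pack one record into a raw byte buffer and append it.
    std::vector<unsigned char> buf(record_size);
    int id = 42; double value = 3.14;
    std::memcpy(buf.data(), &id, sizeof id);
    std::memcpy(buf.data() + sizeof id, &value, sizeof value);
    H5PTappend(table, 1, buf.data());

    H5PTclose(table);
    H5Tclose(record_type);
    H5Fclose(file);
}

Because the fields are assembled into a raw byte buffer by hand, the struct never has to exist at compile time.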
Hope that helps.

Related

Reading a .txt file using a struct data structure without knowing the size of the structure needed

I need to read data from a .txt file into C++ using the struct data structure. I can do this fine; however, if I don't know the initial size of the data structure needed, what would I do then?
For example, you normally need to define your structure like "Points[20]" if you have 20 pieces of data.
So, is there a way to read a .txt file of any size into a struct data structure?
I have had a look on the site but can't find any question similar to mine.
I think this question has some bad assumptions. It would be better with a more specific example and a code attempt to work from, to make sure the OP's concerns are addressed.
However, I can answer that:
A data structure can be a struct or a class or aggregates thereof
A data structure can be made with dynamic sizing or dynamic sizing members
It is important to come up with a specification for your files and the data structures that the file will be deserialized into
For example:
I could say that I will create a file that contains
A leading byte defining how many values will follow, representing an unsigned number 0-255
A number of values one byte each, ranging from 0 - 255
Then I could easily create a parse method that stuffed them into an std::vector of char or std::vector of unsigned or some other dynamic sizing container.
I may or may not create a struct that wraps the vector and the Parse method.
There certainly isn't one way of doing things, but the main point here is that you are looking for dynamically sized containers.
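A minimal sketch of that format, with the file name and function name made up for illustration:

// Read a leading count byte, then that many one-byte values,
// into a dynamically sized container.
#include <cstdint>
#include <fstream>
#include <istream>
#include <vector>

std::vector<uint8_t> parse(std::istream& in) {
    uint8_t count = 0;
    in.read(reinterpret_cast<char*>(&count), 1);        // how many values follow
    std::vector<uint8_t> values(count);
    in.read(reinterpret_cast<char*>(values.data()), count);
    return values;                                      // sized at run time
}

int main() {
    std::ifstream file("data.bin", std::ios::binary);   // hypothetical input file
    std::vector<uint8_t> values = parse(file);
}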
Without knowing the structure of the text file, only a general answer can be given.
CSV Files
Many text files are organized as columns (fields) and rows (records). The columns are separated by commas, tabs, vertical bars or other separator characters. There are too many duplicates on StackOverflow, so search the internet for "C++ read CSV".
HTML & XML
Some text files have data organized like a tree structure, with records of records. Examples are HTML and XML. Use a library to read these.
Online Judge
These text files contain data for a programming problem. Usually they have one row giving the number of test cases, followed by the data for each test case. The organization of the data is problem specific.
Other
Text files can have their data in any arbitrary format. You'll need to contact the producer of the data for the layout.
Dynamic Data Quantity
In many instances, the quantity of the data records is not known at compile time. You'll need a dynamically sized container, such as std::vector, std::list, std::map, etc.
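As a sketch of the usual pattern, here is a reader for an unknown number of records; the Point struct is an assumed example, not from the question:

// Read whitespace-separated records until EOF into a vector,
// which grows as records arrive.
#include <iostream>
#include <vector>

struct Point { double x, y; };

int main() {
    std::vector<Point> points;
    Point p;
    while (std::cin >> p.x >> p.y)   // stops at EOF or malformed input
        points.push_back(p);
    std::cout << "read " << points.size() << " points\n";
}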

c++ and oracle DB and linked lists, which data structures to use?

I am writing a program that receives data over time, and I am looking for different patterns in the data.
I have to save data for the different processes that I create in the program, for future calculations.
I want to save the data to an Oracle DB (which has some support for storing objects).
The information that I want to relate to a new process has the following structure:
list of logic expressions:
stage 1 ->(a*b*c)+(d*e)+..(can have more conditions)
stage 2 ->(f*a*c)+(a*b)+..
stage 3 ->(g*h*i)+(j*k)+..
Each letter (a, b, c, d, etc.) represents a logic function that has different parameters related to it; I need to save these parameters for future use of each logic function.
The * represents logical AND.
The + represents logical OR.
The question is how to implement it?
I can create an object for each letter, e.g. for "a" (which can be a function or a condition that needs to be checked, etc.), and save the data of this object to the Oracle DB.
A number can be given to each process to identify it; however, I am not sure how to identify each of the logic functions (e.g. "a"), because I later need to assemble the data from the database back into the original process that I am handling (e.g. stage 1).
Regarding linked lists, I am not sure whether to use them in my program to represent the structure of the logic in each stage, e.g. a->b->c->(new OR expression)->d->e, or whether there is a better solution. I could also save this information as a string and parse it later,
e.g. string command="stage 1 ->(a*b*c)+(d*e)"
In case I use linked lists, I am not sure how to save the structure of the lists to the database.
For the external structure (stage1, stage2, stage3, etc.) I am also not sure whether to use linked lists, and how to save them to a database.
I would appreciate some advice on how to build it.
Thanks!
Let's build this from the bottom up. You want to refrain from writing your own linked list structures, if possible.
A stage consists of one or more products that are summed.
The products are pointers to functions or function objects; let's call them functors.
Each group of function objects could be a std::vector<function_object>. The results within a group are multiplied together, which can be handled in a loop.
A stage is one or more of the above vectors. This can be expressed as:
std::vector< std::vector<function_object> >
You can add another dimension for the stages:
std::vector< std::vector< std::vector<function_object> > >
If you prefer to use linked list, replace std::vector with std::list.
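As a sketch of how the nested vectors evaluate (the functor signature here is an assumption):

// Evaluate one stage as a sum (OR) of products (AND).
#include <functional>
#include <vector>

using Functor = std::function<bool()>;
using Product = std::vector<Functor>;   // terms ANDed together
using Stage   = std::vector<Product>;   // products ORed together

bool evaluate(const Stage& stage) {
    for (const Product& product : stage) {   // OR over products
        bool all = true;
        for (const Functor& f : product)     // AND within a product
            all = all && f();
        if (all)
            return true;
    }
    return false;
}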
Edit 1: Function IDs not objects
Most databases have a difficult time storing code for a function. So, you'll have to use function identifiers instead.
A function identifier is a number associated with a function. This association will be in your code and not in the data. The easiest implementation is to use an array of function objects or pointers. Use the function identifier as an index into the array, then retrieve the functor.
A more robust method is to use a table of <function_id, functor>. This structure allows for the records to be in any order, and the records can be deleted without damaging the code. With the vector, slots must never be removed.
typedef bool (*Function_Pointer)();  // assumed signature for the logic functions

struct Table_Entry
{
unsigned int function_id;
Function_Pointer p_function;
const char * function_name;
};
Table_Entry function_associations[] =
{
{5, logic_function_1, "Logic Function 1"},
//...
};
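And a small sketch of the lookup side, using the association table above:

// Retrieve a functor by the ID stored in the database;
// a linear search is fine for a small table.
Function_Pointer find_function(unsigned int id)
{
    for (const Table_Entry& entry : function_associations)
        if (entry.function_id == id)
            return entry.p_function;
    return nullptr;  // unknown ID
}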

Handling large amounts of data in C++, need approach

So I have a 1GB file in a CSV format like so, which I converted to a SQLite3 database:
column1;column2;column3
1212;abcd;20090909
1543;efgh;20120120
Except that I have 12 columns. Now, I need to read and sort this data and reformat it for output, but when I try to do this, it seems I run out of RAM (using vectors). I read it in from SQLite and store each line of the file in a struct, which is then pushed back into a deque. Like I said, I run out of memory when RAM usage approaches 2GB, and the app crashes. I tried using STXXL, but apparently it does not support vectors of non-POD types (so it has to be long int, double, char, etc.), and my vector consists mostly of std::strings, some boost::dates, and one double value.
Basically what I need to do is group all "rows" together that has the same value in a specific column, in other words, I need to sort data based on one column and then work with that.
Any approach as to how I can read in everything, or at least sort it? I would do it with SQLite3, but that seems time-consuming. Perhaps I'm wrong.
Thanks.
In order of desirability:
don't use C++ at all, just use sort if possible
if you're wedded to using a DB to process a not-very-large csv file in what sounds like a not-really-relational way, shift all the heavy lifting into the DB and let it worry about memory management.
if you must do it in C++:
skip the SQLite3 step entirely since you're not using it for anything. Just map the CSV file into memory and build a vector of row pointers. Sort this without moving the data around.
if you must parse the rows into structures:
don't store the string columns as std::string - this requires an extra non-contiguous allocation, which will waste memory. Prefer an inline char array if the length is bounded.
choose the smallest integer size that will fit your values (e.g., uint16_t would fit your sample first-column values)
be careful about padding: check the sizeof your struct, and reorder members or pack it if it's much larger than expected
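As a sketch of that padding point (exact sizes depend on the ABI, so the numbers in the comments are assumptions for a typical 64-bit target):

// Member order changes sizeof through padding.
#include <cstdint>
#include <iostream>

struct Padded {      // likely 24 bytes: narrow, wide, narrow
    uint16_t id;
    double   value;
    uint16_t year;
};

struct Packed {      // likely 16 bytes: widest member first
    double   value;
    uint16_t id;
    uint16_t year;
};

int main() {
    std::cout << sizeof(Padded) << " vs " << sizeof(Packed) << '\n';
}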
If you want to stick with the SQLite3 approach, I recommend using a list instead of a vector so your operating system doesn't need to find 1GB or more of contiguous memory.
If you can skip the SQLite3 step, here is how I would solve the problem:
Write a class (e.g. MyRow) which has a field for every column in your data set.
Read the file into a std::list<MyRow>, where every row in your data set becomes an instance of MyRow.
Write a predicate which compares the desired column.
Use std::list's sort member function to sort your data.
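A minimal sketch of those steps, with an assumed three-column MyRow:

#include <list>
#include <string>

struct MyRow {
    int         column1;   // parsed numeric field
    std::string column2;
    std::string column3;   // date kept as text in this sketch
};

int main() {
    std::list<MyRow> rows{{1543, "efgh", "20120120"},
                          {1212, "abcd", "20090909"}};
    // std::list sorts in place without needing contiguous memory.
    rows.sort([](const MyRow& a, const MyRow& b) {
        return a.column1 < b.column1;
    });
}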
I hope this helps you.
There is significant overhead for std::string. If your struct contains a std::string for each column, you will waste a lot of space on char * pointers, malloc headers, etc.
Try parsing all the numerical fields immediately when reading the file, and storing them in your struct as ints or whatever you need.
If your file actually contains a lot of numeric fields like your example shows, I would expect it to use less than the file size worth of memory after parsing.
Create a structure for your records.
The record should have "ordering" functions for the fields you need to sort by.
Read the file as objects and store into a container that has random-access capability, such as std::vector or std::array.
For each field you want to sort by:
Create an index table, std::map, using the field value as the key and the record's index as the value.
To process the records in order, choose your index table and iterate through it. Use the value field (i.e. the index) to fetch the object from the container of objects.
If the records are of fixed length or can be converted to a fixed length, you could write the objects in binary to a file and position the file to different records. Use an index table, like above, except use file positions instead of indices.
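A sketch of the index-table idea; the Record layout and key column are assumed for illustration:

// Map each key value to the record's index in the container.
// A multimap allows duplicate keys, so equal values group together.
#include <map>
#include <string>
#include <vector>

struct Record {
    std::string key_column;
    double      value;
};

int main() {
    std::vector<Record> records{{"b", 2.0}, {"a", 1.0}, {"b", 3.0}};

    std::multimap<std::string, std::size_t> index;
    for (std::size_t i = 0; i < records.size(); ++i)
        index.insert({records[i].key_column, i});

    // Iterating the index visits records ordered (and grouped) by key.
    for (const auto& entry : index) {
        const Record& r = records[entry.second];
        (void)r;  // process the record here
    }
}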
Thanks for your answers, but I figured out a very fast and simple approach.
I let SQLite3 do the job for me by giving it this query:
SELECT * FROM my_table ORDER BY key_column ASC
For an 800MB file, that took about 70 seconds to process, and then I received all the data in my C++ program, already ordered by the column I wanted them grouped by. I processed the rows one group at a time and output each group in my desired format, keeping my RAM free from overload. Total time for the operation was about 200 seconds, which I'm pretty happy with.
Thank you for your time.

Data structures to implement unknown table schema in c/c++?

Our task is to read information about table schema from a file, implement that table in c/c++ and then successfully run some "select" queries on it. The table schema file may have contents like this,
Tablename- Student
"ID","int(11)","NO","PRIMARY","0","".
Now, my question is what data structures would be appropriate for the task. The problem is that I do not know the number of columns a table might have, what the names of those columns might be, or anything about their data types. For example, a table might have just one column of type int; another might have 15 columns of varying data types. In fact, I don't even know the number of tables whose descriptions the schema file might contain.
One way I thought of was to have a set number of say, 20 vectors (assuming that the upper limit of the columns in a table is 20), name those vectors 1stvector, 2ndvector and so on, map the name of the columns to the vectors, and then use them accordingly. But it seems the code for it would be a mess with all those if/else statements or switch case statements (for the mapping).
While googling/stack-overflowing, I learned that you can't describe a class at runtime otherwise the problem might have been easier to solve.
Any help is appreciated.
Thanks.
As a C++ data structure, you could try a std::vector< std::vector<boost::any> >. A vector is part of the Standard Library and allows dynamic resizing of the number of elements. A vector of vectors would imply an arbitrary number of rows with an arbitrary number of columns. Boost.Any is not part of the Standard Library but is widely available and allows storing arbitrary types.
I am not aware of any good C++ library to do SQL queries on that data structure. You might need to write your own. E.g. the SQL commands select and where would correspond to the STL algorithm std::find_if with an appropriate predicate passed as a function object.
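A minimal sketch of that structure (requires Boost; std::any is the modern equivalent):

// Each cell is a boost::any, so a row can mix column types.
#include <boost/any.hpp>
#include <string>
#include <vector>

int main() {
    using Row   = std::vector<boost::any>;
    using Table = std::vector<Row>;

    Table student;
    student.push_back(Row{1, std::string("Alice")});  // ID, name

    // Reading a cell back requires knowing (or probing) its type.
    int id = boost::any_cast<int>(student[0][0]);
    (void)id;
}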
To deal with the lack of knowledge about the data column types, you almost have to store the raw input (i.e. strings, which suggests std::string) and coerce the interpretation as needed later on.
This also has the advantage that the column names can be stored in the same type.
If you really want to determine the column type, you'll need to speculatively parse each column of input to see what it could be and make decisions on that basis.
Either way, if the input could contain a column that has the column separator symbol in it (say, a string including a space in otherwise whitespace-separated data), you will have to know the quoting convention of the input and write a parser of some kind to work on the data (sucking whole lines in with getline is your friend here). Your input appears to be comma-separated with double-quote-delimited strings.
I suggest using std::vector to hold all the table creation statements. After all the creation statements are read in, you can construct your table.
The problem to overcome is the plethora of column types. All the C++ containers like to have a uniform type, such as std::vector<std::string>. You will have different column types.
One solution is to have your data types descend from a single base. That would allow you to have std::vector<Base *> for each row of the table, where the pointers can point to fields of different {child} types.
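As a sketch of that single-base idea (class names are made up, and smart pointers stand in for the raw Base * mentioned above):

// A common base lets one row hold fields of different types.
#include <memory>
#include <string>
#include <vector>

struct Field {
    virtual ~Field() = default;
    virtual std::string to_string() const = 0;
};

struct IntField : Field {
    int value;
    explicit IntField(int v) : value(v) {}
    std::string to_string() const override { return std::to_string(value); }
};

struct TextField : Field {
    std::string value;
    explicit TextField(std::string v) : value(std::move(v)) {}
    std::string to_string() const override { return value; }
};

// One row of the table: mixed column types behind the base class.
using Row = std::vector<std::unique_ptr<Field>>;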
I'll leave the rest up to the OP to figure out.

store data dynamically coming from database - c++

Can you give me an idea of the best way to store data coming from a database dynamically? I know the number of columns beforehand, so I want to create a data structure dynamically that will hold all the data, and I need to reorganize the data to show the output. In short: when we type the query "select * from table", results will come back. How do I store the results dynamically (using structures, maps, lists, ...)? Thanks in advance.
In a nutshell, the data structures you use to store the data really depend on your data usage patterns. That is:
Is your need for the data simply to output it? If so, why store the data at all?
If not, will you be performing searches on the data?
Is ordering important?
Will you be performing computations with the data?
How much data will you need to hold?
etc...
An array of strings (StringList in Delphi, not sure what you have in C++), one per row, where each row is a comma-separated string. This can be easily dumped and read into Excel as a .csv file, imported into lots of databases.
Or, an XML document may be best. Or something else. "it depends...."
There are quite a number of options for you, preferably from the STL. Use a class to store one row as an object, or keep each row as a string if rows are quite large and you don't want to create objects or access all the rows returned.
1) Use a vector - Create objects of the class with smart pointers (shared_ptr) and push them into a vector. Because of the copying involved in a vector, I would use a shared_ptr. Sort it later.
2) Use a map/set - Creating and inserting elements may be costly if you are looking for faster inserts. Lookup may be faster.
3) Hash map - Insertion and lookup times are better than for a map/set.
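A minimal sketch of option 1, with an assumed two-column row type:

// Rows held via shared_ptr in a vector; sorting moves only pointers.
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Row {
    int         id;
    std::string name;
};

int main() {
    std::vector<std::shared_ptr<Row>> rows;
    rows.push_back(std::make_shared<Row>(Row{2, "b"}));
    rows.push_back(std::make_shared<Row>(Row{1, "a"}));

    std::sort(rows.begin(), rows.end(),
              [](const std::shared_ptr<Row>& a,
                 const std::shared_ptr<Row>& b) { return a->id < b->id; });
}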