Can you give me an idea of the best way to store data coming from a database dynamically? I know the number of columns in advance, so I want to create a data structure at run time that will hold all the data, and I need to reorganize the data to show the output. In short: when we run the query "select * from table", results come back. How can I store those results dynamically (using structures, maps, lists, ...)? Thanks in advance.
In a nutshell, the data structures you use to store the data really depend on your data usage patterns. That is:
Is your need for the data simply to output it? If so, why store the data at all?
If not, will you be performing searches on the data?
Is ordering important?
Will you be performing computations with the data?
How much data will you need to hold?
etc...
One option is an array of strings (StringList in Delphi, not sure what you have in C++), one per row, where each row is a comma-separated string. This can easily be dumped to a .csv file and read into Excel, or imported into lots of databases.
Or, an XML document may be best. Or something else. "it depends...."
There are quite a number of options, preferably from the STL. Use a class to store one row per object, or store each row in a string if you don't want to create objects because the rows are large and you don't need to access every row returned.
1) Use a vector - Use smart pointers (shared_ptr) to create objects of the class and push them into a vector. Because of the copying involved in a vector, I would use a shared_ptr. Sort it later.
2) Use a map/set - Creating and inserting elements may be costly if you are looking for fast inserts. Lookup may be faster.
3) Hash map - Insertion and lookup times are typically better than with a map/set.
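As a rough illustration of the vector option, here is a minimal sketch that keeps every cell as text and looks columns up by name. The ResultSet type and its member names are hypothetical, and storing all values as std::string is an assumption made to keep the example short.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for a query result: column names plus rows of text cells.
struct ResultSet {
    std::vector<std::string> columns;            // column names, in query order
    std::vector<std::vector<std::string>> rows;  // one inner vector per row

    // Append a row; it must have exactly one cell per column.
    void add_row(std::vector<std::string> row) {
        if (row.size() != columns.size())
            throw std::invalid_argument("row width does not match column count");
        rows.push_back(std::move(row));
    }

    // Return the position of a column name, or throw if it is unknown.
    std::size_t column_index(const std::string& name) const {
        for (std::size_t i = 0; i < columns.size(); ++i)
            if (columns[i] == name) return i;
        throw std::out_of_range("unknown column: " + name);
    }

    // Look up one cell by row number and column name.
    const std::string& at(std::size_t row, const std::string& name) const {
        return rows.at(row)[column_index(name)];
    }
};
```

Filling it is a matter of setting columns once and calling add_row for each fetched row. If lookups by column name become frequent, the linear search in column_index could be replaced by a map from name to position.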
I'm making a hash table that's supposed to give pretty fast lookup time for some values I type before hand. I didn't know how to go about it but my friend said I should make a text file and just have an unordered map that reads from the text file and puts the values in the code before I run it. Is this efficient? Is there a better way to do this?
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?
As said in comments, your idea is good enough unless these structures are really large, megabytes.
If you have reasons to worry about the performance of that, e.g. if you want to support millions of records or very large values, more complicated approaches can be more efficient.
When I only need 64-bit support, I sometimes make a single binary file optimized for memory-mapping in its entirety. Specifically: a fixed-size header, then sorted arrays of (key, offset) tuples serving as a primary index (binary search works there; the OS fetches only the required pages from mapped files and caches them in RAM quite aggressively), then the values at the offsets specified in the index.
Use std::map when
You need ordered data.
You would have to print/access the data (in sorted order).
You need predecessor/successor of elements.
Use std::unordered_map when
You need to keep count of some data (for example, strings) and no ordering is required.
You need single element access i.e. no traversal.
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?
Definitely you can, but note that a map cannot read a file by itself; std::fstream is there for that purpose. You read each record from the stream and then insert it into the map.
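To make that concrete, here is a minimal sketch of loading structures from text into a std::unordered_map. The Item struct, the load_items function, and the whitespace-separated "key name quantity price" line format are all assumptions for illustration; the real file layout is up to you.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical record type; the real structure would have its own fields.
struct Item {
    std::string name;
    int quantity = 0;
    double price = 0.0;
};

// Reads lines of the assumed form "key name quantity price" from any input
// stream (for a file, pass a std::ifstream) and stores them by key.
std::unordered_map<std::string, Item> load_items(std::istream& in) {
    std::unordered_map<std::string, Item> items;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key;
        Item item;
        // Skip blank or malformed lines instead of storing partial records.
        if (fields >> key >> item.name >> item.quantity >> item.price)
            items[key] = item;
    }
    return items;
}
```

Taking a std::istream& rather than a file name keeps the function easy to try out with a std::istringstream before pointing it at a real file.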
I have over 100 text (.txt) files. The data in these files is comma-separated, for example:
good,bad,fine
The aforementioned items are just an example, the data can contain anything from individual words and email-IDs to phone numbers. I have to copy all of these values into a single column of one or more Excel (.xlsx) spreadsheet files.
Now the problem is that I don't want my data to be redundant and the data items should be over 10 million. I'm planning to implement this design in C++. I prefer an algorithm which is efficient so that I can complete my work fast.
Separate your task into 2 steps:
a. getting the list of items in the memory - removing duplicates.
b. putting them into excel.
a: Use a sorted linked tree to gather all items, that's how you will find the duplicate fast.
b: Once done with the list - I would write everything to a simple file and import it to excel, rather than try to do it with c++ against excel API.
After your comment: if a memory problem arises, you might want to create a tree per first letter and use files to store each "list", so you will not run out of memory.
It is less efficient, but with today's computing power you won't feel it.
The main idea here is to find out quickly whether you already have this word or not, and if not, add it to the "list". Searching a sorted tree should do the trick. If you want to avoid the worst-case scenario there is the AVL tree, if I recall correctly; this tree stays balanced no matter the order of the inserts, yet it is harder to code.
The simplest way would be to load all the values (as std::strings) into a std::set. The interface is dead simple. It won't be the fastest way, but it may well be fast enough.
In particular, unordered_set might be faster if you have it available, but it is harder to use well because the large number of strings will cause repeated rehashing if you don't prepare properly (for example, by calling reserve up front).
As you read a string, insert it into the set: if(my_set.insert("the string").second) .... That condition will be true for new values only.
10 million is a lot but if each string is around 20 bytes or so, you may get away with a simplistic approach like this.
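Here is a small sketch of that insert-and-check pattern applied to comma-separated input. The DedupResult struct and the collect_unique function are hypothetical names, and reading from a generic std::istream is an assumption so the same code works for each of the text files.

```cpp
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>

struct DedupResult {
    std::set<std::string> values;   // each distinct value once, in sorted order
    std::size_t duplicates = 0;     // how many repeated values were skipped
};

// Splits every line of `in` on commas and collects the distinct values.
DedupResult collect_unique(std::istream& in) {
    DedupResult result;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream cells(line);
        std::string value;
        while (std::getline(cells, value, ',')) {
            // insert() returns a pair; .second is true only for new values.
            if (!result.values.insert(value).second)
                ++result.duplicates;
        }
    }
    return result;
}
```

To merge many files, the function could be changed to take the set by reference so that each file adds to the same collection.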
So I have a 1GB file in a CSV format like so, that I converted to a SQLite3 database
column1;column2;column3
1212;abcd;20090909
1543;efgh;20120120
Except that I have 12 columns. Now, I need to read and sort this data and reformat it for output, but when I try to do this it seems I run out of RAM (using vectors). I read it in from SQLite and store each line of the file in a struct which is then pushed back to a deque. Like I said, I run out of memory when the RAM usage approaches 2gb, and the app crashes. I tried using STXXL but apparently it does not support vectors of non-POD types (so it has to be long int, double, char etc), and my vector consists mostly of std::string's, some boost::date's and one double value.
Basically, what I need to do is group together all "rows" that have the same value in a specific column. In other words, I need to sort the data based on one column and then work with that.
Any approach as to how I can read in everything or at least sort it? I would do it with SQLite3 but that seems time consuming. Perhaps I'm wrong.
Thanks.
In order of desirability:
don't use C++ at all, just use sort if possible
if you're wedded to using a DB to process a not-very-large csv file in what sounds like a not-really-relational way, shift all the heavy lifting into the DB and let it worry about memory management.
if you must do it in C++:
skip the SQLite3 step entirely since you're not using it for anything. Just map the csv file into memory, and build a vector of row pointers. Sort this without moving the data around
if you must parse the rows into structures:
don't store the string columns as std::string - this requires an extra non-contiguous allocation, which will waste memory. Prefer an inline char array if the length is bounded
choose the smallest integer size that will fit your values (eg, uint16_t would fit your sample first column values)
be careful about padding: check the sizeof your struct, and reorder members or pack it if it's much larger than expected
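The last few points could look roughly like the sketch below. The Row layout, the field widths, and the helper names are assumptions based on the three sample columns plus the one double mentioned in the question; the real file has 12 columns.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical compact row: field widths are assumptions based on the sample data.
struct Row {
    double        value;    // widest member first, so no padding precedes it
    std::uint32_t date;     // e.g. 20090909 stored as a number
    std::uint16_t id;       // fits the sample values 1212 and 1543
    char          code[6];  // inline text, at most 5 characters plus '\0'
};

// Fails the build if padding makes the struct larger than expected.
static_assert(sizeof(Row) <= 24, "Row grew unexpectedly; check member order and padding");

Row make_row(std::uint16_t id, const char* code, std::uint32_t date, double value) {
    Row r{};  // zero-initialised, so `code` stays null-terminated below
    r.value = value;
    r.date = date;
    r.id = id;
    std::strncpy(r.code, code, sizeof(r.code) - 1);  // longer input is truncated
    return r;
}

// Sorts pointers to the rows by the `code` column; the rows themselves do not move.
std::vector<const Row*> sorted_by_code(const std::vector<Row>& rows) {
    std::vector<const Row*> order;
    order.reserve(rows.size());
    for (const Row& r : rows) order.push_back(&r);
    std::stable_sort(order.begin(), order.end(),
                     [](const Row* a, const Row* b) {
                         return std::strcmp(a->code, b->code) < 0;
                     });
    return order;
}
```

The returned pointers are only valid while the original vector is alive and not resized.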
If you want to stick with the SQLite3 approach, I recommend using a list instead of a vector so your operating system doesn't need to find 1GB or more of contiguous memory.
If you can skip the SQLite3 step, here is how I would solve the problem:
Write a class (e.g. MyRow) which has a field for every column in your data set.
Read the file into a std::list<MyRow> where every row in your data set becomes an instance of MyRow
Write a predicate which compares the desired column
Use the sort function of the std::list to sort your data.
I hope this helps you.
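The four steps above can be sketched as follows. Only three columns are shown, the field types are guesses from the sample data, and read_rows and by_column2 are hypothetical names.

```cpp
#include <istream>
#include <list>
#include <sstream>
#include <string>

// Hypothetical row type with one field per column (three shown for brevity).
struct MyRow {
    int column1 = 0;
    std::string column2;
    std::string column3;
};

// Parses lines of the form "1212;abcd;20090909" into a list of rows.
// Assumes the header line has already been consumed by the caller and that
// the first column is numeric (std::stoi throws otherwise).
std::list<MyRow> read_rows(std::istream& in) {
    std::list<MyRow> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream cells(line);
        std::string first;
        MyRow row;
        if (std::getline(cells, first, ';') &&
            std::getline(cells, row.column2, ';') &&
            std::getline(cells, row.column3, ';')) {
            row.column1 = std::stoi(first);
            rows.push_back(row);
        }
    }
    return rows;
}

// Predicate comparing the column to sort by (column2 here).
bool by_column2(const MyRow& a, const MyRow& b) {
    return a.column2 < b.column2;
}
```

After reading, rows.sort(by_column2) orders the list in place; std::list::sort relinks nodes instead of copying the rows.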
There is significant overhead for std::string. If your struct contains a std::string for each column, you will waste a lot of space on char * pointers, malloc headers, etc.
Try parsing all the numerical fields immediately when reading the file, and storing them in your struct as ints or whatever you need.
If your file actually contains a lot of numeric fields like your example shows, I would expect it to use less than the file size worth of memory after parsing.
Create a structure for your records.
The record should have "ordering" functions for the fields you need to sort by.
Read the file as objects and store into a container that has random-access capability, such as std::vector or std::array.
For each field you want to sort by:
Create an index table, std::map, using the field value as the key and the record's index as the value.
To process the fields in order, choose your index table and iterate through the index table. Use the value field (a.k.a. index) to fetch the object from the container of objects.
If the records are of fixed length or can be converted to a fixed length, you could write the objects in binary to a file and position the file to different records. Use an index table, like above, except use file positions instead of indices.
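A minimal in-memory sketch of the index-table idea follows. The Record type and function names are hypothetical, and std::multimap is used instead of std::map (a deliberate change from the text) so that records sharing a key are all kept.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record with one sortable field.
struct Record {
    int id;
    std::string city;
};

// Builds an index from city to the record's position in the vector.
std::multimap<std::string, std::size_t> index_by_city(const std::vector<Record>& records) {
    std::multimap<std::string, std::size_t> index;
    for (std::size_t i = 0; i < records.size(); ++i)
        index.emplace(records[i].city, i);
    return index;
}

// Visits the records in key order through the index and returns their ids.
std::vector<int> ids_in_city_order(const std::vector<Record>& records) {
    std::vector<int> ids;
    for (const auto& entry : index_by_city(records))
        ids.push_back(records[entry.second].id);  // entry.second is the stored position
    return ids;
}
```

The records themselves never move; each additional sort order only costs one more index of keys and positions.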
Thanks for your answers, but I figured out a very fast and simple approach.
I let SQLite3 do the job for me by giving it this query:
SELECT * FROM my_table ORDER BY key_column ASC
For an 800MB file, that took about 70 seconds to process, and then I received all the data in my C++ program, already ordered by the column I wanted it grouped by. I processed the rows one group at a time and output each group in my desired format, keeping my RAM free from overload. The total time for the operation was about 200 seconds, which I'm pretty happy with.
Thank you for your time.
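For readers who want to reproduce the group-at-a-time processing, here is a rough sketch of the consuming side only. It assumes the rows already arrive sorted by the key column; how they are fetched from SQLite is left out, and GroupStreamer and KeyedRow are hypothetical names.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

using KeyedRow = std::pair<std::string, std::string>;  // (group key, rest of the row)

// Consumes rows that are already sorted by key, one at a time, and hands each
// finished group to `emit`, so only one group is held in memory at once.
class GroupStreamer {
public:
    using Emit = std::function<void(const std::string&, const std::vector<std::string>&)>;

    explicit GroupStreamer(Emit emit) : emit_(std::move(emit)) {}

    // Feed the next row; a change of key closes the previous group.
    void push(const KeyedRow& row) {
        if (!group_.empty() && row.first != key_) flush();
        key_ = row.first;
        group_.push_back(row.second);
    }

    // Call once after the last row to emit the final group.
    void flush() {
        if (group_.empty()) return;
        emit_(key_, group_);
        group_.clear();
    }

private:
    Emit emit_;
    std::string key_;
    std::vector<std::string> group_;
};
```

The callback passed to the constructor is where a group would be formatted and written out; memory use is then bounded by the largest single group rather than by the whole table.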
Our task is to read information about table schema from a file, implement that table in c/c++ and then successfully run some "select" queries on it. The table schema file may have contents like this,
Tablename- Student
"ID","int(11)","NO","PRIMARY","0","".
Now, my question is what data structures would be appropriate for the task. The problem is that I do not know the number of columns a table might have, what the names of those columns might be, or what their data types are. For example, one table might have just one column of type int, while another might have 15 columns of varying data types. In fact, I don't even know how many tables the schema file might describe.
One way I thought of was to have a set number of say, 20 vectors (assuming that the upper limit of the columns in a table is 20), name those vectors 1stvector, 2ndvector and so on, map the name of the columns to the vectors, and then use them accordingly. But it seems the code for it would be a mess with all those if/else statements or switch case statements (for the mapping).
While googling/stack-overflowing, I learned that you can't describe a class at runtime otherwise the problem might have been easier to solve.
Any help is appreciated.
Thanks.
As a C++ data structure, you could try a std::vector< std::vector<boost::any> >. A vector is part of the Standard Library and allows dynamic resizing of the number of elements. A vector of vectors would imply an arbitrary number of rows with an arbitrary number of columns. Boost.Any is not part of the Standard Library but is widely available and allows storing arbitrary types.
I am not aware of any good C++ library to do SQL queries on that data structure. You might need to write your own. E.g. the SQL commands select and where would correspond to the STL algorithm std::find_if with an appropriate predicate passed as a function object.
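A minimal sketch of that correspondence is below. It uses std::any from C++17 in place of boost::any, which is a substitution on my part and assumes a C++17 compiler; the Table and Row aliases and the find_int function are hypothetical.

```cpp
#include <algorithm>
#include <any>
#include <cstddef>
#include <string>
#include <vector>

using Row = std::vector<std::any>;
using Table = std::vector<Row>;

// Rough equivalent of: SELECT * FROM table WHERE <int column> = wanted LIMIT 1.
// Returns table.end() when no row matches.
Table::const_iterator find_int(const Table& table, std::size_t column, int wanted) {
    return std::find_if(table.begin(), table.end(), [&](const Row& row) {
        // any_cast on a pointer yields nullptr if the cell holds another type.
        const int* cell = std::any_cast<int>(&row.at(column));
        return cell != nullptr && *cell == wanted;
    });
}
```

To return every matching row rather than the first, std::copy_if with the same predicate would be the closer analogue of a full select.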
To deal with the lack of knowledge about the data column types, you almost have to store the raw input (i.e. strings, which suggests std::string) and coerce the interpretation as needed later on.
This also has the advantage that the column names can be stored in the same type.
If you really want to determine the column type, you'll need to speculatively parse each column of input to see what it could be and make decisions on that basis.
Either way, if the input could contain a column that has the column separation symbol in it (say, a string including a space in otherwise whitespace-separated data), you will have to know the quoting convention of the input and write a parser of some kind to work on the data (sucking whole lines in with getline is your friend here). Your input appears to be comma-separated with double-quote-delimited strings.
I suggest using std::vector to hold all the table creation statements. After all the creation statements are read in, you can construct your table.
The problem to overcome is the plethora of column types. All the C++ containers like to have a uniform type, such as std::vector<std::string>. You will have different column types.
One solution is to have your data types descend from a single base. That would allow you to have std::vector<Base *> for each row of the table, where the pointers can point to fields of different {child} types.
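A rough sketch of that single-base idea follows. The Field, IntField, and TextField names are hypothetical, only two column types are handled, and std::unique_ptr stands in for the raw Base * so the fields are freed automatically (this assumes C++14 for std::make_unique).

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Common base for all column value types.
struct Field {
    virtual ~Field() = default;
    virtual std::string to_string() const = 0;
};

struct IntField : Field {
    int value;
    explicit IntField(int v) : value(v) {}
    std::string to_string() const override { return std::to_string(value); }
};

struct TextField : Field {
    std::string value;
    explicit TextField(std::string v) : value(std::move(v)) {}
    std::string to_string() const override { return value; }
};

using Row = std::vector<std::unique_ptr<Field>>;

// Builds a row from raw text cells using per-column type names from the schema.
// Only types starting with "int" are parsed; everything else is kept as text.
Row make_row(const std::vector<std::string>& types, const std::vector<std::string>& cells) {
    Row row;
    for (std::size_t i = 0; i < cells.size() && i < types.size(); ++i) {
        if (types[i].rfind("int", 0) == 0)  // matches "int" and "int(11)"
            row.push_back(std::make_unique<IntField>(std::stoi(cells[i])));
        else
            row.push_back(std::make_unique<TextField>(cells[i]));
    }
    return row;
}
```

Code that only prints or compares values can work through the Field interface; code that needs the typed value can use dynamic_cast to recover the concrete field.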
I'll leave the rest up to the OP to figure out.
I have a generic table class implemented in C++ that uses a shared_ptr< ptr_vector< vector<T> > > as its backing, where T is an arbitrary typename; the ptr_vector contains pointers to the vectors corresponding to the columns in the table. I decided to wrap the ptr_vector in a shared_ptr since the tables may contain many millions of rows, and the vectors containing data for each column in a ptr_vector for the same reason. (Please tell me if this can be improved.)
Implementing column-wise operations on this table is trivial, since I have access to the native iterator supplied by the vector. However, I also need the table to support row-wise operations: relatively mundane operations such as adding and removing rows should be supported, as well as the ability to use the STL algorithms with the table. Now, I have run across some design issues that I need some help to address:
It seems that implementing a custom iterator to conduct row-wise operations is necessary to accomplish what is described above. Would boost::iterator_adaptor be the right way to go about doing this?
When the user adds rows to the table, I do not wish to impose a specific data structure upon the user -- how would I go about doing this? I am thinking of accepting iterators as parameters to the add_row() method.
If you think that I should be implementing this table structure differently, I am open to any suggestions that you may have for me. It was originally designed with the intent to store strings read from tab-delimited files containing hundreds of thousands of row entries.
Thank you very much for your help!
The Boost library has a container called multi_array, which provides an n-dimensional dynamic array that can be accessed and manipulated along each dimension. This seems to be very similar to what you are trying to build; perhaps you could use it instead?