So I have a 1 GB CSV file, which I converted to a SQLite3 database. The format looks like this:
column1;column2;column3
1212;abcd;20090909
1543;efgh;20120120
Except that I have 12 columns. Now I need to read, sort, and reformat this data for output, but when I try to do this I seem to run out of RAM (using vectors). I read it in from SQLite and store each line of the file in a struct, which is then pushed back onto a deque. Like I said, I run out of memory when RAM usage approaches 2 GB and the app crashes. I tried using STXXL, but apparently it does not support vectors of non-POD types (so elements have to be long int, double, char, etc.), and my records consist mostly of std::strings, some boost::dates and one double value.
Basically what I need to do is group together all "rows" that have the same value in a specific column; in other words, I need to sort the data based on one column and then work with it.
Any suggestions as to how I can read in everything, or at least sort it? I would do it with SQLite3, but that seems time consuming. Perhaps I'm wrong.
Thanks.
In order of desirability:
don't use C++ at all: just use the command-line sort utility if possible
if you're wedded to using a DB to process a not-very-large csv file in what sounds like a not-really-relational way, shift all the heavy lifting into the DB and let it worry about memory management.
if you must do it in C++:
skip the SQLite3 step entirely, since you're not using it for anything. Just map the csv file into memory and build a vector of row pointers, then sort that without moving the data around (see the sketch after this list)
if you must parse the rows into structures:
don't store the string columns as std::string - this requires an extra non-contiguous allocation, which will waste memory. Prefer an inline char array if the length is bounded
choose the smallest integer size that will fit your values (eg, uint16_t would fit your sample first column values)
be careful about padding: check the sizeof your struct, and reorder members or pack it if it's much larger than expected
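For the memory-mapping suggestion above, here is a minimal sketch (POSIX mmap, sorting on the first ';'-separated field; the file name, and the assumption that every row ends in a newline, are mine and not from the question):

#include <algorithm>
#include <cstring>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("data.csv", O_RDONLY);               // hypothetical file name
    struct stat st;
    fstat(fd, &st);
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // Collect a pointer to the start of every row; the data itself never moves.
    // (Skipping the header row is omitted for brevity.)
    std::vector<const char*> rows;
    for (const char* p = data; p < data + st.st_size; ) {
        rows.push_back(p);
        p = static_cast<const char*>(std::memchr(p, '\n', data + st.st_size - p)) + 1;
    }

    // Sort the row pointers by the first field, comparing the bytes in place.
    auto field_end = [](const char* p) {
        while (*p != ';' && *p != '\n') ++p;
        return p;
    };
    std::sort(rows.begin(), rows.end(), [&](const char* a, const char* b) {
        return std::lexicographical_compare(a, field_end(a), b, field_end(b));
    });

    // ... walk rows in sorted order and write the output ...
    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}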
If you want to stick with the SQLite3 approach, I recommend using a list instead of a vector so your operating system doesn't need to find 1 GB or more of contiguous memory.
If you can skip the SQLite3 step, here is how I would solve the problem:
Write a class (e.g. MyRow) which has a field for every column in your data set.
Read the file into a std::list<MyRow> where every row in your data set becomes an instance of MyRow
Write a predicate which compares the desired column
Use the sort function of the std::list to sort your data.
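A minimal sketch of those steps, assuming a small MyRow with just three of the twelve columns and a ';'-separated file with a header line (all names here are placeholders):

#include <fstream>
#include <list>
#include <sstream>
#include <string>

struct MyRow {
    int         column1;
    std::string column2;
    std::string column3;   // e.g. the date column, kept as text for brevity
};

int main() {
    std::list<MyRow> rows;
    std::ifstream in("data.csv");
    std::string line;
    std::getline(in, line);                 // skip the header row
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        MyRow r;
        std::string field;
        std::getline(ss, field, ';');  r.column1 = std::stoi(field);
        std::getline(ss, r.column2, ';');
        std::getline(ss, r.column3, ';');
        rows.push_back(std::move(r));
    }

    // Predicate comparing the desired column, then std::list::sort.
    rows.sort([](const MyRow& a, const MyRow& b) { return a.column2 < b.column2; });
}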
I hope this helps you.
There is significant overhead for std::string. If your struct contains a std::string for each column, you will waste a lot of space on char * pointers, malloc headers, etc.
Try parsing all the numerical fields immediately when reading the file, and storing them in your struct as ints or whatever you need.
If your file actually contains a lot of numeric fields like your example shows, I would expect it to use less than the file size worth of memory after parsing.
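As a rough illustration (the field widths below are guesses based on the sample rows, not requirements):

#include <cstdint>
#include <iostream>

struct Record {
    uint32_t column1;     // parsed from text: 4 bytes instead of a whole std::string
    char     column2[8];  // inline, fixed-size text field (assumes a bounded length)
    uint32_t column3;     // date stored as YYYYMMDD, e.g. 20090909
};

int main() {
    // Check the real footprint: padding can make this larger than the sum of the members.
    std::cout << sizeof(Record) << '\n';
}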
Create a structure for your records.
The record should have "ordering" functions for the fields you need to sort by.
Read the file as objects and store them in a container that has random-access capability, such as std::vector or std::deque.
For each field you want to sort by:
Create an index table, std::map, using the field value as the key and the record's index as the value.
To process the fields in order, choose your index table and iterate through the index table. Use the value field (a.k.a. index) to fetch the object from the container of objects.
If the records are of fixed length or can be converted to a fixed length, you could write the objects in binary to a file and position the file to different records. Use an index table, like above, except use file positions instead of indices.
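A minimal sketch of the in-memory index table (the field names are placeholders, and a std::multimap is used in case key values repeat):

#include <map>
#include <string>
#include <vector>

struct Record {
    int         id;
    std::string name;
};

int main() {
    std::vector<Record> records = { {2, "b"}, {1, "a"}, {3, "c"} };

    // Index table: field value -> position of the record in the container.
    std::multimap<std::string, std::size_t> byName;
    for (std::size_t i = 0; i < records.size(); ++i)
        byName.emplace(records[i].name, i);

    // Process the records in name order without moving them.
    for (const auto& [name, idx] : byName) {
        const Record& r = records[idx];
        // ... work with r ...
        (void)r;
    }
}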
Thanks for your answers, but I figured out a very fast and simple approach.
I let SQLite3 do the job for me by giving it this query:
SELECT * FROM my_table ORDER BY key_column ASC
For an 800 MB file, that took about 70 seconds to process, and then I received all the data in my C++ program, already ordered by the column I wanted it grouped by. I processed the data one group at a time, and output each group in my desired format, keeping my RAM free from overload. Total time for the operation was about 200 seconds, which I'm pretty happy with.
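For reference, a rough sketch of how such an ordered result set can be stepped through with the SQLite3 C API and flushed one group at a time (the database file name is a placeholder, and column 0 is assumed to be key_column):

#include <sqlite3.h>
#include <string>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("data.db", &db);                       // hypothetical database file

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT * FROM my_table ORDER BY key_column ASC",
                       -1, &stmt, nullptr);

    std::string currentKey;
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        // Column 0 is assumed to be key_column here.
        std::string key = reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));
        if (key != currentKey) {
            // ... flush the previous group to the output ...
            currentKey = key;
        }
        // ... accumulate this row into the current group ...
    }
    // ... flush the last group ...

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}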
Thank you for your time.
Related
I'm making a hash table that's supposed to give pretty fast lookup times for some values I type beforehand. I didn't know how to go about it, but my friend said I should make a text file and just have an unordered map that reads from the text file and puts the values in the code before I run it. Is this efficient? Is there a better way to do this?
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?
As said in the comments, your idea is good enough unless these structures are really large (megabytes).
If you have reasons to worry about the performance of that, e.g. if you want to support millions of records or very large values, more complicated approaches can be more efficient.
When I only need 64-bit support, I sometimes make a single binary file optimized for memory-mapping the whole thing. Specifically: a fixed-size header, then a sorted array of (key, offset) tuples serving as the primary index (you can binary-search there; the OS only fetches the required pages of mapped files and caches them in RAM quite aggressively), then the values at the offsets given in the index.
Use std::map when:
You need ordered data.
You have to print/access the data in sorted order.
You need the predecessor/successor of elements.
Use std::unordered_map when:
You need to keep count of some data (for example, strings) and no ordering is required.
You need single-element access, i.e. no traversal.
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?
Definitely you can, but I hope you know that you cannot read a file with a map; fstream is there for that purpose.
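A minimal sketch of the approach being discussed, assuming one "key value value" record per line (the struct fields and file name are placeholders):

#include <fstream>
#include <string>
#include <unordered_map>

struct Value {
    int    a;
    double b;
};

int main() {
    std::unordered_map<std::string, Value> table;

    std::ifstream in("values.txt");        // hypothetical file: "key a b" per line
    std::string key;
    Value v;
    while (in >> key >> v.a >> v.b)
        table[key] = v;

    // Lookups are now average O(1):
    auto it = table.find("some_key");
    if (it != table.end()) { /* use it->second */ }
}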
I'm trying to store some data in HDF5 using the C++ API, with several requirements:
an arbitrary number of entries can be stored,
each entry has the same number of rows (of types int and double),
the number and type of the rows should be determined at runtime.
I think the right way to implement this is as a packet table, but the example I've been able to find stores only one native type per entry. I'd like to store several, similar to a compound datatype but again the example I found isn't sufficient because it stores a struct, which can't be written at runtime. Are there some examples where this is done? Or just some high-level API that I missed?
Are you still looking for an answer? I cannot make out what you are after exactly.
A packet table is a special form of dataset. The number of records in a packet table can be unbounded.
Since you mention that fixing the size of a struct at compile time for a compound datatype does not work for you, you might try separating your data and correlating it in some way.
Arrays in isolation can be written to a dataset with their rank and size set at run time.
Data within an HDF file can be linked, either with your own method, or using HDF links. You can link the individual array data records together, along with the matching compound data, if any.
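To illustrate the point about runtime-sized arrays, here is a rough sketch using the HDF5 C++ API; the file and dataset names are placeholders, and only a single double column is shown:

#include <H5Cpp.h>
#include <vector>

int main() {
    H5::H5File file("entries.h5", H5F_ACC_TRUNC);

    // Rank and dimensions are ordinary variables, decided at run time.
    int rank = 1;
    hsize_t dims[1] = {100};
    std::vector<double> values(dims[0], 0.0);

    H5::DataSpace space(rank, dims);
    H5::DataSet dataset =
        file.createDataSet("doubles_0", H5::PredType::NATIVE_DOUBLE, space);
    dataset.write(values.data(), H5::PredType::NATIVE_DOUBLE);

    // Separate int columns could be written the same way and then linked or
    // correlated afterwards, as described above.
}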
Hope that helps.
Our task is to read information about a table schema from a file, implement that table in C/C++, and then successfully run some "select" queries on it. The table schema file may have contents like this:
Tablename- Student
"ID","int(11)","NO","PRIMARY","0","".
Now, my question is what data structures would be appropriate for the task. The problem is that I do not know the number of columns a table might have, nor what the names of those columns are, nor their data types. For example, a table might have just one column of type int, while another might have 15 columns of varying data types. In fact, I don't even know the number of tables whose descriptions the schema file might contain.
One way I thought of was to have a set number of say, 20 vectors (assuming that the upper limit of the columns in a table is 20), name those vectors 1stvector, 2ndvector and so on, map the name of the columns to the vectors, and then use them accordingly. But it seems the code for it would be a mess with all those if/else statements or switch case statements (for the mapping).
While googling/stack-overflowing, I learned that you can't define a class at runtime, otherwise the problem might have been easier to solve.
Any help is appreciated.
Thanks.
As a C++ data structure, you could try a std::vector< std::vector<boost::any> >. A vector is part of the Standard Library and allows dynamic resizing of the number of elements. A vector of vectors would imply an arbitrary number of rows with an arbitrary number of columns. Boost.Any is not part of the Standard Library but is widely available and allows storing arbitrary types.
I am not aware of any good C++ library to do SQL queries on that data structure. You might need to write your own. E.g. the SQL commands select and where would correspond to the STL algorithm std::find_if with an appropriate predicate passed as a function object.
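A minimal sketch of that data structure (the column layout is an assumption):

#include <boost/any.hpp>
#include <string>
#include <vector>

int main() {
    std::vector<std::vector<boost::any>> table;

    // One row whose columns hold different types.
    std::vector<boost::any> row;
    row.push_back(11);                        // e.g. an int(11) ID column
    row.push_back(std::string("Student"));    // a text column
    table.push_back(row);

    // Reading a value back requires knowing (or trying) the stored type.
    int id = boost::any_cast<int>(table[0][0]);
    (void)id;
}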
To deal with the lack of knowledge about the data column types, you almost have to store the raw input (i.e. strings, which suggests std::string) and coerce the interpretation as needed later on.
This also has the advantage that the column names can be stored in the same type.
If you really want to determine the column type, you'll need to speculatively parse each column of input to see what it could be and make decisions on that basis.
Either way, if the input could contain a column that has the column-separation symbol in it (say, a string including a space in otherwise whitespace-separated data), you will have to know the quoting convention of the input and write a parser of some kind to work on the data (sucking whole lines in with getline is your friend here). Your input appears to be comma separated with double-quote-delimited strings.
I suggest using std::vector to hold all the table creation statements. After all the creation statements are read in, you can construct your table.
The problem to overcome is the plethora of column types. All the C++ containers like to have a uniform type, such as std::vector<std::string>. You will have different column types.
One solution is to have your data types descend from a single base. That would allow you to have a std::vector<Base *> for each row of the table, where the pointers can point to fields of different (child) types.
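A minimal sketch of that idea, using std::unique_ptr instead of raw Base * pointers (the field classes are illustrative only):

#include <memory>
#include <string>
#include <vector>

struct Field {
    virtual ~Field() = default;
    virtual std::string toString() const = 0;
};

struct IntField : Field {
    int value;
    explicit IntField(int v) : value(v) {}
    std::string toString() const override { return std::to_string(value); }
};

struct TextField : Field {
    std::string value;
    explicit TextField(std::string v) : value(std::move(v)) {}
    std::string toString() const override { return value; }
};

// One row: columns of different (child) types behind a common base.
using Row = std::vector<std::unique_ptr<Field>>;

int main() {
    Row row;
    row.push_back(std::make_unique<IntField>(11));
    row.push_back(std::make_unique<TextField>("ID"));
}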
I'll leave the rest up to the OP to figure out.
Can you give me an idea of the best way to store data coming from a database dynamically? I know the number of columns beforehand, so I want to create a data structure dynamically that will hold all the data, and I need to reorganize the data to show the output. In one word: when we run the query "select * from table", results come back; how do I store those results dynamically (using structures, maps, lists ...)? Thanks in advance.
In a nutshell, the data structures you use to store the data really depend on your data usage patterns. That is:
Is your need for the data simply to output it? If so, why store the data at all?
If not, will you be performing searches on the data?
Is ordering important?
Will you be performing computations with the data?
How much data will you need to hold?
etc...
An array of strings (StringList in Delphi, not sure what you have in C++), one per row, where each row is a comma-separated string. This can be easily dumped and read into Excel as a .csv file, imported into lots of databases.
Or, an XML document may be best. Or something else. "it depends...."
There are quite a number of options for you, preferably from the STL. Use a class to store one row as an object, or keep each row in a string if rows are quite huge, you don't want to create objects, and you don't need to access all the rows returned.
1) Use a vector - Use smart pointers (shared_ptr) to create objects of the class and push them into a vector. Because of the copying involved in a vector, I would use a shared_ptr. Sort it later.
2) Use a map/set - Creating and inserting elements may be costly if you are looking for fast inserts; lookup may be faster.
3) Hash map - Insertion and lookup times are better than for a map/set.
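A minimal sketch of option 1 (the Row fields are placeholders):

#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Row {
    int         id;
    std::string name;
};

int main() {
    std::vector<std::shared_ptr<Row>> rows;

    // For each result row from the query, create an object and push it back;
    // only the pointer is copied when the vector grows.
    rows.push_back(std::make_shared<Row>(Row{2, "b"}));
    rows.push_back(std::make_shared<Row>(Row{1, "a"}));

    // Sort later, by whatever column the output needs.
    std::sort(rows.begin(), rows.end(),
              [](const auto& a, const auto& b) { return a->id < b->id; });
}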
I would like to make "compressed array"/"compressed vector" class (details below), that allows random data access with more or less constant time.
"more or less constant time" means that although element access time isn't constant, it shouldn't keep increasing when I get closer to certain point of the array. I.e. container shouldn't do significantly more calculations (like "decompress everything once again to get last element", and "do almost nothing to get the first") to get one element. Can be probably achieved by splitting array into chunks of compressed data. I.e. accessing one element should take "averageTime" +- some deviation. I could say that I want best-case access time and worst-case access time to be relatively close to average access time.
What are my options (suitable algorithms/already available containers - if there are any)?
Container details:
Container acts as a linear array of identical elements (such as std::vector)
Once container is initialized, data is constant and never changes. Container needs to provide read-only access.
Container should behave like array/std::vector - i.e. values accessed via operator[], there is .size(), etc.
It would be nice if I could make it as template class.
Access to data should be more or less constant-time. I don't need same access time for every element, but I shouldn't have to decompress everything to get last element.
Usage example:
Binary search on data.
Data details:
1. Data is structs mostly consisting of floats and a few ints. There are more floats than ints. No strings.
2. It is unlikely that there are many identical elements in the array, so simply indexing the data won't be possible.
3. Size of one element is less than 100 bytes.
4. Total data size per container is between a few kilobytes and a few megabytes.
5. Data is not sparse - it is continuous block of elements, all of them are assigned, there are no "empty slots".
The goal of compression is to reduce the amount of RAM the block takes compared to the uncompressed representation as an array, while keeping somewhat reasonable read-access performance and allowing random access to elements as in an array. I.e. data should be stored in compressed form internally, and I should be able to access it (read-only) as if it were a std::vector or similar container.
Ideas/Opinions?
I take it that you want an array whose elements are not stored vanilla, but compressed, to minimize memory usage.
Concerning compression, you have no exceptional insight into the structure of your data, so some kind of standard entropy encoding will do. Ideally, you would like to run GZIP on your whole array and be done with it, but that would lose O(1) access, which is crucial to you.
A solution is to use Huffman coding together with an index table.
Huffman coding works by replacing each input symbol (for instance, an ASCII byte) with another symbol of variable bit length, depending on its frequency of occurrence in the whole stream. For instance, the character E appears very often, so it gets a short bit sequence, while 'W' is rare and gets a long bit sequence.
E -> 0b10
W -> 0b11110
Now, compress your whole array with this method. Unfortunately, since the output symbols have variable length, you can no longer index your data as before: item number 15 is no longer at stream[15*sizeof(item)].
Fortunately, this problem can be solved by using an additional index table that stores where each item starts in the compressed stream. In other words, the compressed data for item 15 can be found at stream[index[15]]; the index table accumulates the variable output lengths.
So, to get item 15, you simply start decompressing the bytes at stream[index[15]]. This works because Huffman coding doesn't do anything fancy to the output; it just concatenates the new code words, and you can start decoding inside the stream without having to decode all previous items.
Of course, the index table adds some overhead; you may want to tweak the granularity so that compressed data + index table is still smaller than original data.
Are you coding for an embedded system and/or do you have hundreds or thousands of these containers? If not, while I think this is an interesting theoretical question (+1), I suspect that the slowdown resulting from the decompression will be non-trivial and that it would be better to just use a std::vector.
Next, are you sure that the data you're storing is sufficiently redundant that smaller blocks of it will actually be compressible? Have you tried saving off blocks of different sizes (powers of 2 perhaps) and tried running them through gzip as an exercise? It may be that any extra data needed to help the decompression algorithm (depending on approach) would reduce the space benefits of doing this sort of compressed container.
If you decide that it's still reasonable to do the compression, then there are at least a couple possibilities, none pre-written though. You could compress each individual element, storing a pointer to the compressed data chunk. Then index access is still constant, just needing to decompress the actual data. Possibly using a proxy object would make doing the actual data decompression easier and more transparent (and maybe even allow you to use std::vector as the underlying container).
Alternately, std::deque stores its data in chunks already, so you could use a similar approach here. For example std::vector<compressed_data_chunk> where each chunk holds say 10 items compressed together as your underlying container. Then you can still directly index the chunk you need, decompress it, and return the item from the decompressed data. If you want, your containing object (that holds the vector) could even cache the most recently decompressed chunk or two for added performance on consecutive access (although this wouldn't help binary search very much at all).
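A rough sketch of that chunked layout, using zlib's compress/uncompress for the per-chunk compression rather than a hand-rolled coder; the chunk size is arbitrary, the element type must be trivially copyable, and error handling, partial last chunks, and caching are all left out:

#include <cstddef>
#include <vector>
#include <zlib.h>

template <typename T, std::size_t ChunkSize = 10>
class compressed_vector {
public:
    // Expects exactly ChunkSize items per call (a simplification).
    void push_chunk(const std::vector<T>& items) {
        uLongf destLen = compressBound(items.size() * sizeof(T));
        std::vector<Bytef> buf(destLen);
        compress(buf.data(), &destLen,
                 reinterpret_cast<const Bytef*>(items.data()),
                 items.size() * sizeof(T));
        buf.resize(destLen);                // keep only the compressed bytes
        chunks_.push_back(std::move(buf));
        size_ += items.size();
    }

    T operator[](std::size_t i) const {
        // Decompress only the chunk that holds element i.
        std::vector<T> tmp(ChunkSize);
        uLongf destLen = ChunkSize * sizeof(T);
        uncompress(reinterpret_cast<Bytef*>(tmp.data()), &destLen,
                   chunks_[i / ChunkSize].data(), chunks_[i / ChunkSize].size());
        return tmp[i % ChunkSize];
    }

    std::size_t size() const { return size_; }

private:
    std::vector<std::vector<Bytef>> chunks_;
    std::size_t size_ = 0;
};

int main() {
    compressed_vector<double> cv;
    cv.push_chunk(std::vector<double>(10, 3.14));
    double x = cv[3];
    (void)x;
}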
I've been thinking about this for a while now. From a theoretical point of view I identified 2 possibilities:
Flyweight, because repetition can be lessened with this pattern.
Serialization (compression is some form of serialization)
The first is purely object oriented and fits well I think in general, it doesn't have the disadvantage of messing up pointers for example.
The second seems better adapted here, although it does have a slight disadvantage in general: pointer invalidation + issues with pointer encoding / decoding, virtual tables, etc... Notably it doesn't work if the items refer to each others using pointers instead of indices.
I have seen a few "Huffman coding" solutions; however, this means that a compression algorithm has to be provided for each structure, which is not easy to generalize.
So I'd rather go the other way and use a compression library such as zlib, picking a fast algorithm like LZO for example.
A B* tree (or a variant) with a large number of items per node (since the data doesn't move), say 1001. Each node contains a compressed representation of the array of items. Indices are not compressed.
Possibly: a cache_view to access the container while storing the last 5 (or so) decompressed nodes or something. Another variant is to implement reference counting and keep the data uncompressed as long as someone has a handle to one of the items in the node.
Some remarks:
if you use a large number of items/keys per node, you get near-constant access time; for example, with 1001 items per node there are only 2 levels of indirection as long as you store fewer than a million items, 3 levels of indirection for a billion, etc...
you can build a readable/writable container with such a structure. I would make it so that I only recompress once I am done writing the node.
Okay, from the best of my understanding, what you want is some kind of accessor template. Basically, create a template adapter that has as its argument one of your element types which it accesses internally via whatever, a pointer, an index into your blob, etc. Make the adapter pointer-like:
const T *operator->() const;
etc. since it's easier to create a pointer adapter than it is a reference adapter (though see vector if you want to know how to write one of those). Notice, I made this accessor constant as per your guidelines. Then, pre-compute your offsets when the blob is loaded / compressed and populate the vector with your templated adapter class. Does this make sense? If you need more details, I will be happy to provide.
As for the compression algorithm, I suggest you simply do a frequency analysis of bytes in your blob and then run your uncompressed blob through a hard-coded Huffman encoding (as was more or less suggested earlier), capturing the offset of each element and storing it in your proxy adapter which in turn are the elements of your array. Indeed, you could make this all part of some compression class that compresses and generates elements that can be copy-back-inserted into your vector from the beginning. Again, reply if you need sample code.
Can some of the answers to "What is the best compression algorithm that allows random reads/writes in a file?" be adapted to your in-memory data?