Read/Sort a large .CSV File - c++

So conceptually I'm reading in a file with ~2 million lines of data. I'm looking to sort, store and apply other functions to the data later.
I've been told this is referred to as "buckets" but I'm unclear whether this is something pre-defined or a user-defined data type. So I'm curious whether a linked list or array or another combination would be advisable?
Do I need to worry about the size of the file? Will a typical program be able to hold all of that in memory at once, or will I need to partition the data first (i.e. divide it into buckets, store each bucket in its own file, then process them separately, etc.)?
If #2 is required, does C++ have the functionality to save multiple files per execution? Meaning a) create bucket1 file.txt; b) populate bucket1 file; c) close bucket1 file; d) create bucket2 file; ...

OK, so I gather from your post that you are writing this in C++. However, the details are a bit sparse apart from the sorting requirement. But what are you sorting on? Are all fields interpreted as text? Are some numbers? Are there multiple keys?
If you don't absolutely need to write this in C++, and you are on Linux, just invoke /bin/sort to do the sorting. This may seem like a cop-out, but commercial software like Talend even resorts to that.
But if you must write new code in C++, these are my recommendations:
1) Is the CSV file escaped? In other words, do embedded quotes and delimiters need special treatment? You have to figure this out first.
2) Check this out: http://mybyteofcode.blogspot.com/2010/02/parse-csv-file-with-boost-tokenizer-in.html
3) A simple representation of the scanned input is vector<vector<string> >, but it is unwieldy. Instead, wrap a class around vector<string> and make a vector of pointers to those classes, one per line of input, and sort those instead (see the first sketch below).
4) You should be able to sort ~2M "medium" rows in memory these days. Just use std::sort. But for full generality, you will need to consider: what if it doesn't fit into memory? The most common answer to this is to sort chunks at a time, write the results to disk, and then merge it all using a priority queue or similar structure (see the second sketch below).
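A minimal sketch of point 3 (the Row class, the key column, and plain text comparison are assumptions for illustration):

    #include <algorithm>
    #include <memory>
    #include <string>
    #include <vector>

    // One parsed CSV line, wrapped in a class as suggested in point 3.
    class Row {
    public:
        explicit Row(std::vector<std::string> fields) : fields_(std::move(fields)) {}
        const std::string& field(std::size_t i) const { return fields_[i]; }
    private:
        std::vector<std::string> fields_;
    };

    // rows owns the parsed data; we sort cheap-to-swap pointers instead of the rows.
    void sortByColumn(std::vector<std::unique_ptr<Row>>& rows,
                      std::vector<const Row*>& sorted, std::size_t col) {
        sorted.clear();
        for (const auto& r : rows) sorted.push_back(r.get());
        std::sort(sorted.begin(), sorted.end(),
                  [col](const Row* a, const Row* b) { return a->field(col) < b->field(col); });
    }

And a rough sketch of the external-merge idea from point 4, assuming each chunk file has already been sorted line by line with the same comparison (file names are placeholders):

    #include <fstream>
    #include <queue>
    #include <string>
    #include <vector>

    // K-way merge of already-sorted chunk files into one output file.
    void mergeSortedChunks(const std::vector<std::string>& chunkFiles,
                           const std::string& outFile) {
        struct Item {
            std::string line;
            std::size_t src;                      // which chunk it came from
            bool operator>(const Item& o) const { return line > o.line; }
        };
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

        std::vector<std::ifstream> in;
        for (const auto& f : chunkFiles) in.emplace_back(f);

        // Prime the heap with the first line of every chunk.
        for (std::size_t i = 0; i < in.size(); ++i) {
            std::string line;
            if (std::getline(in[i], line)) heap.push({line, i});
        }

        std::ofstream out(outFile);
        while (!heap.empty()) {
            Item top = heap.top();
            heap.pop();
            out << top.line << '\n';
            std::string line;
            if (std::getline(in[top.src], line)) heap.push({line, top.src});
        }
    }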

Related

Is it efficient to read the contents of a file into an unordered_map if it has over 1000 entries

I'm making a hash table that's supposed to give pretty fast lookup times for some values I type in beforehand. I didn't know how to go about it, but my friend said I should make a text file and just have an unordered map that reads from the text file and puts the values into the code before I run it. Is this efficient? Is there a better way to do this?
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?
As said in the comments, your idea is good enough unless these structures are really large (megabytes each).
If you have reasons to worry about the performance of that, e.g. if you want to support millions of records or very large values, more complicated approaches can be more efficient.
When I only need 64-bit support, I sometimes make a single binary file, optimized for memory-mapping the whole thing. Specifically: a fixed-size header, then a sorted array of (key, offset) tuples serving as a primary index (you can binary-search there; the OS only fetches the required pages from mapped files and caches them in RAM quite aggressively), then the values at the offsets specified in the index.
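A rough sketch of reading such a file (the header and entry layout here are only an example of the idea; the file is assumed to have been memory-mapped elsewhere, e.g. with mmap):

    #include <algorithm>
    #include <cstdint>

    // Assumed on-disk layout: [Header][IndexEntry x entryCount][values...]
    struct Header     { std::uint64_t entryCount; };
    struct IndexEntry { std::uint64_t key; std::uint64_t valueOffset; };

    // `mapped` points at the start of the memory-mapped file.
    // Returns the offset of the value for `key`, or UINT64_MAX if it is absent.
    std::uint64_t lookup(const char* mapped, std::uint64_t key) {
        const Header* hdr = reinterpret_cast<const Header*>(mapped);
        const IndexEntry* first =
            reinterpret_cast<const IndexEntry*>(mapped + sizeof(Header));
        const IndexEntry* last = first + hdr->entryCount;
        // The index is sorted by key, so a binary search only touches a few pages.
        const IndexEntry* it = std::lower_bound(
            first, last, key,
            [](const IndexEntry& e, std::uint64_t k) { return e.key < k; });
        return (it != last && it->key == key) ? it->valueOffset : UINT64_MAX;
    }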
Use std::map when:
- You need ordered data.
- You have to print/access the data in sorted order.
- You need the predecessor/successor of elements.
Use std::unordered_map when:
- You need to keep counts of some data (for example, strings) and no ordering is required.
- You need single-element access, i.e. no traversal.
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?
Definitely you can, but I hope you knew that you cannot read a file with a map; fstream is there for that purpose.
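A minimal sketch of that, assuming a simple whitespace-separated text format (the Record layout and the file name are made up for illustration):

    #include <fstream>
    #include <string>
    #include <unordered_map>

    struct Record {          // the "structure" stored as the mapped value
        int    id;
        double score;
    };

    std::unordered_map<std::string, Record> loadTable(const std::string& path) {
        std::unordered_map<std::string, Record> table;
        std::ifstream in(path);
        std::string key;
        Record rec;
        // Each line: <key> <id> <score>
        while (in >> key >> rec.id >> rec.score)
            table[key] = rec;
        return table;
    }

    // auto table = loadTable("values.txt");
    // lookup: table.at("some_key").score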

Algorithm to prevent duplication of field while copying data from text file to excel

I have over 100 text (.txt) files. The data in these files is comma-separated, for example:
good,bad,fine
The aforementioned items are just an example, the data can contain anything from individual words and email-IDs to phone numbers. I have to copy all of these values into a single column of one or more Excel (.xlsx) spreadsheet files.
Now the problem is that I don't want my data to be redundant, and there can be over 10 million data items. I'm planning to implement this design in C++. I prefer an efficient algorithm so that I can complete my work fast.
Separate your task into 2 steps:
a. getting the list of items in the memory - removing duplicates.
b. putting them into excel.
a: Use a sorted tree to gather all the items; that's how you will find duplicates fast.
b: Once done with the list, I would write everything to a simple file and import it into Excel, rather than try to drive the Excel API from C++.
After your comment: if a memory problem arises, you might want to create a tree per first letter and use files to store each list, so you will not run out of memory...
It is less efficient, but with today's computing power you won't feel it.
The main idea here is to quickly find whether you already have a given word, and if not, add it to the "list". Searching a sorted tree should do the trick. If you want to avoid the worst-case scenario, there is the AVL tree, if I recall correctly; it stays balanced no matter the order of the inserts, yet it is harder to code.
The quickest approach to write would be to simply load all the values (as std::strings) into a std::set. The interface is dead simple. It won't be the fastest way to run, but you never know.
In particular, unordered_set might be faster if you have it available, but it will be harder to use, because the large number of strings will cause rehashing unless you reserve enough buckets up front.
As you read a string, insert it into the set: if(my_set.insert("the string").second) .... That condition will be true for new values only.
10 million is a lot but if each string is around 20 bytes or so, you may get away with a simplistic approach like this.
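A sketch of that loop (the input/output file names and the comma splitting are assumptions):

    #include <fstream>
    #include <set>
    #include <sstream>
    #include <string>

    int main() {
        std::set<std::string> seen;
        std::ofstream out("unique_items.csv");      // one item per row = one Excel column
        for (int i = 1; i <= 100; ++i) {            // however you enumerate the .txt files
            std::ifstream in("file" + std::to_string(i) + ".txt");
            std::string line, item;
            while (std::getline(in, line)) {
                std::istringstream fields(line);
                while (std::getline(fields, item, ',')) {
                    if (seen.insert(item).second)   // true only for values not seen before
                        out << item << '\n';
                }
            }
        }
    }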

Handling large amounts of data in C++, need approach

So I have a 1GB file in a CSV format like so, that I converted to a SQLite3 database
column1;column2;column3
1212;abcd;20090909
1543;efgh;20120120
Except that I have 12 columns. Now, I need to read and sort this data and reformat it for output, but when I try to do this it seems I run out of RAM (using vectors). I read it in from SQLite and store each line of the file in a struct, which is then pushed back onto a deque. Like I said, I run out of memory when the RAM usage approaches 2 GB, and the app crashes. I tried using STXXL, but apparently it does not support vectors of non-POD types (so they have to be long int, double, char, etc.), and my vector consists mostly of std::strings, some boost::dates and one double value.
Basically what I need to do is group all "rows" together that has the same value in a specific column, in other words, I need to sort data based on one column and then work with that.
Any approach as to how I can read in everything or at least sort it? I would do it with SQLite3 but that seems time consuming. Perhaps I'm wrong.
Thanks.
In order of desirability:
don't use C++ at all, just use sort if possible
if you're wedded to using a DB to process a not-very-large csv file in what sounds like a not-really-relational way, shift all the heavy lifting into the DB and let it worry about memory management.
if you must do it in C++:
skip the SQLite3 step entirely since you're not using it for anything. Just map the csv file into memory, and build a vector of row pointers. Sort this without moving the data around (see the sketch after this list)
if you must parse the rows into structures:
don't store the string columns as std::string - this requires an extra non-contiguous allocation, which will waste memory. Prefer an inline char array if the length is bounded.
choose the smallest integer size that will fit your values (e.g., uint16_t would fit your sample first-column values)
be careful about padding: check the sizeof your struct, and reorder members or pack it if it's much larger than expected
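The sketch referred to above: map the csv into memory and sort a vector of row references (POSIX mmap is assumed, the file name is a placeholder, and for brevity the comparator orders whole lines; a real one would extract the key column):

    #include <algorithm>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <vector>

    int main() {
        int fd = open("data.csv", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        const char* base = static_cast<const char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
        const char* end = base + st.st_size;

        // Each row is represented only by a (pointer, length) pair into the mapping.
        struct RowRef { const char* p; std::size_t len; };
        std::vector<RowRef> rows;
        for (const char* p = base; p < end; ) {
            const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
            if (!nl) nl = end;
            rows.push_back({p, static_cast<std::size_t>(nl - p)});
            p = nl + 1;
        }

        // Sort the references; the 1 GB of row data never moves.
        std::sort(rows.begin(), rows.end(), [](const RowRef& a, const RowRef& b) {
            int c = std::memcmp(a.p, b.p, std::min(a.len, b.len));
            return c < 0 || (c == 0 && a.len < b.len);
        });

        munmap(const_cast<char*>(base), st.st_size);
        close(fd);
    }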
If you want to stick with the SQLite3 approach, I recommend using a list instead of a vector so your operating system doesn't need to find 1GB or more of contiguous memory.
If you can skip the SQLite3 step, here is how I would solve the problem:
Write a class (e.g. MyRow) which has a field for every column in your data set.
Read the file into a std::list<MyRow> where every row in your data set becomes an instance of MyRow
Write a predicate which compares the desired column
Use the sort function of the std::list to sort your data.
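Putting those steps together, a rough sketch (the MyRow fields, the file name, and the column chosen for sorting are assumptions):

    #include <fstream>
    #include <list>
    #include <sstream>
    #include <string>

    struct MyRow {                       // one field per column; only 3 shown here
        int         column1;
        std::string column2;
        std::string column3;
    };

    int main() {
        std::list<MyRow> rows;
        std::ifstream in("data.csv");
        std::string line;
        std::getline(in, line);          // skip the header line
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            MyRow r;
            std::string c1;
            std::getline(fields, c1, ';');  r.column1 = std::stoi(c1);
            std::getline(fields, r.column2, ';');
            std::getline(fields, r.column3, ';');
            rows.push_back(std::move(r));
        }
        // Predicate comparing the desired column, passed to std::list::sort.
        rows.sort([](const MyRow& a, const MyRow& b) { return a.column2 < b.column2; });
    }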
I hope this helps you.
There is significant overhead for std::string. If your struct contains a std::string for each column, you will waste a lot of space on char * pointers, malloc headers, etc.
Try parsing all the numerical fields immediately when reading the file, and storing them in your struct as ints or whatever you need.
If your file actually contains a lot of numeric fields like your example shows, I would expect it to use less than the file size worth of memory after parsing.
Create a structure for your records.
The record should have "ordering" functions for the fields you need to sort by.
Read the file as objects and store into a container that has random-access capability, such as std::vector or std::array.
For each field you want to sort by:
Create an index table, std::map, using the field value as the key and the record's index as the value.
To process the fields in order, choose your index table and iterate through the index table. Use the value field (a.k.a. index) to fetch the object from the container of objects.
If the records are of fixed length or can be converted to a fixed length, you could write the objects in binary to a file and position the file to different records. Use an index table, like above, except use file positions instead of indices.
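A small sketch of the index-table idea above (a std::multimap is used here since the key column may contain duplicates; the Record fields are assumptions):

    #include <map>
    #include <string>
    #include <vector>

    struct Record {
        std::string name;     // the column we want to sort/group by
        double      value;
    };

    int main() {
        std::vector<Record> records = /* read from the file */ {
            {"beta", 2.0}, {"alpha", 1.0}, {"beta", 3.0}
        };

        // Index table: field value -> position of the record in `records`.
        std::multimap<std::string, std::size_t> byName;
        for (std::size_t i = 0; i < records.size(); ++i)
            byName.insert({records[i].name, i});

        // Iterate in key order; the value (an index) fetches the full record.
        for (const auto& entry : byName) {
            const Record& r = records[entry.second];
            // ... process r; records with equal keys come out adjacent
            (void)r;
        }
    }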
Thanks for your answers, but I figured out a very fast and simple approach.
I let SQLite3 do the job for me by giving it this query:
SELECT * FROM my_table ORDER BY key_column ASC
For an 800MB file, that took about 70 seconds to process, and then I received all the data in my C++ program, already ordered by the column I wanted them grouped by. I processed the data one group at a time and output each group in my desired output format, keeping my RAM free from overload. Total time for the operation was about 200 seconds, which I'm pretty happy with.
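A rough sketch of what the C++ side can look like with the sqlite3 C API (the database and column names are placeholders, and error handling is omitted):

    #include <sqlite3.h>
    #include <string>

    int main() {
        sqlite3* db = nullptr;
        sqlite3_open("mydata.db", &db);

        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(db, "SELECT * FROM my_table ORDER BY key_column ASC",
                           -1, &stmt, nullptr);

        std::string currentGroup;
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            // Column 0 is assumed to be key_column here.
            const unsigned char* txt = sqlite3_column_text(stmt, 0);
            std::string key = txt ? reinterpret_cast<const char*>(txt) : "";
            if (key != currentGroup) {
                // A new group starts: flush/output the previous one here.
                currentGroup = key;
            }
            // ... accumulate this row into the current group
        }
        sqlite3_finalize(stmt);
        sqlite3_close(db);
    }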
Thank you for your time.

Data structures to implement unknown table schema in c/c++?

Our task is to read information about table schema from a file, implement that table in c/c++ and then successfully run some "select" queries on it. The table schema file may have contents like this,
Tablename- Student
"ID","int(11)","NO","PRIMARY","0","".
Now, my question is what data structures would be appropriate for the task. The problem is that I do not know the number of columns a table might have, nor what the names of those columns might be, nor anything about their data types. For example, a table might have just one column of type int, while another might have 15 columns of varying data types. In fact, I don't even know the number of tables whose descriptions the schema file might contain.
One way I thought of was to have a set number of say, 20 vectors (assuming that the upper limit of the columns in a table is 20), name those vectors 1stvector, 2ndvector and so on, map the name of the columns to the vectors, and then use them accordingly. But it seems the code for it would be a mess with all those if/else statements or switch case statements (for the mapping).
While googling/stack-overflowing, I learned that you can't define a class at runtime; otherwise the problem might have been easier to solve.
Any help is appreciated.
Thanks.
As a C++ data structure, you could try a std::vector< std::vector<boost::any> >. A vector is part of the Standard Library and allows dynamic resizing of the number of elements. A vector of vectors would imply an arbitrary number of rows with an arbitrary number of columns. Boost.Any is not part of the Standard Library but is widely available and allows storing arbitrary types.
I am not aware of any good C++ library to do SQL queries on that data structure. You might need to write your own. E.g. the SQL commands select and where would correspond to the STL algorithm std::find_if with an appropriate predicate passed as a function object.
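A tiny sketch of that representation and a find_if-based "select ... where" (the column index and types are assumptions):

    #include <boost/any.hpp>
    #include <algorithm>
    #include <string>
    #include <vector>

    using Row   = std::vector<boost::any>;
    using Table = std::vector<Row>;

    int main() {
        Table student;
        student.push_back({boost::any(1), boost::any(std::string("Alice"))});
        student.push_back({boost::any(2), boost::any(std::string("Bob"))});

        // "SELECT * FROM student WHERE name = 'Bob'" as a predicate for find_if.
        auto it = std::find_if(student.begin(), student.end(), [](const Row& r) {
            return boost::any_cast<const std::string&>(r[1]) == "Bob";
        });
        if (it != student.end()) {
            int id = boost::any_cast<int>((*it)[0]);
            (void)id;   // found the matching row
        }
    }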
To deal with the lack of knowledge about the data column types, you almost have to store the raw input (i.e. strings, which suggests std::string) and coerce the interpretation as needed later on.
This also has the advantage that the column names can be stored in the same type.
If you really want to determine the column type, you'll need to speculatively parse each column of input to see what it could be and make decisions on that basis.
Either way, if the input could contain a column that has the column-separation symbol in it (say a string including a space in otherwise whitespace-separated data), you will have to know the quoting convention of the input and write a parser of some kind to work on the data (sucking whole lines in with getline is your friend here). Your input appears to be comma-separated with double-quote-delimited strings.
I suggest using std::vector to hold all the table creation statements. After all the creation statements are read in, you can construct your table.
The problem to overcome is the plethora of column types. All the C++ containers like to have a uniform type, such as std::vector<std::string>. You will have different column types.
One solution is to have your data types descend from a single base. That would allow you to have a std::vector<Base *> for each row of the table, where the pointers can point to fields of different (child) types.
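A bare-bones sketch of that single-base idea (the field types shown are just examples):

    #include <memory>
    #include <string>
    #include <vector>

    struct Field {                                  // common base for every column type
        virtual ~Field() = default;
        virtual std::string toText() const = 0;
    };

    struct IntField : Field {
        int value;
        explicit IntField(int v) : value(v) {}
        std::string toText() const override { return std::to_string(value); }
    };

    struct TextField : Field {
        std::string value;
        explicit TextField(std::string v) : value(std::move(v)) {}
        std::string toText() const override { return value; }
    };

    // One row = one vector of base pointers; a table = a vector of rows.
    using Row = std::vector<std::unique_ptr<Field>>;

    Row makeStudentRow(int id, const std::string& name) {
        Row r;
        r.push_back(std::make_unique<IntField>(id));
        r.push_back(std::make_unique<TextField>(name));
        return r;
    }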
I'll leave the rest up to the OP to figure out.

compressed string storage

Let's say I have many objects containing strings of non-trivial length (around ~3-4kb). The strings are all different from each other, yet at the same time contain lots of common parts/subsequences. On average maybe 80-90% of any individual string is contained within the others as well. Is there an easy way to automatically exploit this huge redundancy for compressing the data?
Ideally the solution would be C++ and transparent for the user (i.e. I can use it as if I was accessing a regular read only const std::string but instead reading from compressed storage).
Algorithmically, Lempel–Ziv–Welch with one dictionary for all objects/strings might be a good start.
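As a practical stand-in for that shared-dictionary idea, zlib's deflate with a preset dictionary can be used (note this swaps in deflate rather than LZW; how you build the dictionary of common substrings is up to you, this only shows the mechanics):

    #include <zlib.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Compress one string against a shared, preset dictionary of common substrings.
    // Decompression must call inflateSetDictionary with the same dictionary when
    // inflate() returns Z_NEED_DICT.
    std::vector<unsigned char> compressWithDict(const std::string& text,
                                                const std::string& dict) {
        z_stream zs{};
        if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK)
            throw std::runtime_error("deflateInit failed");
        deflateSetDictionary(&zs,
                             reinterpret_cast<const Bytef*>(dict.data()),
                             static_cast<uInt>(dict.size()));

        std::vector<unsigned char> out(deflateBound(&zs, text.size()));
        zs.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(text.data()));
        zs.avail_in  = static_cast<uInt>(text.size());
        zs.next_out  = out.data();
        zs.avail_out = static_cast<uInt>(out.size());

        if (deflate(&zs, Z_FINISH) != Z_STREAM_END)
            throw std::runtime_error("deflate failed");
        out.resize(zs.total_out);
        deflateEnd(&zs);
        return out;
    }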
You can use Huffman coding; the implementation is not hard. Also, there are zip implementations in many languages (like C# and Java) that you can use.
Also, if you are sure 80-90% of the content is repeated across all strings, create a dictionary of all words, then for each string store which dictionary words it contains: keep a large bit array (e.g. 10,000 bits) and set bits[i] to 1 if words[i] exists in the current string. If the average word length is 5 characters, this abbreviation takes around 1/5 of the size.
If the common parts of the strings are common because they are composed from other strings, then you might get some traction by using the STLport rope class, which looks for all the world like a std::string, but uses a substring-tree representation with copy-on-write that makes it both very space efficient (common substrings are shared) and very good at inserts and deletes (O(log n)).
When to use rope:
you are making a template engine. document instances are made from a template by substituting varying data in the template, and then cached for future uses. Parts that are common to templates and instances are stored only once and shared across instances, inserts and deletes are cheap.
When not to use rope:
you are loading many documents from outside the domain of your application (from disk, or over a network) and using them without modification. rope doesn't share strings if they are not copied from one rope to another. If you can afford to do the work to find the common substrings, rope can still be used to improve your final representations.
Like @Saeed mentioned, simple Huffman coding will perform well here.
There is no need for a dictionary if the common words are known a priori (you've mentioned that it's HTML). Just precompute a Huffman table using statistical data from many HTML files (note that you can encode a whole tag as a single symbol, and you can have as many symbols as you want).