How do I join huge CSV files (thousands of columns x thousands of rows) efficiently using C/C++?

I have several (1-5) very wide (~50,000 columns) .csv files. The files are 0.5-1 GB in size (averaging around 500 MB). I need to perform a join on the files on a pre-specified column. Efficiency is, of course, the key. Any solution that can scale out to efficiently handle multiple join columns would be a bonus, though that is not currently required. Here are my inputs:
- Primary File
- Secondary File(s)
- Join column of Primary File (name or column position)
- Join column of Secondary File (name or column position)
- Left join or inner join?
Output = one file with the results of the multi-file join
I am looking to solve the problem using a C-based language, but of course an algorithmic solution would also be very helpful.

Assuming that you have a good reason not to use a database (for all I know, the 50,000 columns may constitute such a reason), you probably have no choice but to clench your teeth and build yourself an index for the right file. Read through it sequentially to populate a hash table where each entry contains just the key column and an offset in the file where the entire row begins. The index itself then ought to fit comfortably in memory, and if you have enough address space (i.e. unless you're stuck with 32-bit addressing) you should memory-map the actual file data so you can access and output the appropriate right rows easily as you walk sequentially through the left file.
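A minimal sketch of that approach (POSIX mmap; key columns passed by position, plain commas with no quoting, and an inner join only, with a comment marking where a left join would differ, are all simplifying assumptions):

// Minimal sketch: index the secondary ("right") file by its join key,
// memory-map it, then stream the primary ("left") file and emit joined rows.
// Assumes plain CSV without quoted/embedded commas; error handling is minimal.
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <fstream>
#include <iostream>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <unordered_map>
#include <vector>

// Extract the n-th comma-separated field from a line (no quoting support).
static std::string nthField(const std::string& line, std::size_t n) {
    std::size_t start = 0;
    for (std::size_t i = 0; i < n; ++i) {
        start = line.find(',', start);
        if (start == std::string::npos) return "";
        ++start;
    }
    std::size_t end = line.find(',', start);
    return line.substr(start, end == std::string::npos ? std::string::npos : end - start);
}

int main(int argc, char** argv) {
    if (argc < 6) {
        std::cerr << "usage: join LEFT RIGHT LEFT_KEY_COL RIGHT_KEY_COL OUT\n";
        return 1;
    }
    const std::size_t leftKey = std::stoul(argv[3]), rightKey = std::stoul(argv[4]);

    // 1) Build the index: join key -> byte offset(s) of the row in the right file.
    std::unordered_map<std::string, std::vector<std::size_t>> index;
    {
        std::ifstream right(argv[2]);
        std::string line;
        std::size_t offset = 0;
        while (std::getline(right, line)) {
            index[nthField(line, rightKey)].push_back(offset);
            offset += line.size() + 1;  // +1 for '\n'; adjust for "\r\n" if needed
        }
    }

    // 2) Memory-map the right file so matched rows can be copied without re-reading.
    int fd = ::open(argv[2], O_RDONLY);
    struct stat st;
    ::fstat(fd, &st);
    const std::size_t rightSize = static_cast<std::size_t>(st.st_size);
    const char* rightData =
        static_cast<const char*>(::mmap(nullptr, rightSize, PROT_READ, MAP_PRIVATE, fd, 0));

    // 3) Stream the left file, look up each key, and write joined rows.
    std::ifstream left(argv[1]);
    std::ofstream out(argv[5]);
    std::string line;
    while (std::getline(left, line)) {
        auto it = index.find(nthField(line, leftKey));
        if (it == index.end()) continue;  // inner join; emit "line" alone here for a left join
        for (std::size_t off : it->second) {
            const char* row = rightData + off;
            const char* end =
                static_cast<const char*>(std::memchr(row, '\n', rightSize - off));
            out << line << ',' << std::string(row, end ? end - row : rightSize - off) << '\n';
        }
    }
    ::munmap(const_cast<char*>(rightData), rightSize);
    ::close(fd);
}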

Your best bet by far is something like SQLite: there are C++ bindings for it, and it's tailor-made for lightning-fast inserts and queries.
For the actual reading of the data, you can just go row by row and insert the fields into SQLite; there's no need for cache-destroying objects of objects :) As an optimization, you should group multiple inserts into one statement (insert into table(...) select ... union all select ... union all select ...).
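A hedged sketch of that load path using the SQLite C API (the table name, three-column layout, and input file name are assumptions for illustration); wrapping the prepared-statement inserts in a single transaction gives a similar effect to the UNION ALL batching:

// Sketch: bulk-loading CSV rows into SQLite with a prepared statement inside
// one transaction. The table name, 3-column layout and file name are assumptions.
#include <sqlite3.h>
#include <fstream>
#include <sstream>
#include <string>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("join_work.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS rows(key TEXT, col1 TEXT, col2 TEXT);",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO rows VALUES(?1, ?2, ?3);", -1, &stmt, nullptr);

    std::ifstream in("input.csv");            // hypothetical input file
    std::string line;
    while (std::getline(in, line)) {
        std::stringstream ss(line);
        std::string field;
        for (int col = 1; col <= 3 && std::getline(ss, field, ','); ++col)
            sqlite3_bind_text(stmt, col, field.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);                   // run the insert for this row
        sqlite3_reset(stmt);
        sqlite3_clear_bindings(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
    sqlite3_close(db);
}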

If you need to use C or C++, open the file and load it directly into a database such as MySQL. The C and C++ languages do not have adequate data-table structures or functionality for manipulating the data. A spreadsheet application might be useful but may not be able to handle files of this size.
That said, I recommend objects for each field (column). Define a record (file specific) as a collection of fields. Read a text line from a file into a string. Let the record load the field data from the string. Store records into a vector.
Create a new record for the destination file. For each record from the input file(s), load the new record using those fields. Finally, for each record, print the contents of each field with separation characters.
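A rough sketch of that field/record design (the class names and input file are illustrative, not a prescribed implementation):

// Rough sketch of the field/record design described above; names are illustrative.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct Field {
    std::string value;                 // each column value stored as text
};

struct Record {
    std::vector<Field> fields;

    // Populate the record's fields from one CSV text line (no quoting support).
    void load(const std::string& line) {
        fields.clear();
        std::stringstream ss(line);
        std::string token;
        while (std::getline(ss, token, ','))
            fields.push_back(Field{token});
    }

    // Print the contents of each field with separation characters.
    void print(std::ostream& out, char sep = ',') const {
        for (std::size_t i = 0; i < fields.size(); ++i) {
            if (i) out << sep;
            out << fields[i].value;
        }
        out << '\n';
    }
};

int main() {
    std::ifstream in("input.csv");     // hypothetical file
    std::vector<Record> records;       // store records into a vector
    std::string line;
    while (std::getline(in, line)) {
        Record r;
        r.load(line);
        records.push_back(std::move(r));
    }
    for (const Record& r : records) r.print(std::cout);
}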
An alternative is to whip up a two-dimensional matrix of strings.
Your performance bottleneck will be I/O. You may want to read huge blocks of data in. The main obstacle to efficiency is the variable record length of a CSV file.
I still recommend using a database. There are plenty of free ones out there, such as MySQL.

It depends on what you mean by "join". Are the columns in file 1 the same as in file 2? If so, you just need a merge sort. Most likely a solution based on merge sort is "best". But I agree with #Blindy above that you should use an existing tool like SQLite. Such a solution is probably more future-proof against changes to the column lists.
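For the merge-sort route, once both files have been (externally) sorted on the join key, the merge itself is a single sequential pass. A minimal inner-join sketch, assuming sorted inputs and the key in the first column (duplicate keys on both sides would need extra buffering for the cross product):

// Minimal merge-join sketch: assumes both CSV files are already sorted on the
// join key and that the key is the first column (simplifying assumptions).
#include <fstream>
#include <iostream>
#include <string>

static std::string key(const std::string& line) {
    return line.substr(0, line.find(','));
}

int main(int argc, char** argv) {
    if (argc < 3) return 1;
    std::ifstream a(argv[1]), b(argv[2]);
    std::string la, lb;
    bool haveA = static_cast<bool>(std::getline(a, la));
    bool haveB = static_cast<bool>(std::getline(b, lb));

    while (haveA && haveB) {
        const std::string ka = key(la), kb = key(lb);
        if (ka < kb)      haveA = static_cast<bool>(std::getline(a, la));
        else if (kb < ka) haveB = static_cast<bool>(std::getline(b, lb));
        else {
            std::cout << la << ',' << lb << '\n';   // inner-join match
            haveA = static_cast<bool>(std::getline(a, la));
            haveB = static_cast<bool>(std::getline(b, lb));
        }
    }
}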

Related

Can I add a new column without rewriting an entire file?

I've been experimenting with Apache Arrow. I have used column-oriented, memory-mapped files for many years. In the past, I've used a separate file for each column. Arrow seems to like to store everything in one file. Is there a way to add a new column without rewriting the entire file?
The short answer is probably no.
Arrow's in-memory format & libraries support this. You can add a chunked array to a table by just creating a new table (this should be zero-copy).
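For the in-memory case, a sketch with the Arrow C++ library (the exact AddColumn signature has shifted between Arrow versions, so treat this as illustrative; the column names and values are made up):

// Sketch (Arrow C++): appending a column to an existing in-memory Table.
// Table::AddColumn returns a new Table that shares the existing column buffers,
// so no row data is copied. API details may vary slightly between Arrow versions.
#include <arrow/api.h>
#include <iostream>
#include <memory>

arrow::Status Run() {
    // Build a one-column table (stand-in for a table you already have).
    arrow::Int64Builder idBuilder;
    ARROW_RETURN_NOT_OK(idBuilder.AppendValues({1, 2, 3}));
    std::shared_ptr<arrow::Array> ids;
    ARROW_RETURN_NOT_OK(idBuilder.Finish(&ids));

    auto table = arrow::Table::Make(arrow::schema({arrow::field("id", arrow::int64())}),
                                    {ids});

    // Build the new column and append it; the result is a new Table object.
    arrow::DoubleBuilder scoreBuilder;
    ARROW_RETURN_NOT_OK(scoreBuilder.AppendValues({0.5, 1.5, 2.5}));
    std::shared_ptr<arrow::Array> scores;
    ARROW_RETURN_NOT_OK(scoreBuilder.Finish(&scores));

    ARROW_ASSIGN_OR_RAISE(
        auto withScores,
        table->AddColumn(table->num_columns(),
                         arrow::field("score", arrow::float64()),
                         std::make_shared<arrow::ChunkedArray>(scores)));

    std::cout << withScores->ToString() << std::endl;
    return arrow::Status::OK();
}

int main() {
    arrow::Status st = Run();
    if (!st.ok()) {
        std::cerr << st.ToString() << std::endl;
        return 1;
    }
    return 0;
}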
However, it appears you are talking about storing tables in files. None of the common file formats in use (parquet, csv, feather) support partitioning a table in this way.
Keep in mind, if you are reading a parquet file, you can specify which column(s) you want to read and it will only read the necessary data. So if your goal is only to support individual column retrieval/query then you can just build one large table with all your columns.

DynamoDB Query in a tight loop or scan?

Here is my basic data structure (or the relevant portions anyway) in DynamoDB; I have a files table that holds file data and has an id for the file. I also have a 'Definitions' table that holds items defined in the file. Definitions also have an ID (as the primary key) as well as a field called 'SourceFile' that references the file id in order to tie the definition to its source file.
Most of the time I want to just get the definition by its id and optionally get the file later, which works just fine. However, in some cases I need to get all definitions for a set of files. I can do this with a scan, but it's slow, I know it will get slower as the table grows, and it isn't recommended. However, I'm not sure how to do this with a query.
I can create a GSI that uses the SourceFile field as the primary key and use that to query against. This sounds like an answer (and may be); however, I'm not sure. The problem is that some libraries may have 5k or 10k files (maybe more in rare cases). With a GSI I can only query against one file ID per query, so I would have to issue a new query for each file, and I can't imagine it's going to be very efficient to throw 10k queries at DynamoDB...
Is it better to create a tight loop (or multiple threads) and hit it with a ton of queries or to scan the table? Is there another way to do this that I'm not thinking of?
This is during an indexing and analysis process that is expected to take a bit of time so it's ok that it's not instant but I'd like it to be as efficient as possible...
Scans are the most efficient if you expect to be looking for a majority of the data in your database. You can retrieve up to 1MB per scan request, and for each unit of capacity available you can read 4KB, so a full 1MB page costs up to 256 read capacity units. Assuming you have enough capacity provisioned, you can retrieve thousands of items in a single request (assuming the items are pretty small).
The only alternative I can think of is to add more metadata that can help you index the files & definitions at a higher level - like, for instance, the library name/id. With that you can create a GSI on library name/id and query that way.
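For illustration, querying such a library-level GSI with the AWS SDK for C++ might look like the sketch below; the table, index, and attribute names are assumptions:

// Hedged sketch (AWS SDK for C++): querying a GSI keyed on a library-level
// attribute. Table, index, and attribute names here are assumptions.
#include <aws/core/Aws.h>
#include <aws/dynamodb/DynamoDBClient.h>
#include <aws/dynamodb/model/AttributeValue.h>
#include <aws/dynamodb/model/QueryRequest.h>
#include <iostream>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::DynamoDB::DynamoDBClient client;

        Aws::DynamoDB::Model::QueryRequest request;
        request.SetTableName("Definitions");
        request.SetIndexName("LibraryId-index");               // hypothetical GSI name
        request.SetKeyConditionExpression("LibraryId = :lib"); // hypothetical attribute
        Aws::Map<Aws::String, Aws::DynamoDB::Model::AttributeValue> values;
        values[":lib"] = Aws::DynamoDB::Model::AttributeValue().SetS("library-123");
        request.SetExpressionAttributeValues(values);

        // One query per library instead of one per file; paginate until
        // LastEvaluatedKey comes back empty.
        do {
            auto outcome = client.Query(request);
            if (!outcome.IsSuccess()) {
                std::cerr << outcome.GetError().GetMessage() << std::endl;
                break;
            }
            for (const auto& item : outcome.GetResult().GetItems())
                std::cout << item.at("Id").GetS() << "\n";      // assumes an "Id" string attribute
            if (outcome.GetResult().GetLastEvaluatedKey().empty()) break;
            request.SetExclusiveStartKey(outcome.GetResult().GetLastEvaluatedKey());
        } while (true);
    }
    Aws::ShutdownAPI(options);
}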
Running thousands of queries is going to be less efficient than scanning, assuming you are storing on the order of tens or hundreds of thousands of items.

Data Storage and Retrieval in Relational Database

I'm starting a project - a Mini Database System, basically a small database like MySQL. I'm planning to use C++. I read several articles and understood that tables are stored and retrieved using files. Further, I need to use B+ trees for accessing and updating the data.
Can someone explain, with an example, how the data will actually be stored inside files?
For example, I have a database "test" with a table "student" in it.
student(id, name, grade, class) with some student entries. How will the entries of this table be stored inside the file? Will it be stored in a single file, or divided into multiple files? If the latter, then how?
A B+Tree on disk is a bunch of fixed-length blocks. Your program will read/write whole blocks.
Within a block, there are a variable number of records. Those are arranged by some mechanism of your choosing, and need to be ordered in some way.
"Leaf nodes" contain the actual data. In "non-leaf nodes", the "records" contain pointers to child nodes; this is the way BTrees work.
B+Trees have the additional links (and maintenance hassle) of chaining blocks at the same level.
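A toy sketch of such a fixed-length block layout (block size, key type, and payload width are arbitrary choices, not a prescribed format):

// Toy illustration of fixed-length B+Tree blocks; sizes and layouts are
// arbitrary choices, not a production format.
#include <cstdint>
#include <cstdio>
#include <cstring>

constexpr std::size_t BLOCK_SIZE = 4096;   // read/write whole blocks of this size
constexpr int MAX_KEYS = 64;               // capacity chosen so a block fits in BLOCK_SIZE

// Non-leaf ("internal") block: keys plus child block numbers.
struct InternalBlock {
    std::uint16_t numKeys;
    std::int64_t  keys[MAX_KEYS];
    std::uint32_t children[MAX_KEYS + 1];  // block numbers of child blocks
};

// Leaf block: keys plus the actual records, and a link to the next leaf
// (the extra sibling chaining that distinguishes a B+Tree from a plain BTree).
struct LeafRecord {
    std::int64_t key;
    char         payload[48];              // e.g. fixed-width row data or a row pointer
};

struct LeafBlock {
    std::uint16_t numRecords;
    std::uint32_t nextLeaf;                // block number of the right sibling, 0 if none
    LeafRecord    records[MAX_KEYS];
};

// Read block number `blockNo` of the index file into a fixed-size buffer.
bool readBlock(std::FILE* f, std::uint32_t blockNo, void* buffer) {
    if (std::fseek(f, static_cast<long>(blockNo) * BLOCK_SIZE, SEEK_SET) != 0) return false;
    return std::fread(buffer, BLOCK_SIZE, 1, f) == 1;
}

int main() {
    static_assert(sizeof(InternalBlock) <= BLOCK_SIZE, "internal block must fit in a block");
    static_assert(sizeof(LeafBlock) <= BLOCK_SIZE, "leaf block must fit in a block");
    std::printf("internal: %zu bytes, leaf: %zu bytes, block: %zu bytes\n",
                sizeof(InternalBlock), sizeof(LeafBlock), BLOCK_SIZE);
}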
Wikipedia has some good discussions.

Qt splitting data structure into groups

I have a problem I'm trying to solve, but I'm at a standstill because I'm still learning Qt, which causes doubts about what the 'Qt' way of solving the problem is while also being the most efficient in terms of time complexity. I read a file line by line (line counts ranging between 10 and 2,000,000). At the moment my approach is to dump every line into a QVector.
QVector<QString> lines;
lines.append("id,name,type");
lines.append("1,James,A");
lines.append("2,Mark,B");
lines.append("3,Ryan,A");
Assuming the above structure, I would like to present the user with three views that show the data based on the type field. The data is comma-delimited in its original form. My question is: what's the most elegant and possibly most efficient way to achieve this?
Note: for a visual aid, the end result kind of emulates Microsoft Access. So there will be a list of tables on the left side. In my case these table names will be the values of the grouping field (A, B). And when I switch between those two list items, the central view (a table) will refill to contain that particular group's data.
Should I split the data into a number of structures? Or would that cause unnecessary overhead?
Would really appreciate any help
In the end, you'll want to have some sort of a data model that implements QAbstractItemModel that exposes the data, and one or more views connected to it to display it.
If the data doesn't have to be editable, you could implement a custom table model derived from QAbstractTableModel that maps the file in memory (using QFile::map), and incrementally parses it on the fly (implement canFetchMore and fetchMore).
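A sketch of that read-only model, assuming the three-column layout above; the batch size and the naive comma splitting (no quoted fields) are simplifying assumptions:

// Sketch of a read-only model that memory-maps the CSV file and parses it
// incrementally via canFetchMore()/fetchMore().
#include <QAbstractTableModel>
#include <QFile>
#include <QString>
#include <QStringList>
#include <QVariant>
#include <QVector>

class CsvModel : public QAbstractTableModel {
public:
    explicit CsvModel(const QString& path, QObject* parent = nullptr)
        : QAbstractTableModel(parent), m_file(path) {
        m_file.open(QIODevice::ReadOnly);
        m_data = m_file.map(0, m_file.size());   // file stays on disk, pages in on demand
    }

    int rowCount(const QModelIndex& = QModelIndex()) const override { return m_rows.size(); }
    int columnCount(const QModelIndex& = QModelIndex()) const override { return 3; } // id,name,type

    QVariant data(const QModelIndex& index, int role = Qt::DisplayRole) const override {
        if (role != Qt::DisplayRole || !index.isValid()) return QVariant();
        const QStringList fields = m_rows.at(index.row()).split(',');
        return index.column() < fields.size() ? QVariant(fields.at(index.column())) : QVariant();
    }

    bool canFetchMore(const QModelIndex&) const override { return m_offset < m_file.size(); }

    void fetchMore(const QModelIndex&) override {
        QVector<QString> batch;
        while (m_offset < m_file.size() && batch.size() < 1000) {   // parse 1000 lines per call
            qint64 end = m_offset;
            while (end < m_file.size() && m_data[end] != '\n') ++end;
            batch.append(QString::fromUtf8(
                reinterpret_cast<const char*>(m_data + m_offset), int(end - m_offset)));
            m_offset = end + 1;
        }
        beginInsertRows(QModelIndex(), m_rows.size(), m_rows.size() + batch.size() - 1);
        m_rows += batch;
        endInsertRows();
    }

private:
    QFile m_file;
    uchar* m_data = nullptr;
    qint64 m_offset = 0;
    QVector<QString> m_rows;
};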
If the data is to be editable, you might be best off throwing it all into a temporary sqlite table as you parse the file, attaching a QSqlTableModel to it, and attaching some views to it.
When the user wants to save the changes, you simply iterate over the model and dump it out to a text file.
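For the editable path, a sketch of the temporary SQLite table plus QSqlTableModel approach (the table, column, and file names are illustrative); switching the selected group on the left then just changes the model's filter:

// Sketch of the editable path: parse the CSV into an in-memory SQLite table,
// attach a QSqlTableModel, and filter by the grouping field. Names are illustrative.
#include <QApplication>
#include <QFile>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QSqlTableModel>
#include <QTableView>
#include <QTextStream>

int main(int argc, char** argv) {
    QApplication app(argc, argv);

    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE");
    db.setDatabaseName(":memory:");                 // temporary, in-memory table
    db.open();

    QSqlQuery ddl("CREATE TABLE people(id INTEGER, name TEXT, type TEXT)");

    // Load the CSV (hypothetical file name), skipping the header line.
    QFile file("people.csv");
    file.open(QIODevice::ReadOnly | QIODevice::Text);
    QTextStream in(&file);
    in.readLine();                                  // "id,name,type"
    QSqlQuery insert;
    insert.prepare("INSERT INTO people VALUES(?, ?, ?)");
    while (!in.atEnd()) {
        const QStringList f = in.readLine().split(',');
        if (f.size() < 3) continue;
        insert.addBindValue(f[0].toInt());
        insert.addBindValue(f[1]);
        insert.addBindValue(f[2]);
        insert.exec();
    }

    // One editable model; the view refills when the group filter changes.
    QSqlTableModel model;
    model.setTable("people");
    model.setFilter("type = 'A'");                  // e.g. user clicked group "A" on the left
    model.select();

    QTableView view;
    view.setModel(&model);
    view.show();
    return app.exec();
}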

Non-permanent huge external data storage in C++ application

I'm rewriting an application which handles a lot of data (about 100 GB) that is designed as a relational model.
The application is very complex; it is a kind of conversion tool for OpenStreetMap data of huge sizes (the whole world) and converts it into a map file for our own route-planning software. The converter application, for example, holds the nodes of the OpenStreetMap data with their coordinates and all their tags (it holds a lot more than that, but this should serve as an example in this question).
Current situation:
Because this data is huge, I split it into several files: each file is a map from an ID to an atomic value (let's assume the list of tags for a node is an atomic value; it is not, but the data storage can treat it as such). So for nodes, I have a file holding the nodes' coordinates, one holding their names and one holding their tags, where the nodes are identified by (non-continuous) IDs.
The application was once split into several applications, each processing one step of the conversion, so each application only needs to handle some of the data stored in the files. For example, not all applications need the nodes' tags, but a lot of them need the nodes' coordinates. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure within RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
I'm currently rewriting the converter. It should now be one single application, and it should no longer use separate files for each "column". It should instead use some well-known architecture to hold external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast at iterating over the existing data (while not modifying the set of rows, only some values in the current row).
It needs to provide constant or near-constant lookup, similar to hash maps (while the relation is not being modified at all).
Most of the column types are fixed-size, but in general they are not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating some kind of search index will not be required. Updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value-based, while others are (continuously indexed) arrays. Both should provide fast lookups.
It should NOT be a separate process, as a DBMS like MySQL would be. The number of queries will be enormous (around 10 billion) and will be the dominant performance bottleneck. However, caching queries would be a possible workaround: iterating over a whole table can be done in a single query, while writing to a table (from which no data will be read in the same processing step) can happen in a batch query. But still, I suspect that serializing, transmitting between processes, and deserializing SQL queries would be the bottleneck.
Nice-to-have: easy to use. It would be very nice if the relations could be used in a similar way to the C++ standard and Qt container classes.
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and "output relations" it writes into. However, some steps require reading some columns of a relation while writing to other columns of the same relation.
Joining relations. There are a few cross-references between different relations; however, they can be resolved within my application if lookup is fast enough.
Persistent storage. Once the conversion is done, all the data will not be required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server; the 'server' is linked into your application.
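Since persistence is explicitly not required here, an in-memory (or throwaway on-disk) SQLite database with durability turned off is worth benchmarking. A minimal sketch using the C API; the node_coords table mirrors the node-coordinates example from the question, and the PRAGMA values are the usual speed-over-safety settings:

// Sketch: in-process SQLite tuned for throwaway data (no durability needed).
// Everything runs inside your application; there is no separate server process.
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    // ":memory:" keeps the whole database in RAM; a temp file path works too.
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

    // Speed-over-safety settings: fine here because the data is discarded anyway.
    sqlite3_exec(db, "PRAGMA journal_mode = OFF;", nullptr, nullptr, nullptr);
    sqlite3_exec(db, "PRAGMA synchronous = OFF;", nullptr, nullptr, nullptr);
    sqlite3_exec(db, "PRAGMA temp_store = MEMORY;", nullptr, nullptr, nullptr);

    sqlite3_exec(db, "CREATE TABLE node_coords(id INTEGER PRIMARY KEY, lat REAL, lon REAL);",
                 nullptr, nullptr, nullptr);

    // Batch writes inside a transaction; queries are plain function calls, so the
    // per-query overhead is a library call, not inter-process communication.
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO node_coords VALUES(?1, ?2, ?3);", -1, &ins, nullptr);
    for (sqlite3_int64 id = 0; id < 1000; ++id) {
        sqlite3_bind_int64(ins, 1, id);
        sqlite3_bind_double(ins, 2, 48.0 + id * 1e-6);   // made-up coordinates
        sqlite3_bind_double(ins, 3, 11.0 + id * 1e-6);
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }
    sqlite3_finalize(ins);
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

    // Near-constant-time lookup by primary key.
    sqlite3_stmt* sel = nullptr;
    sqlite3_prepare_v2(db, "SELECT lat, lon FROM node_coords WHERE id = ?1;", -1, &sel, nullptr);
    sqlite3_bind_int64(sel, 1, 42);
    if (sqlite3_step(sel) == SQLITE_ROW)
        std::printf("%f %f\n", sqlite3_column_double(sel, 0), sqlite3_column_double(sel, 1));
    sqlite3_finalize(sel);

    sqlite3_close(db);
}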
Personally this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.