Handle huge amount of data in Qt - c++

I need to read some text files that contain a huge amount of data, say 4 files each of about 500MB.
Each file contains several lines and each line has about this format:
id timestamp field1 field2 field3 field4
My strategy so far has been to parse each file and, for every line, create a QTreeWidgetItem with enough fields to store that line (because I want to show some of this data in a QTreeWidget while the program runs), appending all these items to a QList.
This QList is kept for the whole execution of the program, so the data is always available and I don't need to parse the files again.
I need all the data available because at any moment I may need to access the data for a particular timestamp interval.
However, this strategy seems too expensive in terms of resources: the program consumes several GB of memory and eventually crashes.
How can I handle this data in a better way?

What you want is called 'lazy loading'.
The Qt documentation has an example (the Fetch More Example) that shows how to use QAbstractItemModel together with canFetchMore() and fetchMore().
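To make the idea concrete, here is a minimal sketch of the data side of such a model, written with the standard library only so it can be tried without Qt. The class name LazyRecordSource, the Record struct, and the batch size are hypothetical, and I am assuming that id and timestamp are integers. In a real model the same logic would live in a QAbstractItemModel subclass, and fetchMore() would wrap the insertion in beginInsertRows()/endInsertRows().

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One parsed line: "id timestamp field1 field2 field3 field4".
// Assumption: id and timestamp are integers; the fields are kept as text.
struct Record {
    long long id = 0;
    long long timestamp = 0;
    std::string fields[4];
};

// Hypothetical stand-in for the data side of a lazily loading model.
// The method names mirror canFetchMore()/fetchMore() from QAbstractItemModel.
class LazyRecordSource {
public:
    explicit LazyRecordSource(std::istream& in, std::size_t batchSize = 1000)
        : in_(in), batchSize_(batchSize) {
        assert(batchSize > 0);  // a zero batch would never make progress
    }

    // True until a read attempt has hit the end of the stream.
    bool canFetchMore() const { return !exhausted_; }

    // Parses at most batchSize_ further lines; returns how many were added.
    std::size_t fetchMore() {
        std::size_t added = 0;
        std::string line;
        while (added < batchSize_ && std::getline(in_, line)) {
            std::istringstream parts(line);
            Record r;
            if (parts >> r.id >> r.timestamp >> r.fields[0] >> r.fields[1]
                      >> r.fields[2] >> r.fields[3]) {
                loaded_.push_back(r);
                ++added;
            }
            // Malformed lines are skipped silently in this sketch.
        }
        if (!in_) exhausted_ = true;  // getline failed: nothing left to read
        return added;
    }

    const std::vector<Record>& loaded() const { return loaded_; }

private:
    std::istream& in_;
    std::size_t batchSize_;
    bool exhausted_ = false;
    std::vector<Record> loaded_;
};
```

Only the rows that have actually been fetched occupy memory, and a plain struct per line is much lighter than a QTreeWidgetItem per line. Note that this sketch never discards rows that were already loaded, so it delays the memory cost rather than capping it.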

Django: Correct way of passing data object from view to template to view?

From a template users can upload a csv file which gets parsed in
def parseCSV(request):
    # magic happens here (conforming date formats and all such fun things)
    # return column names to template
    ...
This view returns a list of columns and the user is asked to pick x columns to save.
The user's choice is posted to
def saveCSV(request):
    # logic for saving
    ...
Now my question is: what is the most correct way to handle the CSV data object between view 1 and view 2? Do I save it as a temporary file, or do I send it back and forth (view1 -> template -> view2) as a data object? Or is there a third option?
There is no "correct" way as it all depends on the concrete situation. In this case, it depends on the size of the data from the CSV file. Given that the data is rather large, the best approach is most likely to store the parsed data on the server, and then in the next request only send the user's selection of the full data set.
I would suggest parsing the data and storing it as a JSON blob in the database, so that you can easily retrieve it for the next request. This way you can send the user's selection of rows and columns (or "coordinates"), and save that as real data afterwards. The benefit of storing it right away is that the user can return to the process even after leaving the flow. The downside is that you save unused data if the user never completes the process, and you might need to clear it out later. Storing it in a table that contains only temporary data should make that cleanup easier.
I would parse the CSV file on the frontend and let the user choose the columns there. Once the columns are chosen, I would send only those columns, with their values, to the backend.

Alternatives to dynamically creating model fields

I'm trying to build a web application where users can upload a file (specifically the MDF file format) and view the data in the form of various charts. The files can contain any number of time-based signals (various numeric data types) and users may name the signals wildly.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to this as data tables) for each file uploaded, within the table each column will be one signal (first column being timestamps).
This brings the problem that I can't pre-define the Model for the data tables because the number, name, and datatype of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres-specific model field which can take nested arrays (which I believe fits the 2-D time-based signal lists). In theory I could parse the raw uploaded file, construct such an array, and basically save all the data in one field. Since I don't know the limit on the size of the data, this could also be a nightmare for queries later on, because creating a chart usually takes only a few columns of signals at a time, out of a total of up to hundreds of signals.
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number, and datatypes of the fields will differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition, such as Psycopg offers. You will be programmatically creating a table for each combination of user and uploaded file (if they are different) and programmatically inserting the records.
Using PostgreSQL might be a good choice; you might also create a GIN index on the arrays to speed up queries.
However, if you are primarily working with time-series data, then a time-series database such as InfluxDB or Prometheus makes more sense.

DynamoDB Query in a tight loop or scan?

Here is my basic data structure (or the relevant portions anyway) in DynamoDB: I have a files table that holds file data and has an id for the file. I also have a 'Definitions' table that holds items defined in the file. Definitions also have an ID (as the primary key) as well as a field called 'SourceFile' that references the file id in order to tie the definition to its source file.
Most of the time I just want to get the definition by its id and optionally get the file later, which works just fine. However, in some cases I need to get all definitions for a set of files. I can do this with a scan, but it's slow, it isn't recommended, and I know it will get slower as the table grows. However, I'm not sure how to do this with a query.
I can create a GSI that uses the SourceFile field as the primary key and use that to query against. This sounds like an answer (and may be), however I'm not sure. The problem is that some libraries may have 5k or 10k files (maybe more in rare cases). In a GSI I can only query against 1 file ID per query so I would have to throw a new query for each file and I can't imagine it's going to be very efficient to throw 10K queries at DynamoDB...
Is it better to create a tight loop (or multiple threads) and hit it with a ton of queries or to scan the table? Is there another way to do this that I'm not thinking of?
This is during an indexing and analysis process that is expected to take a bit of time so it's ok that it's not instant but I'd like it to be as efficient as possible...
Scans are the most efficient if you expect to be looking for a majority of data in your database. You can retrieve up to 1MB per scan request, and for each unit of capacity available you can read 4KB, so assuming you have enough capacity provisioned, you can retrieve thousands of items in a single request (assuming the items are pretty small).
The only alternative I can think of is to add more metadata that can help you index the files & definitions at a higher level - like, for instance, the library name/id. With that you can create a GSI on library name/id and query that way.
Running thousands of queries is going to be less efficient than scanning, assuming you are storing on the order of tens or hundreds of thousands of items.

Qt splitting data structure into groups

I have a problem I'm trying to solve, but I'm at a standstill because I'm still learning Qt, which leaves me unsure what the 'Qt' way of solving it is while staying as efficient as possible in terms of time complexity. I read a file line by line (file qty ranging between 10-2000,000). At the moment my approach is to dump every line into a QVector.
#include <QString>
#include <QVector>
QVector<QString> lines;
lines.append("id,name,type");
lines.append("1,James,A");
lines.append("2,Mark,B");
lines.append("3,Ryan,A");
Assuming the above structure, I would like to give the user three views that present the data based on the type field. The data is comma-delimited in its original form. My question is: what is the most elegant and, if possible, efficient way to achieve this?
Note: As a visual aid, the end result roughly emulates Microsoft Access. There will be a list of tables on the left side. In my case these table names will be the values of the grouping field (A, B). When I switch between those list items, the central view (a table) will refill to contain that particular group's data.
Should I split the data into x amount of structures ? Or would that cause unnecessary overhead ?
Would really appreciate any help
In the end, you'll want to have some sort of a data model that implements QAbstractItemModel that exposes the data, and one or more views connected to it to display it.
If the data doesn't have to be editable, you could implement a custom table model derived from QAbstractTableModel that maps the file in memory (using QFile::map), and incrementally parses it on the fly (implement canFetchMore and fetchMore).
If the data is to be editable, you might be best off throwing it all into a temporary sqlite table as you parse the file, attaching a QSqlTableModel to it, and attaching some views to it.
When the user wants to save the changes, you simply iterate over the model and dump it out to a text file.
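On the question of whether to split the data into several structures: one option that avoids copying is to keep a single container of lines and build a small index from each group value to the rows that belong to it. The sketch below uses only the standard library (std::vector and std::string instead of QVector and QString) so it can be compiled on its own; the function names splitCsvLine and groupByColumn are hypothetical, and it assumes simple comma-delimited rows without quoted fields and with a header on line 0.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits one comma-delimited line into fields (no quoting support).
std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ',')) fields.push_back(field);
    return fields;
}

// Maps each distinct value of the grouping column to the indices of the
// lines that carry it. Line 0 is treated as the header and is not grouped.
std::map<std::string, std::vector<std::size_t>>
groupByColumn(const std::vector<std::string>& lines, std::size_t column) {
    // Precondition for this sketch: a header exists and has that column.
    assert(!lines.empty() && column < splitCsvLine(lines[0]).size());

    std::map<std::string, std::vector<std::size_t>> groups;
    for (std::size_t i = 1; i < lines.size(); ++i) {
        const std::vector<std::string> fields = splitCsvLine(lines[i]);
        // Rows that are too short to have the column are left ungrouped.
        if (column < fields.size()) groups[fields[column]].push_back(i);
    }
    return groups;
}
```

The keys of the returned map would populate the list on the left, and the index vector for the selected key tells the table view which rows to show. Inside Qt, a QSortFilterProxyModel placed on top of the single source model is another way to get a per-group view without duplicating the data.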

How do I join huge csv files (1000's of columns x 1000's rows) efficiently using C/C++?

I have several (1-5) very wide (~50,000 columns) .csv files. The files are (.5GB-1GB) in size (avg. size around 500MB). I need to perform a join on the files on a pre-specified column. Efficiency is, of course, the key. Any solutions that can be scaled out to efficiently allow multiple join columns is a bonus, though not currently required. Here are my inputs:
-Primary File
-Secondary File(s)
-Join column of Primary File (name or col. position)
-Join column of Secondary File (name or col. position)
-Left Join or Inner Join?
Output = 1 File with results of the multi-file join
I am looking to solve the problem using a C-based language, but of course an algorithmic solution would also be very helpful.
Assuming that you have a good reason not to use a database (for all I know, the 50,000 columns may constitute such a reason), you probably have no choice but to clench your teeth and build yourself an index for the right file. Read through it sequentially to populate a hash table where each entry contains just the key column and an offset in the file where the entire row begins. The index itself then ought to fit comfortably in memory, and if you have enough address space (i.e. unless you're stuck with 32-bit addressing) you should memory-map the actual file data so you can access and output the appropriate right rows easily as you walk sequentially through the left file.
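A rough sketch of that index-then-walk approach follows. It is standard C++ only and uses seekg() on the right-hand stream rather than memory-mapping, which is a simplification of what is described above. The names fieldAt, buildIndex, and joinFiles are hypothetical. The sketch also assumes plain comma-delimited rows without quoted fields, treats a header row like any other row, keeps only the first right-hand row per key, and does not pad unmatched rows in a left join.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Returns field number `column` (0-based) of a comma-delimited row,
// or an empty string if the row has too few fields.
std::string fieldAt(const std::string& row, std::size_t column) {
    std::size_t start = 0;
    for (std::size_t i = 0; i < column; ++i) {
        start = row.find(',', start);
        if (start == std::string::npos) return std::string();
        ++start;
    }
    return row.substr(start, row.find(',', start) - start);
}

using OffsetIndex = std::unordered_map<std::string, std::streamoff>;

// Pass 1: scan the right-hand file once and remember where each key's row
// starts. If a key occurs more than once, the first occurrence wins.
OffsetIndex buildIndex(std::istream& right, std::size_t keyColumn) {
    OffsetIndex index;
    std::string row;
    while (true) {
        const std::streamoff offset = right.tellg();
        if (!std::getline(right, row)) break;
        index.emplace(fieldAt(row, keyColumn), offset);
    }
    right.clear();  // reset EOF/fail so the stream can be repositioned later
    return index;
}

// Pass 2: walk the left file sequentially and fetch each matching right row.
// keepUnmatched = true behaves like a left join, false like an inner join.
void joinFiles(std::istream& left, std::size_t leftKey,
               std::istream& right, const OffsetIndex& index,
               bool keepUnmatched, std::ostream& out) {
    assert(left.good() && right.good());  // both inputs must be readable
    std::string leftRow, rightRow;
    while (std::getline(left, leftRow)) {
        const auto hit = index.find(fieldAt(leftRow, leftKey));
        if (hit != index.end()) {
            right.clear();
            right.seekg(hit->second, std::ios::beg);
            std::getline(right, rightRow);
            out << leftRow << ',' << rightRow << '\n';  // key column appears twice
        } else if (keepUnmatched) {
            out << leftRow << '\n';
        }
    }
}
```

With real files you would pass std::ifstream objects opened in binary mode, so that the offsets from tellg() stay valid for seekg(). Only the key and one offset per row are held in memory, which is why the index should stay small even when each row has tens of thousands of columns.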
Your best bet by far is something like Sqlite. There are C++ bindings for it, and it's tailor-made for lightning-fast inserts and queries.
For the actual reading of the data, you can just go row by row and insert the fields in Sqlite, no need for cache-destroying objects of objects :) As an optimization, you should group up multiple inserts in one statement (insert into table(...) select ... union all select ... union all select ...).
If you need to use C or C++, open the file and load the file directly into a database such as MySQL. The C and C++ languages do not have adequate data table structures nor functionality for manipulating the data. A spreadsheet application may be useful, but may not be able to handle the capacities.
That said, I recommend objects for each field (column). Define a record (file specific) as a collection of fields. Read a text line from a file into a string. Let the record load the field data from the string. Store records into a vector.
Create a new record for the destination file. For each record from the input file(s), load the new record using those fields. Finally, for each record, print the contents of each field with separation characters.
An alternative is to whip up a 2 dimensional matrix of strings.
Your performance bottleneck will be I/O. You may want to read the data in huge blocks. The obstacle to doing that efficiently is the variable record length of a CSV file.
I still recommend using a database. There are plenty of free ones out there, such as MySQL.
It depends on what you mean by "join". Are the columns in file 1 the same as in file 2? If so, you just need a merge sort. Most likely a solution based on merge sort is "best". But I agree with @Blindy above that you should use an existing tool like Sqlite. Such a solution is probably more future-proof against changes to the column lists.