I want to read a big file with CSV data (>1 GB, an export from an ERP system) and provide a table interface for the data.
In fact, I already have a working table class. It is structured in this (abstract) fashion:
a table row which is a vector for the column data
a vector for the rows.
Reading the big files runs into memory problems, I think because each vector needs its whole memory block at once on the heap. So I created a new class which holds only pointers to the strings in each column, like this:
a table row which is a vector<string *> for the column data
a vector<row> for the rows.
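A simplified sketch of that layout (the names Row and Table are just placeholders; my real classes have more members):

    #include <string>
    #include <vector>

    // Simplified sketch of the pointer-based layout.
    struct Row {
        std::vector<std::string*> columns;   // one pointer per column value
    };

    struct Table {
        std::vector<Row> rows;               // one Row per CSV line

        void addCell(Row& row, const std::string& value) {
            // each string is allocated separately on the heap
            row.columns.push_back(new std::string(value));
        }
    };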
This works better. It has about a 1/3 smaller memory footprint on the heap. I think the separate string allocations fit into some holes on the heap ;-)
But if the data gets bigger, the memory problem is still there.
Reading and converting the file takes about 2 minutes.
I tried SQLite, but the import is very slow. Reading the big file (about 3,000,000 lines) and inserting the rows takes about 15 hours. I know I can speed this up a lot, but I do not really know if this is the right solution. BTW: the SQLite browser crashes when importing such a file!
Does anyone else have such problems, or do you know a good way to manage the memory for such big data tables? The table is a lookup table for some tasks, so it should fit into memory at once, if possible.
Currently I am working with Visual Studio C++ 2012.
Without knowing too much about your problem, here is what I did when I had a similar situation 10 years ago. A dump into an Oracle database took 36 hours; the approach below (sketched in code after the steps) more than halved that, to 16:
Create a bunch of buffers (say, 10,000 lines of data each) and have one thread read the data into them in a circular fashion.
Then have another thread start actually working on the data.
Admittedly this only works if each row is independent of others.
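In today's C++ the idea would look roughly like this. This is only a sketch: the block size, the file name, and the parsing step are placeholders you would replace with your own.

    #include <condition_variable>
    #include <cstddef>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // Sketch: one reader thread fills blocks of lines, one worker thread consumes them.
    static const std::size_t kBlockSize = 10000;      // lines per block, tune as needed

    std::queue<std::vector<std::string> > blocks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void reader(const char* path) {
        std::ifstream in(path);
        std::vector<std::string> block;
        std::string line;
        while (std::getline(in, line)) {
            block.push_back(line);
            if (block.size() == kBlockSize) {
                { std::lock_guard<std::mutex> lock(m); blocks.push(std::move(block)); }
                cv.notify_one();
                block.clear();
            }
        }
        {
            std::lock_guard<std::mutex> lock(m);
            if (!block.empty()) blocks.push(std::move(block));
            done = true;
        }
        cv.notify_one();
    }

    void worker() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !blocks.empty() || done; });
            if (blocks.empty()) break;                 // reader finished and nothing is left
            std::vector<std::string> block = std::move(blocks.front());
            blocks.pop();
            lock.unlock();
            // parse the lines in 'block' / build table rows here
        }
    }

    int main() {
        std::thread r(reader, "big_export.csv");       // file name is only an example
        std::thread w(worker);
        r.join();
        w.join();
        return 0;
    }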
Edit: This link about memory locality may help. Essentially, use plain arrays ([]) instead of vectors.
Related
I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.
I am using my own code to read it line by line, then I split each line into fields and convert the fields to the corresponding data types (such as integer, double, etc.). These data are then passed to class objects via their constructors.
However, I found this is not very efficient: it takes about 1 minute to read the 20k+ lines and create the 20k+ objects.
I've googled for fast CSV parsers and found there are many options. I've tried some of them, but I was not very satisfied with their performance.
Does anyone have a better method to read large .csv files? Many thanks in advance.
An efficient method for parsing, or for that matter processing, files is to read as much of the file into memory as possible before you start parsing.
File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond. Reading the data from a hard drive may take 1 millisecond == 1000 microseconds.
I've made programs faster by allocating a large array for the data then reading the data into the array. Next I process the data in the array and repeat until the entire file is processed.
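For example, something along these lines (only a sketch: the 64 MB buffer size is arbitrary, and handling a line that is split across two chunks is left out):

    #include <fstream>
    #include <vector>

    // Read the file in large chunks and parse each chunk from memory.
    void processFile(const char* path) {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buffer(64 * 1024 * 1024);   // 64 MB per read; tune to taste

        while (in) {
            in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
            std::streamsize got = in.gcount();
            if (got <= 0)
                break;
            // parse buffer[0 .. got) here; remember that a line may be
            // split across two chunks and needs to be stitched together
        }
    }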
Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
Please edit your post to show the code where the bottleneck is.
I have the following problem to solve. I have to build a graph viewer to view a massive data set.
We have some files in a particular format that has millions of records representing the result of an experiment. Each record represents a sample point on a large graph plot. The biggest file I have seen has 43.7 Million records.
An average file contains 10 million records. Each record is small (76 bytes + an optional 12 bytes each). The complete data cannot be loaded into main memory as it is too large. I have built a new file format that compresses the data to 48 bytes per record and organises the data into chunks that are associated with each other. I want to "view" the data by displaying the records in a 2D/3D plot. As the data is very dense, I would like to progressively increase the level of detail by loading more data and removing data that is not shown in the view from main memory.
I would also like to access groups of associated records in real time and pre-load similar records in order to keep the loading time to a bare minimum. This will give the user smooth control when viewing the data, instead of an experience similar to watching a video on YouTube over a very slow internet connection. The user cannot navigate randomly and has to use the controls, and I would like to use this information to load the relevant records into main memory.
The data has to be loaded progressively from disc based on what is currently in main memory. Records in main memory that are not required in the current context can be removed and, if required, reloaded.
How do I access data on disc at high speed based on some hash number?
How do I manage main memory if the data to be viewed in the current context is too large? If your answer is level of detail, then how do I build it for a large data set, and should this data be part of the file?
I have been working on this for the last two weeks and I seem to be stuck due to I/O speed.
I am working in native C++ and I cannot use anything that is under the GPL. If you need any more info, let me know.
Ram
Under most modern operating systems (Linux, other Unixes, Windows) you can map a file into memory.
This means you can access the content of the file as if it were entirely in memory (e.g. you can use data[i++], strchr(data, ...), etc.), and it is the operating system that does the mapping between memory and the file. When you read some data that is not already in memory, the OS fetches it from the file.
You should read this question's answer: Mmap() an entire large file
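On a POSIX system a minimal sketch looks like this (error handling trimmed, the file name is just an example; on Windows you would use CreateFileMapping/MapViewOfFile instead):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map a whole file read-only and access it like an in-memory array.
    int main() {
        int fd = open("experiment.dat", O_RDONLY);    // file name is just an example
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) != 0) return 1;

        void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        const char* data = static_cast<const char*>(p);
        // data[0 .. st.st_size) now behaves like memory; the OS pages it in on demand.
        // ... read your records directly out of 'data' here ...

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }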
I think you are looking for an organization similar to what's used to store level geometry in games, except that you may (depending on how your program works and what data you need to show) need only one dimension. See Quadtree and similar methods (bottom of that article).
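As a very rough illustration of the idea (all names and fields here are made up, not taken from your format):

    #include <memory>
    #include <vector>

    // Sketch of a point quadtree used to bucket samples spatially, so that only
    // the nodes intersecting the current view need to be resident in memory.
    struct Point { double x, y; };

    struct QuadNode {
        double minX, minY, maxX, maxY;          // bounds of this node
        std::vector<Point> samples;             // samples kept at this level of detail
        std::unique_ptr<QuadNode> child[4];     // NW, NE, SW, SE; null until subdivided

        bool intersectsView(double vMinX, double vMinY,
                            double vMaxX, double vMaxY) const {
            return !(maxX < vMinX || minX > vMaxX || maxY < vMinY || minY > vMaxY);
        }
    };

    // Collect the samples of every node visible in the current viewport.
    void collectVisible(const QuadNode& n, double vMinX, double vMinY,
                        double vMaxX, double vMaxY, std::vector<Point>& out) {
        if (!n.intersectsView(vMinX, vMinY, vMaxX, vMaxY)) return;
        out.insert(out.end(), n.samples.begin(), n.samples.end());
        for (int i = 0; i < 4; ++i)
            if (n.child[i])
                collectVisible(*n.child[i], vMinX, vMinY, vMaxX, vMaxY, out);
    }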
I am reading several files from the Linux /proc fs and I will have to insert those values into a database. This should be as optimal as possible. So which is cheaper:
i) converting them to int while I store them in memory, and later converting them back to string when I build my INSERT statement,
ii) or keeping them as strings and just sanitizing the values (removing ':', spaces, etc.)?
iii) What should I take into account to make this decision?
I am already splitting the lines, because the order in which they come is not good enough for me.
Thanks,
Pedro
Edit - Clarification
Sorry guys, my scenario is the following: I am measuring CPU, memory, network, disk, etc. every 10 seconds. We are developing our own database system, so I cannot count on anything more than plain INSERT statements.
I got interested in this optimization because of the frequency of parsing the data. It is going to be write-once: there will be no updates to the data after it is written.
You seem to be performing an archiving activity [write once, read probably at most once] (storing the DB for later, rare/infrequent use); if not, you should put the optimization emphasis on how the data will be read (not written).
If this is the archiving case, maybe inserting BLOBs (binary large objects, or similar concepts) into the DB will be more efficient.
Addition:
Apparently it will depend on how you will read the data. Are you just listing the data for browsing purposes later on, or will there be more complex queries based on the measured values?
For example, if you will later perform something like SELECT * FROM db.Log WHERE log.time > time1 AND Max(Memory) < 5000, then it is best to keep each value in its original format (int in an integer column, string in a string column, etc.) so that the main data processing is left to the DB server.
I am working on an application built upon Wt.
We have a performance problem, as it must display a lot of data in a WTableView associated with a WStandardItemModel.
For each new item to be added to the table it does:
model->setData( row, column, data )
(which happens a few thousand times).
Is there some way to make it faster? Some other way to add data to the table?
It can take 2 seconds to generate the data and several minutes to display it...
WStandardItemModel is a general-purpose model that is easy to use, but it's not optimal for very large models. Try to specialize a WAbstractTableModel; you only need to reimplement three methods and you can read your data from wherever it resides, or compute it on the fly.
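Roughly like this, as a sketch against the Wt 3 API (the in-memory vector of doubles is just an invented example data source; check the exact signatures of rowCount(), columnCount(), and data() for your Wt version):

    #include <Wt/WAbstractTableModel>
    #include <boost/any.hpp>
    #include <vector>

    // Sketch: a read-only model over data you already hold in memory,
    // instead of copying every cell into a WStandardItemModel.
    class BigTableModel : public Wt::WAbstractTableModel {
    public:
        BigTableModel(const std::vector< std::vector<double> >& rows,
                      Wt::WObject* parent = 0)
            : Wt::WAbstractTableModel(parent), rows_(rows) {}

        virtual int rowCount(const Wt::WModelIndex& parent = Wt::WModelIndex()) const {
            return parent.isValid() ? 0 : static_cast<int>(rows_.size());
        }

        virtual int columnCount(const Wt::WModelIndex& parent = Wt::WModelIndex()) const {
            if (parent.isValid() || rows_.empty()) return 0;
            return static_cast<int>(rows_[0].size());
        }

        virtual boost::any data(const Wt::WModelIndex& index,
                                int role = Wt::DisplayRole) const {
            if (role == Wt::DisplayRole)
                return rows_[index.row()][index.column()];
            return boost::any();
        }

    private:
        const std::vector< std::vector<double> >& rows_;   // caller keeps the data alive
    };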
It's not normal that a view takes minutes to display. I've used views on tables with many thousands of entries without performance problems. Was your system swapping because of the memory wasted in an (extremely large) WStandardItemModel?
SQLite has a reputation for being small, fast and flexible. I used it in one of my C++ projects to save simple statistics to a file. Once every 15 minutes, 3-5 new simple records (5 rows of integers) were saved into the database. During a few weeks of such SQLite usage I observed clearly noticeable disk usage. I wasn't expecting that, because the amount of data written was very small. If I wrote it to a plain text file, the disk activity would be hardly noticeable. Is SQLite really such a light database, or was my problem too simple for a relational database?
VACUUM may solve your problem.
http://www.sqlite.org/lang_vacuum.html
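If you want to run it from your C++ code rather than the sqlite3 shell, something like this should work (error handling kept minimal; db is assumed to be a handle you opened earlier with sqlite3_open()):

    #include <cstdio>
    #include <sqlite3.h>

    // Reclaim unused space in the database file.
    void vacuumDatabase(sqlite3* db) {
        char* errMsg = 0;
        if (sqlite3_exec(db, "VACUUM;", 0, 0, &errMsg) != SQLITE_OK) {
            std::fprintf(stderr, "VACUUM failed: %s\n", errMsg);
            sqlite3_free(errMsg);
        }
    }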
Well, I have used SQLite to store a table with the contents of an English dictionary with 100,000 entries, and it occupied about 20 MB, so I don't think the problem lies with SQLite. But it would be good if you provided more clues in order to get a more accurate answer.