What's the best way to write a frequently changing dataset to a file?
For example, a 12 MB dataset in which 4 KB segments change every 2 seconds. Rewriting the entire 12 MB each time seems like a waste.
Is there any way to do this using C/C++?
Yes, you can write starting at a particular offset in a file. In C this is done with fseek; C++ file streams offer the equivalent seekp, so only the changed segment has to be written.
See http://www.cplusplus.com/reference/clibrary/cstdio/fseek/ for an example
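For the C++ side, here is a minimal sketch of the same idea using std::fstream and seekp. The function name write_segment and its parameters are made up for illustration; it assumes the file already exists and is at least offset bytes long.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Overwrite `data` at byte `offset` inside an existing file, leaving the
// rest of the file untouched. Returns false if the file cannot be opened
// or the write fails. (Hypothetical helper, not from the linked page.)
bool write_segment(const std::string& path, std::streamoff offset,
                   const std::vector<char>& data) {
    assert(offset >= 0);
    // in | out | binary opens the existing file without truncating it.
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!file) return false;
    file.seekp(offset, std::ios::beg);  // move the write position
    file.write(data.data(), static_cast<std::streamsize>(data.size()));
    file.flush();                       // hand the bytes to the OS
    return static_cast<bool>(file);
}
```

For the 12 MB case you would call it with the offset of the changed 4 KB segment and the new 4 KB of data, so only that part is written.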
Related
I want to write to a text file with a limited size (1 KB for example).
When the file reaches the maximum size, I want the first half to be removed and writing to continue (appending to what remained after removing the first half).
Is this possible in C++?
for example:
file that reached maximum size:
1 2 3 4 5 6
and I want to continue writing [7,9]
the new file will look like:
4 5 6 7 8 9
You could use a dedicated logging library, for example spdlog:
https://github.com/gabime/spdlog
It has a ton of features, including rotating logs.
You can define how big a log file should be, and how many logfiles you want. It can automatically discard the older logs.
If you really want to write it yourself, you have to keep the content you want to retain in memory, e.g. in a ring buffer.
Whenever something is added to that buffer, you have to rewrite the file and then flush it; otherwise the data is lost if the program crashes.
This can have a big performance impact, so handle with care.
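A rough sketch of the do-it-yourself variant, simplified so that the file itself is re-read instead of keeping a ring buffer in memory. The name append_capped and the "drop the first half by byte count" rule are assumptions for illustration; a real version would probably cut at a line or token boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Append `text` to the file at `path`. If the result would exceed
// `max_size` bytes, the first half of the existing content is dropped
// first. The whole file is rewritten on every call, which is the
// performance cost mentioned above. (Hypothetical helper.)
void append_capped(const std::string& path, const std::string& text,
                   std::size_t max_size) {
    assert(max_size > 0);
    std::string content;
    {
        std::ifstream in(path, std::ios::binary);
        if (in) {
            content.assign(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
        }
    }
    if (content.size() + text.size() > max_size) {
        content.erase(0, content.size() / 2);  // discard the oldest half
    }
    content += text;
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << content;
    out.flush();  // push the data out of the stream buffer
}
```

With a 12-byte limit and a file containing "1 2 3 4 5 6 ", appending "7 8 9 " leaves "4 5 6 7 8 9 ", which matches the example in the question. Note that a single very large append can still push the file over the limit in this sketch.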
I need to sort huge (10, 20, 30 GB...) text files. So obviously I'm going to split the file into smaller files, sort them separately and then merge them. But how am I supposed to choose the size of the chunks I split the main file into, so that it is as efficient as possible? If I want to write cross-platform code, I can't get the RAM size.
Is there a way to check whether I can read, say, 1 GB at once (or should I start with a larger number?), and if not, try 800 MB and so on? Or is there some other way to handle this?
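For reference, the split, sort and merge approach described in the question can be sketched as below. This does not answer how to pick the chunk size; it simply takes it as a parameter (max_chunk_bytes) so that different values can be measured. The function names and the ".run" temporary file naming are made up, and lines are compared as plain strings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sort one chunk in memory and write it to its own run file.
static std::string write_run(std::vector<std::string>& lines,
                             const std::string& prefix, std::size_t index) {
    std::sort(lines.begin(), lines.end());
    const std::string path = prefix + std::to_string(index);
    std::ofstream out(path);
    for (const auto& l : lines) out << l << '\n';
    return path;
}

// External merge sort of the lines of `in_path` into `out_path`.
void external_sort(const std::string& in_path, const std::string& out_path,
                   std::size_t max_chunk_bytes) {
    assert(max_chunk_bytes > 0);
    std::ifstream in(in_path);
    std::vector<std::string> run_paths;
    std::vector<std::string> lines;
    std::size_t bytes = 0;
    std::string line;

    // Phase 1: cut the input into sorted runs of roughly max_chunk_bytes.
    while (std::getline(in, line)) {
        bytes += line.size() + 1;
        lines.push_back(std::move(line));
        if (bytes >= max_chunk_bytes) {
            run_paths.push_back(write_run(lines, out_path + ".run", run_paths.size()));
            lines.clear();
            bytes = 0;
        }
    }
    if (!lines.empty()) {
        run_paths.push_back(write_run(lines, out_path + ".run", run_paths.size()));
    }

    // Phase 2: k-way merge of all runs using a min-heap of (line, run index).
    std::vector<std::ifstream> runs;
    for (const auto& p : run_paths) runs.emplace_back(p);
    using Entry = std::pair<std::string, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        if (std::getline(runs[i], line)) heap.emplace(line, i);
    }
    std::ofstream out(out_path);
    while (!heap.empty()) {
        const Entry top = heap.top();
        heap.pop();
        out << top.first << '\n';
        if (std::getline(runs[top.second], line)) heap.emplace(line, top.second);
    }

    runs.clear();  // close the run files before deleting them
    for (const auto& p : run_paths) std::remove(p.c_str());
}
```

Only one chunk is held in memory during the first phase and one line per run during the merge, so the chunk size mainly trades memory against the number of run files.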
I have a text file which looks like below:
0.001 ETH Rx 1 1 0 B45678810000000000000000AF0000 555
0.002 ETH Rx 1 1 0 B45678810000000000000000AF 23
0.003 ETH Rx 1 1 0 B45678810000000000000000AF156500
0.004 ETH Rx 1 1 0 B45678810000000000000000AF00000000635254
I need a way to read this file, form a structure from each line and send it to a client application.
Currently, I can do this with the help of a circular queue from Boost.
The requirement here is to access different data at different times.
Ex: If I want to access the data at 0.03 sec while I am currently at 100 sec, what is the best way to do this without tracking the file pointer or loading the whole file into memory, which causes a performance bottleneck? (Consider that I have a 2 GB file with the above kind of data.)
Usually the best practice for handling large files depends on the platform architecture (x86/x64) and the OS (Windows, Linux, etc.).
Since you mentioned Boost, have you considered using a Boost memory-mapped file?
Boost Memory Mapped File
It all depends on:
a. how frequent the data access is
b. what the data access pattern is
Splitting the file
If you only need to access the data once in a while, the single 2 GB log design is fine. If not, the logger can be tuned to start a new log file at periodic intervals, or a later step can split the 2 GB file into smaller files as needed. Fetching the file that covers the requested range, reading it and picking out the needed lines is then easier, because far fewer bytes have to be read.
Cache
For very frequent data access, maintaining a cache is a good way to get faster responses, although, as you said, it has its own bottleneck. The size and organization of the cache depend on (b), the data access pattern. A larger cache also responds more slowly, so its size should be tuned rather than maximized.
Database
If the search pattern is unordered or grows dynamically with usage, a database will work, although it will not respond as fast as a small cache.
A database with tables organized to support your type of query, combined with a small cache layer, will give the best result.
Here is the solution I found:
Used circular buffers (Boost lock-free buffers) to parse the file and to store each line in structured form.
Used separate threads:
One continuously parses the file and pushes to a lock-free queue.
One continuously reads from the buffer, processes the line, forms a structure and pushes it to another queue.
Whenever the user needs random data, based on time, I move the file pointer to the particular line and read only that line.
Both threads have mutex wait mechanisms to stop parsing once the predefined buffer limit is reached.
The user can get data at any time, and there is no need to store the complete file contents. As each frame is read, I delete it from the queue, so the file size doesn't matter. The parallel threads that fill the buffers avoid spending time on reading the file every time.
If I want to move to another line, I move the file pointer, wipe the existing data and start the threads again.
Note:
The only remaining issue is moving the file pointer to a particular line.
Currently I need to parse line by line until I reach that point.
A way to move the file pointer directly to the required line would be helpful; binary search or any other efficient search algorithm would give me what I want.
I would appreciate it if anybody could give a solution for this new issue!
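One possible approach, sketched under two assumptions that are not stated in the post: every line starts with a numeric timestamp, and the lines are sorted by that timestamp. In that case a binary search over byte offsets works without an index: seek to the middle, skip to the next line start, read the timestamp and halve the range. The function names are hypothetical, and the stream should be opened in binary mode so that offsets are real byte positions.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Offset of the first line that starts at or after `pos`,
// or `size` if no line starts there.
std::streamoff line_start_at_or_after(std::ifstream& f, std::streamoff pos,
                                      std::streamoff size) {
    if (pos <= 0) return 0;
    if (pos >= size) return size;
    f.clear();
    f.seekg(pos - 1);
    std::string skipped;
    std::getline(f, skipped);     // consume up to and including the next '\n'
    if (f.eof()) return size;     // no newline left: nothing starts after pos
    return pos + static_cast<std::streamoff>(skipped.size());
}

// Offset of the first line whose leading timestamp is >= target,
// or the file size if there is no such line.
std::streamoff find_first_at_or_after(std::ifstream& f, double target) {
    assert(f.is_open());
    f.clear();
    f.seekg(0, std::ios::end);
    const std::streamoff size = f.tellg();
    std::streamoff lo = 0, hi = size;
    while (lo < hi) {
        const std::streamoff mid = lo + (hi - lo) / 2;
        const std::streamoff start = line_start_at_or_after(f, mid, size);
        bool at_or_after = true;  // running off the end counts as "past target"
        if (start < size) {
            f.clear();
            f.seekg(start);
            double ts = 0.0;
            if (f >> ts) at_or_after = (ts >= target);
        }
        if (at_or_after) hi = mid; else lo = mid + 1;
    }
    return line_start_at_or_after(f, lo, size);
}
```

After the call, seekg to the returned offset and getline gives the wanted line. Each probe reads only one or two lines, so even a 2 GB file needs only a few dozen seeks.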
I want to read a big file with csv-data (>1 GB, export from an ERP System) and provide a table interface for the data.
I already have a working table class. It is organized (in abstract terms) like this:
a table row which is a vector for the column data
a vector for the rows.
Reading the big files with this runs into memory problems, I think because the vector needs all of its memory at once on the heap. So I created a new class which has only pointers to strings in the columns, like this:
a table row which is a vector<string *> for the column data
a vector<row> for the rows.
This works better. Its memory footprint on the heap is about 1/3 smaller. I think the separated string data fits in some holes on the heap ;-)
But as the data gets bigger, the memory problem comes back.
Reading the file and converting it takes about 2 minutes.
I tried SQLite, but the import is very slow. Reading the big file (about 3,000,000 lines) and inserting the rows takes about 15 hours. I know I can speed this up a lot, but I do not really know if this is the solution. BTW: the SQLite browser crashes while importing such a file!
Does anyone else have such problems, or do you know a good way to manage the memory for such BIG-Data-Tables? The table is a lookup table for some tasks, so it should fit into memory all at once, if possible.
Currently I am working with Visual Studio C++ 2012.
Without knowing too much about your problem, here is what I did when I had a similar situation 10 years ago. Dumping the data into an Oracle database took 36 hours, and this more than halved it, to 16:
Create a bunch of buffers (say 10,000 lines of data) and have a thread read in the data into these in a circular fashion.
Then have another thread start actually working on the data.
Admittedly this only works if each row is independent of others.
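The two-thread scheme above can be sketched with the standard library alone. This is a simplified illustration, not the code I used back then: the class LineQueue, the function process_file and the per-line handle callback are all made-up names, and the queue holds single lines rather than 10,000-line buffers.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <fstream>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Bounded queue of lines shared by one reader and one worker thread.
class LineQueue {
public:
    explicit LineQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full.
    void push(std::string line) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return lines_.size() < capacity_; });
        lines_.push_back(std::move(line));
        not_empty_.notify_one();
    }

    // Marks the end of input so pop() can return false once drained.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Blocks until a line is available; false when closed and empty.
    bool pop(std::string& line) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !lines_.empty() || closed_; });
        if (lines_.empty()) return false;
        line = std::move(lines_.front());
        lines_.pop_front();
        not_full_.notify_one();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<std::string> lines_;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    bool closed_ = false;
};

// Reads `path` on the calling thread while a worker thread handles the
// lines. Returns the number of lines handled.
std::size_t process_file(const std::string& path, std::size_t capacity,
                         const std::function<void(const std::string&)>& handle) {
    assert(capacity > 0);
    LineQueue queue(capacity);
    std::size_t processed = 0;
    std::thread worker([&] {
        std::string line;
        while (queue.pop(line)) { handle(line); ++processed; }
    });
    std::ifstream input(path);
    std::string line;
    while (std::getline(input, line)) queue.push(std::move(line));
    queue.close();
    worker.join();
    return processed;
}
```

The capacity bounds memory use: the reader blocks when the worker falls behind, so the file is never held in memory as a whole.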
Edit: This link about memory locality may help. Essentially, use plain arrays ([]) instead of vectors.
I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.
I am using my own code to read it line by line; then I split each line into fields and convert the fields to the corresponding data types (such as integer, double, etc.). The data are then passed to class objects via their constructors.
However, I found that this is not very efficient. It took about 1 minute to read these 20k+ lines and create 20k+ objects.
I've googled for fast CSV parsers and found many options. I've tried some of them, but I am not very satisfied with their performance.
Does anyone have a better method to read large .csv files? Many thanks in advance.
An efficient method for parsing, or for that matter processing, files is to read as much of the file as possible into memory before you start parsing.
File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond. Reading the data from a hard drive may take 1 millisecond == 1000 microseconds.
I've made programs faster by allocating a large array for the data then reading the data into the array. Next I process the data in the array and repeat until the entire file is processed.
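A minimal sketch of that read-then-process loop. Counting newline characters stands in for the real parsing here, and the function name and the buffer_size parameter are just for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read the file in blocks of `buffer_size` bytes and process each block
// in memory. Here "processing" is just counting '\n' characters.
std::size_t count_newlines_chunked(const std::string& path,
                                   std::size_t buffer_size) {
    assert(buffer_size > 0);
    std::ifstream file(path, std::ios::binary);
    std::vector<char> buffer(buffer_size);
    std::size_t newlines = 0;
    while (file) {
        file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = file.gcount();  // bytes actually read
        newlines += static_cast<std::size_t>(
            std::count(buffer.begin(), buffer.begin() + got, '\n'));
    }
    return newlines;
}
```

A real CSV parser would also have to handle a record that is split across two blocks, for example by carrying the unfinished tail over to the front of the next block.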
Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
Please edit your post to show the code where the bottleneck is.