Reading big data in C++

I'm using C++ to read large files with over 30,000 lines and 3,000 columns, i.e. a (30000 x 3000) matrix. I'm using a 2D vector to push the read data into. But I need to do this process a couple of times. Is there any way to optimize the reading process?

I will give you some ideas but not an exact solution, because I do not know the full details of your system.
If you have a file this large and only some of the data changes between readings, try a database approach instead.
For performance you can use concurrent file reading (read the same file part by part using multiple threads).
If you need to process the data as well, use separate thread(s) for processing, possibly linked by a queue or parallel queues.
If your records have a fixed length (such as fixed-length numbers) and you know which locations changed, try to read only the changed data instead of reading and processing the whole file again and again.
If none of the above helps, use memory mapping. If you are looking for portability, Boost's memory-mapped files will reduce your work.

The memory-mapping approach is fine here, since there are only read operations.
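A minimal sketch of that approach using Boost.Iostreams, assuming the matrix is stored as whitespace-separated text and that a standard library with floating-point std::from_chars (C++17) is available; the file path and dimensions are placeholders:

```cpp
#include <boost/iostreams/device/mapped_file.hpp>
#include <cctype>
#include <charconv>
#include <cstddef>
#include <string>
#include <vector>

// Read a whitespace-separated numeric matrix through a read-only memory map.
// Row/column counts are passed in because this sketch does not infer them.
std::vector<std::vector<double>> read_matrix(const std::string& path,
                                             std::size_t rows, std::size_t cols)
{
    boost::iostreams::mapped_file_source src(path);      // maps the whole file read-only
    const char* p   = src.data();
    const char* end = p + src.size();

    std::vector<std::vector<double>> m(rows, std::vector<double>(cols));
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols && p < end; ++c) {
            while (p < end && std::isspace(static_cast<unsigned char>(*p)))
                ++p;                                      // skip separators
            auto res = std::from_chars(p, end, m[r][c]);  // parse in place, no copies or streams
            p = res.ptr;
        }
    return m;                                             // the mapping is released with src
}
```

If the same file is read several times, the operating system's page cache also keeps the mapped pages warm between runs, which helps the repeated readings mentioned in the question.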

Memory mapped IO concept details

I'm attempting to figure out what the best way is to write files in Windows. For that, I've been running some tests with memory mapping, in an attempt to figure out what is happening and how I should organize things...
Scenario: The file is intended to be used in a single process, in multiple threads. You should see a thread as a worker that works on the file storage; some of them will read, some will write - and in some cases the file will grow. I want my state to survive both process and OS crashes. Files can be large, say: 1 TB.
After reading a lot on MSDN, I whipped up a small test case. What I basically do is the following:
Open a file (CreateFile) using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH.
Build a mmap file handle (CreateFileMapping) on the file, using some file growth mechanism.
Map the memory regions (MapViewOfFile) using a multiple of the sector size (from STORAGE_PROPERTY_QUERY). The mode I intend to use is READ+WRITE.
So far I've been unable to figure out how to use these constructs exactly (tools like diskmon won't work for good reasons), so I decided to ask here. What I basically want to know is: how can I best use these constructs for my scenario?
If I understand correctly, this is more or less the correct approach; however, I'm unsure as to the exact role of CreateFileMapping vs MapViewOfFile and if this will work in multiple threads (e.g. the way writes are ordered when they are flushed to disk).
I intend to open the file once per process as per (1).
Per thread, I intend to create a mmap file handle as per (2) for the entire file. If I need to grow the file, I will estimate how much space I need, close the handle and reopen it using CreateFileMapping.
While the worker is doing its thing, it needs pieces of the file. So, I intend to use MapViewOfFile (which seems limited to 2 GB) for each piece, process it and unmap it again.
Questions:
Do I understand the concepts correctly?
When is data physically read and written to disk? So, when I have a loop that writes 1 MB of data in (3), will it write that data after the unmap call? Or will it write data the moment I hit memory in another page? (After all, disks are block devices so at some point we have to write a block...)
Will this work in multiple threads? This is about the calls themselves - I'm not sure if they will error if you have -say- 100 workers.
I do understand that (written) data is immediately available in other threads (unless it's a remote file), which means I should be careful with read/write concurrency. If I intend to write stuff, and afterwards update a (single-physical-block) header indicating that readers should use another pointer from now on, is it guaranteed that the data is written prior to the header?
Will it matter if I use 1 file or multiple files (assuming they're on the same physical device of course)?
Memory mapped files generally work best for READING; not writing. The problem you face is that you have to know the size of the file before you do the mapping.
You say:
in some cases the file will grow
Which really rules out a memory mapped file.
When you create a memory mapped file on Windoze, you are creating your own page file and mapping a range of memory to that page file. This tends to be the fastest way to read binary data, especially if the file is contiguous.
For writing, memory mapped files are problematic.
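For the read-only case, which is where mapping shines, a minimal sketch of the CreateFile / CreateFileMapping / MapViewOfFile sequence (error handling trimmed, file name is a placeholder):

```cpp
#include <windows.h>

int main()
{
    // Open the file for reading only.
    HANDLE file = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // A zero size means "map the whole file"; the mapping object itself
    // does not consume address space in the process yet.
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) { CloseHandle(file); return 1; }

    // MapViewOfFile is what reserves address space; map the whole file here,
    // or map smaller windows at an offset for very large files.
    const char* view = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (view) {
        // ... read through `view` as ordinary memory ...
        UnmapViewOfFile(view);
    }

    CloseHandle(mapping);
    CloseHandle(file);
}
```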

Extreme performance difference, when reading same files a second time with C

I have to read binary data into char arrays from large (2 GB) binary files in a C++ program. When reading the files for the first time from my SSD, reading takes about 6.4 seconds per file. But when running the same code again, or even after first running a different dummy program that does almost the same thing, the next readings take only about 1.4 seconds per file. The Windows Task Manager even shows much less disk activity on the second, third, fourth… run. So, my guess is that Windows' file caching is sparing me from waiting for data from the SSD when filling the arrays another time.
Is there any clean option to read the files into the file cache before the customer runs the software? Any better option than just loading the files with fread in advance? And how can I make sure the data remains in the file cache until I need it?
Or am I totally wrong with my File Cache assumption? Is there another (better) explanation for these different loading times?
Educated guess here:
You most likely are right with your file cache assumption.
Can you pre-load files before the user runs the software?
Not directly. How would your program be supposed to know that it is going to be run in the next few minutes?
So you probably need a helper mechanism or tricks.
The options I see here are:
Indexing mechanisms to provide a faster and better aimed access to your data. This is helpful if you only need small chunks of information from these data at once.
Attempt to parallelize the loading of the data: even if it does not actually get faster, the user has the impression it does, because they can already start working with the data that is available while the rest is fetched in the background (a minimal sketch follows this answer).
Have a helper tool starting up with the OS and pre-fetching everything, so you already have it in memory when required. Caution: This has serious implications since you reserve either a large chunk of RAM or even SSD-cache (depending on implementation) for your tool from the start. Only consider doing this if the alternative is the apocalypse…
You can also try to combine the first two options. The key to faster data availability is figuring out what to read in which order, instead of trying to load everything at once en bloc. Divide and conquer.
Without further details on the problem it is impossible to provide more specific solutions though.
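A minimal sketch of the parallel-loading idea with std::async; the file name is a placeholder, and whether this actually helps depends on how much RAM is available for the copy:

```cpp
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Kick off loading of the whole file on a background thread.
// "bigdata.bin" is a placeholder name for this sketch.
std::future<std::vector<char>> prefetch(const std::string& path)
{
    return std::async(std::launch::async, [path] {
        std::ifstream in(path, std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(in),
                                 std::istreambuf_iterator<char>());
    });
}

int main()
{
    auto pending = prefetch("bigdata.bin");  // loading starts immediately
    // ... do other start-up work, or process an already loaded chunk ...
    std::vector<char> data = pending.get();  // blocks only if loading has not finished yet
}
```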

Writing data to a file in C++ - most efficient way?

In my current project I'm dealing with a big amount of data which is being generated on-the-run by means of a "while" loop. I want to write the data onto a CSV file, and I don't know what's better - should I store all the values in a vector array and write to the file at the end, or write in every iteration?
I guess the first choice is better, but I'd like a more detailed answer if that's possible. Thank you.
Make sure that you're using an I/O library with buffering enabled, and then write every iteration.
This way your computer can start doing disk access in parallel with the remaining computations.
PS. Don't do anything crazy like flushing after each write, or opening and closing the file each iteration. That would kill efficiency.
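A minimal sketch of that advice, assuming CSV output of numeric values; the 1 MiB buffer is optional and its size arbitrary, and pubsetbuf is installed before the file is opened so it takes effect portably:

```cpp
#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);              // optional: a larger 1 MiB stream buffer
    std::ofstream out;
    out.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    out.open("results.csv");

    for (int i = 0; i < 1000000; ++i) {
        out << i << ',' << i * 0.5 << '\n';      // write every iteration...
        // ...but no std::endl and no flush(): let the buffer decide when to hit the disk
    }
}                                                 // destructor flushes and closes exactly once
```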
The most efficient method to write to a file is to reduce the number of write operations and increase the data written per operation.
Given a byte buffer of 512 bytes, the most inefficient method is to write 512 bytes, one write operation at a time. A more efficient method is to make one operation to write 512 bytes.
There is overhead associated with each call to write to a file. That overhead consists of locating the file in the drive's catalog, seeking to a new location on the drive, and writing. The actual operation of writing is quite fast; it's the seeking and waiting for the hard drive to spin up and get ready that wastes your time. So spin it up once, keep it spinning by writing a lot of stuff, then let it spin down. The more data written while the platters are spinning, the more efficient the write will be.
Yes, there are caches everywhere along the data path, but all that will be more efficient with large data sizes.
I would recommend writing the formatted output to a text buffer (whose size is a multiple of 512), and at certain points flushing the buffer to the hard drive. (512 bytes is a common sector size multiple on hard drives.)
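A minimal sketch of that batching approach, with an arbitrary 64 KiB threshold (a multiple of 512); the class and file name are hypothetical:

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical batching writer: lines accumulate in memory and are written
// to disk only when a large batch (a multiple of 512 bytes) is ready.
class BatchedWriter {
public:
    explicit BatchedWriter(const char* path) : out_(path) {}
    ~BatchedWriter() { flush(); }                 // write whatever is left at the end

    void append(const std::string& line)
    {
        buffer_ += line;
        buffer_ += '\n';
        if (buffer_.size() >= kThreshold)
            flush();
    }

private:
    void flush()
    {
        out_.write(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
        buffer_.clear();
    }

    static constexpr std::size_t kThreshold = 64 * 1024;  // 64 KiB, a multiple of 512
    std::string   buffer_;
    std::ofstream out_;
};
```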
If you like threads, you can create a thread that monitors the output buffer. When the output buffer reaches a threshold, the thread writes the contents to drive. Multiple buffers can help by having the fast processor fill up buffers while other buffers are written to the slow drive.
If your platform has DMA you might be able to speed things up by having the DMA write the data for you. Although I would expect a good driver to do this automatically.
I use this technique on an embedded system, using a UART (RS-232 port) instead of a hard drive. By using the buffering, I'm able to get about 80% efficiency.
(Loop unrolling may also help.)
The easiest way is shell redirection with the > operator. In Linux:
./miProgram > myData.txt
That takes the output of the program and puts it in a file.
Sorry for the english :)

On a c++ efficient storage, flushing into file(s) strategy

Here is the situation: A c++ program is endlessly generating data in a regular fashion. The data needs to be stored in persistent storage very quickly so it does not impede the computing time. It is not possible to know the amount of data that will be stored in advance.
After reading this and this posts, I end up following this naive strategy:
Creating one std::ofstream ofs
Opening a new file ofs.open("path/file", std::ofstream::out | std::ofstream::app)
Adding std::string using the operator <<
Closing the file once writing has terminated: ofs.close()
Nevertheless, I am still confused about the following:
Since the data will only be read afterwards, is it possible to use a binary (ios::binary) file storage? Would that be faster?
I have understood that flushing is done automatically by std::ofstream; am I safe to use it as such? Is there any impact on memory I should be aware of? Do I have to tune the std::ofstream in some way (changing its buffer size?)?
Should I be concerned about the file getting bigger and bigger? Should I close it at some point and open a new one?
Does using std::string have some drawbacks? Are there some hidden conversions that could be avoided?
Is using std::ofstream::write() more advantageous?
Thanks for your help.
1. Since the data will only be read afterwards, is it possible to use a binary (ios::binary) file storage? Would that be faster?
Since all data on any storage device is ultimately binary, telling the library to store it that way results in a more or less optimized stream of 0s and 1s. Whether it is faster depends on many things, including how you are going to use/read it afterwards. Some of them are listed in Writing a binary file in C++ very fast.
When it comes to storing on a hard drive, the performance of your code is always limited by the speed of that particular drive (which is a well-known fact).
Try to give more concrete constraints in your questions; they are too general to state as a single "problem".
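For the binary-storage and write() questions, a minimal sketch of appending raw fixed-size records with std::ofstream::write; the record layout is purely illustrative:

```cpp
#include <cstdint>
#include <fstream>

struct Record {                 // hypothetical fixed-size record
    std::uint64_t id;
    double        value;
};

int main()
{
    std::ofstream ofs("path/file.bin",
                      std::ios::binary | std::ios::out | std::ios::app);
    Record r{42, 3.14};
    // Raw bytes, no text formatting or conversions. Note that raw struct dumps
    // are not portable across platforms (padding, endianness).
    ofs.write(reinterpret_cast<const char*>(&r), sizeof r);
}
```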
I'm probably not answering your direct questions, but please excuse me if I take a step back.
If I understand the issue correctly, the concern is about spending too long writing to disk, which would delay the endless data generation.
Perhaps you can allocate a thread just for writing, while processing continues on the main thread.
The writer thread could wake at periodic intervals to write to disk what has been generated so far.
Communication between the two threads can be either:
two buffers (one active where the generation happens, one frozen, ready to be written to disk on the next batch)
or a queue of data, inserted by the producer and removed by the consumer/writer (a minimal sketch of this variant follows).
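A minimal sketch of the queue variant, assuming the producer calls push() from its generation loop; the class name, file handling, and shutdown policy are all illustrative choices:

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Sketch of the queue variant: the generating thread pushes strings,
// a dedicated writer thread drains them to disk.
class AsyncWriter {
public:
    explicit AsyncWriter(const char* path)
        : ofs_(path, std::ios::out | std::ios::app),
          worker_([this] { run(); }) {}

    ~AsyncWriter()
    {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();                       // flushes the remaining queue, then stops
    }

    void push(std::string line)               // called from the generation loop
    {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(line));
        }
        cv_.notify_one();
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string line = std::move(q_.front());
                q_.pop();
                lk.unlock();                   // write without holding the lock
                ofs_ << line << '\n';
                lk.lock();
            }
        }
    }

    std::ofstream           ofs_;
    std::mutex              m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool                    done_ = false;
    std::thread             worker_;          // declared last: starts after the rest is ready
};
```

The two-buffer variant works the same way, except the producer and writer swap whole buffers instead of passing individual items.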

Need some help writing my results to a file

My application continuously calculates strings and outputs them into a file, and it runs for almost an entire day. But writing to the file is slowing my application down. Is there a way I can improve the speed? Also, I want to extend the application so that I can send the results to another system after a particular amount of time.
Thanks & Regards,
Mousey
There are several things that may or may not help you, depending on your scenario:
Consider using asynchronous I/O, for instance by using Boost.Asio. This way your application does not have to wait for expensive I/O-operations to finish. However, you will have to buffer your generated data in memory, so make sure there is enough available.
Consider buffering your strings to a certain size, and then write them to disk (or the network) in big batches. Few big writes are usually faster than many small ones.
If you want to make it really good C++, meaning STL-compliant, make your algorithm a template function that takes an output iterator as an argument (a minimal sketch follows this list). This way you can easily have it write to files, the network, memory or the console by providing appropriate iterators.
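A minimal sketch of that output-iterator idea; the generator body and file name are placeholders:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical generator: each computed string goes through an output iterator,
// so the caller decides where the results end up.
template <typename OutputIt>
void compute_results(OutputIt out, int count)
{
    for (int i = 0; i < count; ++i)
        *out++ = "result " + std::to_string(i);   // placeholder for the real calculation
}

int main()
{
    std::ofstream file("results.txt");
    compute_results(std::ostream_iterator<std::string>(file, "\n"), 100);  // to a file

    std::vector<std::string> in_memory;
    compute_results(std::back_inserter(in_memory), 100);                   // to memory
}
```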
How about writing the results to a socket instead of a file? Another program, Y, would read the socket, open a file, write to it and close it, and after the specified time transfer the results to the other system.
I mean that the file handling is done by the other program; the original program X just sends its output to the socket and does not concern itself with flushing the file stream.
Also I want to extend the application so that I can send the results to another system after some particular amount of time.
If you just want to transfer the file to other system, then I think a simple script will be enough for that.
Use more than one file for the logging. Say, after your file reaches a size of 1 MB, rename it to something that contains the date and time and start writing to a new file with the original name (a minimal sketch follows).
then you have:
results.txt
results2010-1-2-1-12-30.txt (January 2 2010, 1:12:30)
and so on.
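A minimal sketch of that rotation scheme with std::filesystem (C++17); the 1 MB threshold and timestamp format are just examples:

```cpp
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Rotate the log once it exceeds the threshold, then reopen a fresh file.
void rotate_if_needed(std::ofstream& out, const fs::path& path,
                      std::uintmax_t threshold = 1024 * 1024)
{
    if (!fs::exists(path) || fs::file_size(path) < threshold)
        return;

    out.close();

    // Build a timestamped name such as results2010-1-2-1-12-30.txt
    std::time_t t  = std::time(nullptr);
    std::tm     tm = *std::localtime(&t);
    char stamp[64];
    std::snprintf(stamp, sizeof stamp, "%d-%d-%d-%d-%d-%d",
                  tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
                  tm.tm_hour, tm.tm_min, tm.tm_sec);

    fs::path rotated = path.parent_path() /
                       (path.stem().string() + stamp + path.extension().string());
    fs::rename(path, rotated);

    out.open(path, std::ios::out | std::ios::app);   // continue in a new results.txt
}
```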
You can buffer the results of different computations in memory and only write to the file when the buffer is full. For example, you can design your application so that it computes results for 100 calculations and writes all 100 results to the file at once, then computes another 100, and so on.
Writing to a file is obviously slow, but you can buffer the data and use a separate thread for writing to the file. This can improve the speed of your application.
Secondly, you can use FTP to transfer the files to the other system.
I think there are some red herrings here.
On an older computer system, I would recommend caching the strings and doing a small number of large writes instead of a large number of small writes. On modern systems, the default disk-caching is more than adequate and doing additional buffering is unlikely to help.
I presume that you aren't disabling caching or opening the file for every write.
It is possible that there is some issue with writing very large files, but that would not be my first guess.
How big is the output file when you finish?
What causes you to think that the file is the bottleneck? Do you have profiling data?
Is it possible that there is a memory leak?
Any code or statistics you can post would help in the diagnosis.