Need some help writing my results to a file - c++

My application continuously calculates strings and outputs them into a file. It runs for almost an entire day, but writing to the file is slowing the application down. Is there a way I can improve the speed? Also, I want to extend the application so that I can send the results to another system after a particular amount of time.
Thanks & Regards,
Mousey

There are several things that may or may not help you, depending on your scenario:
Consider using asynchronous I/O, for instance via Boost.Asio. That way your application does not have to wait for expensive I/O operations to finish. However, you will have to buffer your generated data in memory, so make sure there is enough available.
Consider buffering your strings up to a certain size and then writing them to disk (or the network) in big batches. A few big writes are usually faster than many small ones.
If you want to make it really good, STL-compliant C++, make your algorithm a function template that takes an output iterator as an argument. This way you can easily have it write to files, the network, memory or the console by providing the appropriate iterator (see the sketch below).
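A minimal sketch of the output-iterator idea (the function name compute_strings and the fixed count are made up for illustration; plug in your real calculation):

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// The algorithm only knows about an output iterator, not about files or sockets.
template <typename OutputIt>
void compute_strings(std::size_t count, OutputIt out) {
    for (std::size_t i = 0; i < count; ++i) {
        *out++ = "result " + std::to_string(i);   // stand-in for the real calculation
    }
}

int main() {
    std::ofstream file("results.txt");
    compute_strings(10, std::ostream_iterator<std::string>(file, "\n"));      // to a file

    std::vector<std::string> buffer;
    compute_strings(10, std::back_inserter(buffer));                          // to memory

    compute_strings(10, std::ostream_iterator<std::string>(std::cout, "\n")); // to the console
}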

How about writing the results to a socket instead of a file? Another program, Y, would read from the socket, open a file, write to it and close it, and after the specified time it would transfer the results to the other system.
I mean that file handling is done by the other program; the original program X just sends its output to the socket and does not concern itself with flushing the file stream.
"Also, I want to extend the application so that I can send the results to another system after a particular amount of time."
If you just want to transfer the file to the other system, then I think a simple script will be enough for that.

Use more than one file for the logging. Say, after your file reaches a size of 1 MB, rename it to something containing the date and time and start writing to a new file under the original name (see the sketch after the example below).
then you have:
results.txt
results2010-1-2-1-12-30.txt (January 2 2010, 1:12:30)
and so on.
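A minimal sketch of that rotation scheme, assuming the results.txt name and the 1 MB threshold from the example (C++17 for std::filesystem):

#include <cstdio>
#include <ctime>
#include <filesystem>
#include <fstream>
#include <string>

// Once the current file passes max_bytes, close it, rename it with a timestamp
// and reopen a fresh file under the original name.
void rotate_if_needed(std::ofstream& out,
                      const std::string& base = "results.txt",
                      std::streamoff max_bytes = 1 << 20) {
    if (!out.is_open() || static_cast<std::streamoff>(out.tellp()) < max_bytes) return;
    out.close();

    std::time_t now = std::time(nullptr);
    std::tm tm = *std::localtime(&now);
    char stamped[64];
    std::snprintf(stamped, sizeof(stamped), "results%d-%d-%d-%d-%d-%d.txt",
                  tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
                  tm.tm_hour, tm.tm_min, tm.tm_sec);

    std::filesystem::rename(base, stamped);   // archive the full file under a dated name
    out.open(base);                           // continue writing to a fresh results.txt
}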

You can buffer the results of different computations in memory and only write to the file when the buffer is full. For example, you can design your application so that it computes the results of 100 calculations and writes all 100 of them to the file at once, then computes another 100, and so on.
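A minimal sketch of that batching idea (the batch size of 100 and the compute_next() stub are placeholders for your own code):

#include <fstream>
#include <string>
#include <vector>

std::string compute_next() {                  // stand-in for one real calculation
    static int i = 0;
    return "result " + std::to_string(++i);
}

int main() {
    std::ofstream out("results.txt");
    std::vector<std::string> batch;
    batch.reserve(100);

    for (int n = 0; n < 100000; ++n) {
        batch.push_back(compute_next());
        if (batch.size() == 100) {            // one big write instead of 100 small ones
            for (const std::string& s : batch) out << s << '\n';
            batch.clear();
        }
    }
    for (const std::string& s : batch) out << s << '\n';   // flush the remainder
}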

Writing to a file is obviously slow, but you can buffer the data and start a separate thread for writing to the file. This can improve the speed of your application.
Secondly, you can use FTP to transfer the files to the other system.

I think there are some red herrings here.
On an older computer system, I would recommend caching the strings and doing a small number of large writes instead of a large number of small writes. On modern systems, the default disk-caching is more than adequate and doing additional buffering is unlikely to help.
I presume that you aren't disabling caching or opening the file for every write.
It is possible that there is some issue with writing very large files, but that would not be my first guess.
How big is the output file when you finish?
What causes you to think that the file is the bottleneck? Do you have profiling data?
Is it possible that there is a memory leak?
Any code or statistics you can post would help in the diagnosis.

Related

Memory mapped IO concept details

I'm attempting to figure out what the best way is to write files in Windows. For that, I've been running some tests with memory mapping, in an attempt to figure out what is happening and how I should organize things...
Scenario: The file is intended to be used in a single process, in multiple threads. You should see a thread as a worker that works on the file storage; some of them will read, some will write - and in some cases the file will grow. I want my state to survive both process and OS crashes. Files can be large, say: 1 TB.
After reading a lot on MSDN, I whipped up a small test case. What I basically do is the following:
Open a file (CreateFile) using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH.
Build a mmap file handle (CreateFileMapping) on the file, using some file growth mechanism.
Map the memory regions (MapViewOfFile) using a multiple of the sector size (from STORAGE_PROPERTY_QUERY). The mode I intend to use is READ+WRITE.
So far I've been unable to figure out how to use these constructs exactly (tools like diskmon won't work for good reasons), so I decided to ask here. What I basically want to know is: how can I best use these constructs for my scenario?
If I understand correctly, this is more or less the correct approach; however, I'm unsure as to the exact role of CreateFileMapping vs MapViewOfFile and if this will work in multiple threads (e.g. the way writes are ordered when they are flushed to disk).
I intend to open the file once per process as per (1).
Per thread, I intend to create a mmap file handle as per (2) for the entire file. If I need to grow the file, I will estimate how much space I need, close the handle and reopen it using CreateFileMapping.
While the worker is doing its thing, it needs pieces of the file. So, I intend to use MapViewOfFile (which seems limited to 2 GB) for each piece, process it and unmap it again.
Questions:
Do I understand the concepts correctly?
When is data physically read and written to disk? So, when I have a loop that writes 1 MB of data in (3), will it write that data after the unmap call? Or will it write data the moment I hit memory in another page? (After all, disks are block devices so at some point we have to write a block...)
Will this work in multiple threads? This is about the calls themselves - I'm not sure if they will fail if you have, say, 100 workers.
I do understand that (written) data is immediately available in other threads (unless it's a remote file), which means I should be careful with read/write concurrency. If I intend to write stuff and afterwards update a single-physical-block header (indicating that readers should use another pointer from now on), is it guaranteed that the data is written prior to the header?
Will it matter if I use 1 file or multiple files (assuming they're on the same physical device of course)?
Memory mapped files generally work best for READING; not writing. The problem you face is that you have to know the size of the file before you do the mapping.
You say:
in some cases the file will grow
Which really rules out a memory mapped file.
When you create a memory mapped file on Windoze, you are creating your own page file and mapping a range of memory to that page file. This tends to be the fastest way to read binary data, especially if the file is contiguous.
For writing, memory mapped files are problematic.
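For reference, a minimal sketch of the call sequence described in the question (no error handling; the file name, the fixed 1 MB mapping size and the 64 KB view size are placeholders, and growing the file would mean recreating the mapping):

#include <windows.h>

int main() {
    // (1) Open the file with the flags from the question.
    HANDLE file = CreateFileA("storage.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                              OPEN_ALWAYS,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);

    // (2) The mapping object fixes the file size (here 1 MB); a larger file later
    //     requires closing this handle and creating a new, bigger mapping.
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE, 0, 1 << 20, NULL);

    // (3) Map one 64 KB piece of the file for read/write access.
    void* view = MapViewOfFile(mapping, FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 64 * 1024);

    static_cast<char*>(view)[0] = 42;   // touch the mapped memory
    FlushViewOfFile(view, 0);           // ask the OS to write the dirty pages now
    UnmapViewOfFile(view);

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}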

Writing similar contents to many files at once in C++

I am working on a C++ program that needs to write several hundreds of ASCII files. These files will be almost identical. In particular, the size of the files is always exactly the same, with only few characters different between them.
For this I am currently opening up N files with a for-loop over fopen and then calling fputc/fwrite on each of them for every chunk of data (every few characters). This seems to work, but it feels like there should be some more efficient way.
Is there something I can do to decrease the load on the file system and/or improve the speed of this? For example, how taxing is it on the file system to keep hundreds of files open and write to all of them bit by bit? Would it be better to open one file, write that one entirely, close it and only then move on to the next?
If you consider the cost of the context switches usually involved in doing any of those syscalls, then yes, you should "piggyback" as much data as possible, taking into account the writing time and the length of your buffers.
Given also that this is primarily an I/O-driven problem, a pub/sub architecture, where the publisher buffers the data and hands it to any subscriber that does the I/O work (and that also waits for the underlying storage mechanism to be ready), could be a good choice.
You can write just once to one file and then make copies of that file. You can read about how to make copies here.
This is the sample code from the link above showing how to do it (note that it is Managed C++, not standard C++):
int main() {
    String* path = S"c:\\temp\\MyTest.txt";
    String* path2 = String::Concat(path, S"temp");

    // Ensure that the target does not exist.
    File::Delete(path2);

    // Copy the file.
    File::Copy(path, path2);
    Console::WriteLine(S"{0} copied to {1}", path, path2);
    return 0;
}
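In standard C++17, std::filesystem provides the same operation without the managed runtime; a minimal sketch with the same example paths:

#include <filesystem>

int main() {
    const std::filesystem::path path  = "c:\\temp\\MyTest.txt";
    const std::filesystem::path path2 = "c:\\temp\\MyTest.txttemp";

    std::filesystem::remove(path2);            // ensure that the target does not exist
    std::filesystem::copy_file(path, path2);   // copy the file
    return 0;
}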
Without benchmarking your particular system, I would GUESS - and that is probably as good as you can get - that writing one file at a time is better than opening lots of files and writing the data to all of them bit by bit. After all, preparing the data in memory is a minor detail; writing to the file is the "long process".
I have done some testing now and it seems like, at least on my system, writing all files in parallel is about 60% slower than writing them one after the other (263s vs. 165s for 100 files times 100000000 characters).
I also tried to use ofstream instead of fputc, but fputc seems to be about twice as fast.
In the end, I will probably keep doing what I am doing at the moment, since the complexity of rewriting my code to write one file at a time is not worth the performance improvement.

What is the fastest design to download and convert a large binary file?

I have a 1GB binary file on another system.
Requirement: ftp/download and convert binary to CSV on main system.
The converted file will be several times larger, ~8 GB.
What is the most common way of doing something similar to this?
Should this be a two step independent process, download - then convert?
Should I download small chunks at a time and convert while downloading?
I don't know the most efficient way to do this... also, what should I be cautious of with files this size?
Any advice is appreciated.
Thank You.
(Visual Studio C++)
I would write a program that converts the binary format and outputs to CSV format. This program would read from stdin and write to stdout.
Then I would call
wget URL_of_remote_binary_file --output-document=- | my_converter_program > output_file.csv
That way you can start converting immediately (without downloading the entire file first), and your program doesn't have to deal with networking. You can also run the converter on the remote side, assuming it's portable enough.
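A minimal sketch of such a filter (the fixed-size record layout is an assumed example; substitute your real binary format). It reads records from stdin and writes CSV lines to stdout, so it drops straight into the wget pipeline above:

#include <cstdint>
#include <cstdio>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

#pragma pack(push, 1)
struct Record {                 // hypothetical layout of one binary record
    std::uint32_t id;
    double value;
};
#pragma pack(pop)

int main() {
#ifdef _WIN32
    _setmode(_fileno(stdin), _O_BINARY);   // stdin defaults to text mode on Windows
#endif
    Record r;
    while (std::fread(&r, sizeof r, 1, stdin) == 1) {
        std::printf("%u,%f\n", r.id, r.value);
    }
    return 0;
}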
Without knowing any specifics, I would go with a binary ftp download and then post-process with a separate conversion program. This would break the process into two distinct and unrelated parts which would aid in building and debugging the overall system. No need to reinvent an FTP system and lots of potential to optimize the post-processing.
To avoid too much traffic, I would compress the file and transfer it as a first step. The conversion can then be redone locally, if something goes wrong or you want a different output, without refetching the data.
The only precaution is not to load the whole thing into memory and then convert it, but to do it chunk-wise like you said. You can prevent some unpleasant surprises for users of your program by creating/pre-allocating a huge file of the maximum expected size, to avoid running out of disk space during the conversion phase. Some filesystems also do not like files bigger than 2 GB or 4 GB, and that would also be caught by the pre-allocation trick.
It depends on your data and your requirements. What performance requirements do you have? Do you need to finish such as task in X amount of time (where speed is critical), or is this something that will just be done periodically (in which case speed is not essential)?
That said, you will certainly get a cleaner implementation if you separate the work out into two tasks - a downloader and a converter. That way each component can be simple and just focus on the task at hand. All things being equal, I recommend this approach.
Otherwise if you try to download/convert at the same time you may get into situations where your downloader has data ready, but the converter needs more data before it can proceed. Again, there is no reason why your code cannot handle this, but it will make the implementation more complicated and that much more difficult to debug / test / validate.
It's usually better to do it as separate processes with no interdependency. If your requirements change in the future you can reuse the pieces, or use them for other projects.
Here are even more guesses about your requirements and possible solutions:
Concerned about file integrity? Implement something that includes integrity checks such as sequence numbers, size fields and checksums/hashes, and just enough transaction semantics so that the system knows whether a transfer completed or didn't.
Are uploads happening on slow/congested links, and may they be interrupted? Implement a protocol that allows the transfer to resume after interruption.
Are uploads recurring, with much of the data unchanged? Implement something amenable to incremental update, so you upload only the differences.

All things equal what is the fastest way to output data to disk in C++?

I am running simulation code that is largely bound by CPU speed. I am not interested in pushing data in/out to a user interface, simply saving it to disk as it is computed.
What would be the fastest solution that would reduce overhead? iostreams? printf? I have previously read that printf is faster. Will this depend on my code and is it impossible to get an answer without profiling?
This will be running in Windows and the output data needs to be in text format, tab/comma separated, with formatting/precision options for mostly floating point values.
Construct (large-ish) blocks of data which can be sequentially written and use asynchronous IO.
Accurate profiling will be painful; read some papers on the subject: scholar.google.com.
I haven't used them myself, but I've heard memory mapped files offer the best optimisation opportunities to the OS.
Edit: related question, and Wikipedia article on memory mapped files — both mention performance benefits.
My thought is that you are tackling the wrong problem. Why are you writing out vast quantities of text-formatted data? If it is because you want it to be human readable, write a quick browser program that reads the data in binary format on the fly - this way the simulation application can quickly write out binary data and the browser does the grunt work of formatting it as and when needed. If it is because you are using some stats package to read and analyse text data, then write one that inputs binary data.
Scott Meyers' More Effective C++, Item 23 "Consider alternative libraries", suggests using stdio over iostreams if you prefer speed over safety and extensibility. It's worth checking.
The fastest way is whatever is fastest for your particular application running on its typical target OS and hardware. The only sensible thing to do is to try several approaches and time them. You probably don't need a complete profile, and the exercise should only take a few hours. I would test, in this order:
normal C++ stream I/O
normal stream I/O using ostream::write()
use of the C I/O library
use of system calls such as write()
asynch I/O
And I would stop when I found a solution that was fast enough.
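For example, a minimal sketch of timing one candidate (plain ofstream formatting here; the row count and column format are placeholders), which can be repeated for each approach on the list:

#include <chrono>
#include <fstream>
#include <iostream>

int main() {
    const int rows = 1'000'000;
    auto start = std::chrono::steady_clock::now();

    std::ofstream out("timing_test.csv");
    for (int i = 0; i < rows; ++i) {
        out << i << '\t' << i * 0.5 << '\t' << i * 0.25 << '\n';  // tab-separated values
    }
    out.close();   // make sure the stream is flushed before stopping the clock

    auto end = std::chrono::steady_clock::now();
    std::cout << "ofstream: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms\n";
}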
Text format means it's for human consumption. The speed at which humans can read is far, far lower than the speed of any reasonable output method, so there's a contradiction somewhere. I suspect the "output must be text format" requirement.
Therefore, I believe the correct way is to output binary and provide a separate viewer that converts individual entries to readable text. Formatting in the viewer only needs to be as fast as people can read.
Mapping the file to memory (i.e. using a memory-mapped file) and then just memcpy-ing data into it is a really fast way of reading/writing.
You can use several threads/cores to write the data, and the OS/kernel will sync the pages to disk using the same kind of routines used for virtual memory, which one can expect to be optimized to hell and back, more or less.
Chiefly, there should be few extra copies/buffers in memory when doing this. Dirty pages are queued for writing to disk by the OS once they have been written to.
Open the file in binary mode, and write "unformatted" data to the disc.
fstream myFile;
...
myFile.open("mydata.bin", ios::in | ios::out | ios::binary);
...
struct Data {
    int key;
    double value;
    char desc[10];
};

Data x;
myFile.seekp(location1);   // move the put pointer to the record's offset
myFile.write(reinterpret_cast<char*>(&x), sizeof(Data));
EDIT: The OP added the "Output data needs to be in text format, whether tab or comma separated." constraint.
If your application is CPU bound, the formatting of output is an overhead that you do not need. Binary data is much faster to write and read than ascii, is smaller on the disc (e.g. there are fewer total bytes written with binary than with ascii), and because it is smaller it is faster to move around a network (including a network mounted file system). All indicators point to binary as a good overall optimization.
Viewing the binary data can be done after the run with a simple utility that will dump the data to ascii in whatever format is needed. I would encourage some version information be added to the resulting binary data to ensure that changes in the format of the data can be handled in the dump utility.
Moving from binary to ascii, and then quibbling over the relative performance of printf versus iostreams is likely not the best use of your time.
The fastest way is completion-based asynchronous IO.
By giving the OS a set of data to write, which it hasn't actually written when the call returns, the OS can reorder it to optimise write performance.
The API for doing this is OS-specific: on Linux it's called AIO; on Windows it's called I/O completion ports.
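As an illustration, a minimal sketch of Windows overlapped I/O in its simpler event-based form (a full I/O completion port setup adds a port and worker threads on top of the same OVERLAPPED writes); the file name and buffer are placeholders:

#include <windows.h>
#include <cstdio>

int main() {
    // FILE_FLAG_OVERLAPPED enables asynchronous (completion-based) I/O on the handle.
    HANDLE h = CreateFileA("results.txt", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    char buf[] = "some computed results\n";
    OVERLAPPED ov = {};
    ov.Offset = 0;                                       // write position within the file
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);   // signalled when the write completes

    // WriteFile returns immediately; ERROR_IO_PENDING means the write has been queued.
    if (!WriteFile(h, buf, sizeof(buf) - 1, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        return 1;
    }

    // ... keep computing here while the OS performs the write ...

    DWORD written = 0;
    GetOverlappedResult(h, &ov, &written, TRUE);         // TRUE = block until completion
    std::printf("wrote %lu bytes\n", written);

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return 0;
}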
A fast method is to use double buffering and multiple threads (at least two).
One thread is in charge of writing data to the hard drive. This thread checks the buffer and, if it is not empty (or some other rule triggers), begins writing to the hard drive.
The other thread writes formatted text to the buffer.
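A minimal sketch of this two-thread, double-buffer arrangement (the buffer size, file name and the "line ..." payload are placeholders): the producer fills the front buffer, and when it is full the buffers are swapped and the writer thread flushes the back buffer to disk.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

int main() {
    std::vector<std::string> front, back;
    std::mutex m;
    std::condition_variable cv;
    bool back_ready = false, done = false;

    std::thread writer([&] {
        std::ofstream out("results.txt");
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return back_ready || done; });
            if (back_ready) {
                std::vector<std::string> local;
                local.swap(back);                               // take the full buffer
                back_ready = false;
                lock.unlock();
                for (const auto& s : local) out << s << '\n';   // one big write burst
                lock.lock();
                cv.notify_one();
            } else {
                return;                                         // done and nothing left
            }
        }
    });

    for (int i = 0; i < 100000; ++i) {                // the simulation / formatting thread
        front.push_back("line " + std::to_string(i));
        if (front.size() == 4096) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !back_ready; });   // wait until the last batch was taken
            front.swap(back);
            back_ready = true;
            cv.notify_one();
        }
    }
    {
        std::lock_guard<std::mutex> lock(m);
        for (auto& s : front) back.push_back(std::move(s));   // flush the remainder
        back_ready = true;
        done = true;
    }
    cv.notify_one();
    writer.join();
}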
One performance issue with hard drives is the amount of time required to get up to speed and position the head at the correct location. To avoid this, the objective is to write to the hard drive continuously so that it doesn't stop. This is tricky and may involve factors outside your program's scope (such as other programs running at the same time). The larger the chunk of data written to the hard drive, the better.
Another thorn is finding empty slots on the hard drive to put the data. A fragmented hard drive would be slower than a formatted or defragmented drive.
If portability is not an issue, you can check your OS for some APIs that perform block writes to the hard drive. Or you can go down lower and use the API that writes directly to the drive.
You may also want your program to change its priority so that it is one of the most important tasks running.

Many small files or one big file? (Or, Overhead of opening and closing file handles) (C++)

I have created an application that does the following:
Make some calculations and write the calculated data to a file - repeat 500,000 times (overall, 500,000 files are written one after the other) - then repeat 2 more times (overall, 1.5 million files are written).
Read data from a file and make some intense calculations with it - repeat for 1,500,000 iterations (iterating over all the files written in step 1).
Repeat step 2 for 200 iterations.
Each file is ~212k, so overall I have ~300 GB of data. It looks like the entire process takes ~40 days on a 2.8 GHz Core 2 Duo CPU.
My problem (as you can probably guess) is the time it takes to complete the entire process. All the calculations are serial (each calculation is dependent on the one before), so I can't parallelize this process across different CPUs or PCs. I'm trying to think how to make the process more efficient, and I'm pretty sure most of the overhead goes to file system access (duh...). Every time I access a file I open a handle to it and then close it once I finish reading the data.
One of my ideas to improve the run time was to use one big file of 300 GB (or several big files of 50 GB each); then I would only use one open file handle and simply seek to each relevant piece of data and read it. But I'm not sure what the overhead of opening and closing file handles is - can someone shed some light on this?
Another idea I had was to try to group the files into bigger ~100 MB files and then read 100 MB each time instead of doing many 212k reads, but this is much more complicated to implement than the idea above.
Anyway, if anyone can give me some advice on this or has any idea how to improve the run time, I would appreciate it!
Thanks.
Profiler update:
I ran a profiler on the process; it looks like the calculations take 62% of the runtime and the file reads take 34%. That means that even if I miraculously eliminated the file I/O cost entirely, I would still be left with ~24 days, which is quite an improvement, but still a long time :)
Opening a file handle isn't likely to be the bottleneck; actual disk I/O is. If you can parallelize disk access (e.g. by using multiple disks, faster disks, a RAM disk, ...) you may benefit far more. Also, be sure not to let I/O block the application: read from disk and process while waiting for the I/O, e.g. with a reader thread and a processor thread.
Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk? Maybe with another view on the process' dependencies you can rework the data flow and get rid of a lot of IO.
Oh yes, and measure it :)
"Each file is ~212k, so overall I have ~300 GB of data. It looks like the entire process takes ~40 days ... all the calculations are serial (each calculation is dependent on the one before), so I can't parallelize this process across different CPUs or PCs. ... pretty sure most of the overhead goes to file system access ... Every time I access a file I open a handle to it and then close it once I finish reading the data."
Writing 300 GB of data serially might take ~40 minutes, only a tiny fraction of 40 days; disk write performance shouldn't be the issue here.
Your idea of opening the file only once is spot-on. Closing the file after every operation is probably causing your processing to block until the disk has completely written out all the data, negating the benefits of disk caching.
My bet is that the fastest implementation of this application will use a memory-mapped file; all modern operating systems have this capability. It can end up being the simplest code, too. You'll need a 64-bit processor and operating system, but you should not need 300 GB of RAM. Map the whole file into your address space at one time and just read and write your data with pointers.
From your brief explanation, it sounds like xtofl's suggestion of threads is the correct way to go. I would recommend you profile your application first, though, to see how the time is divided between I/O and CPU.
Then I would consider three threads joined by two queues.
Thread 1 reads the files and loads them into RAM, then places the data/pointers in the first queue. If the queue goes over a certain size, the thread sleeps; if it goes below a certain size, it starts again.
Thread 2 reads the data off the first queue, does the calculations, and then writes the results to the second queue.
Thread 3 reads the second queue and writes the data to disk.
You could consider merging threads 1 and 3; this might reduce contention on the disk, as your app would only do one disk operation at a time.
Also, how does the operating system handle all the files? Are they all in one directory? What is performance like when you browse the directory (GUI file manager / dir / ls)? If this performance is bad, you might be working outside your file system's comfort zone. Although you can probably only change this on Unix, some file systems are optimised for different types of file usage, e.g. large files, lots of small files, etc. You could also consider splitting the files across different directories.
Before making any changes it might be useful to run a profiler trace to figure out where most of the time is spent to make sure you actually optimize the real problem.
What about using SQLite? I think you can get away with a single table.
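A minimal sketch of that single-table approach using the SQLite C API (the database name, schema and the dummy ~212k payload are placeholders); batching the inserts into one transaction keeps the write overhead low:

#include <sqlite3.h>
#include <string>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("results.db", &db);

    // One table; each row holds the blob produced by one calculation step.
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS results(id INTEGER PRIMARY KEY, data BLOB);",
                 nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO results(data) VALUES(?1);", -1, &stmt, nullptr);

    std::string payload(212 * 1024, 'x');                     // stand-in for one ~212k result
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);    // batch inserts in one transaction
    for (int i = 0; i < 1000; ++i) {
        sqlite3_bind_blob(stmt, 1, payload.data(), static_cast<int>(payload.size()), SQLITE_STATIC);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}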
Using memory mapped files should be investigated as it will reduce the number of system calls.