I am working on a C++ program that needs to write several hundred ASCII files. These files will be almost identical. In particular, the size of the files is always exactly the same, with only a few characters differing between them.
For this I am currently opening up N files with a for-loop over fopen and then calling fputc/fwrite on each of them for every chunk of data (every few characters). This seems to work, but it feels like there should be some more efficient way.
Is there something I can do to decrease the load on the file system and/or improve the speed of this? For example, how taxing is it on the file system to keep hundreds of files open and write to all of them bit by bit? Would it be better to open one file, write that one entirely, close it and only then move on to the next?
If you consider the cost of the context switches usually involved in any of those syscalls, then yes, you should piggyback as much data as possible, taking into account the write timing and the length of your buffers.
Also, given that this is primarily an I/O-driven problem, a publish-subscribe architecture where the publisher buffers the data and hands it to subscribers that do the I/O work (and that also wait for the underlying storage to be ready) could be a good choice.
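As a concrete illustration of the piggybacking idea (this sketch is mine, not the answerer's): giving each FILE* a large user buffer with setvbuf lets the many small fputc/fwrite calls coalesce into far fewer write syscalls. The file names and the buffer size below are just placeholders.

#include <cstdio>
#include <string>
#include <vector>

int main() {
    const int N = 200;                                    // number of nearly identical files
    const std::size_t kBufSize = 1 << 20;                 // 1 MB user buffer per stream
    std::vector<FILE*> files(N);
    std::vector<std::vector<char>> buffers(N, std::vector<char>(kBufSize));

    for (int i = 0; i < N; ++i) {
        files[i] = std::fopen(("out" + std::to_string(i) + ".txt").c_str(), "w");
        std::setvbuf(files[i], buffers[i].data(), _IOFBF, kBufSize);  // full buffering
    }

    // ... fputc/fwrite the (mostly identical) data to every file as before ...

    for (FILE* f : files) std::fclose(f);                 // flushing happens here, in big blocks
}

Note that a few hundred streams with 1 MB buffers each costs a few hundred MB of memory; shrink kBufSize if that is too much.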
You can write just once to one file and then make copies of that file. You can read about how to make copies here.
This is the sample code from the link above showing how to do it (note that it is Managed C++ / .NET rather than standard C++):
#using <mscorlib.dll>
using namespace System;
using namespace System::IO;

int main() {
    String* path = S"c:\\temp\\MyTest.txt";
    String* path2 = String::Concat(path, S"temp");
    // Ensure that the target does not exist.
    File::Delete(path2);
    // Copy the file.
    File::Copy(path, path2);
    Console::WriteLine(S"{0} copied to {1}", path, path2);
    return 0;
}
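For a native program, a rough standard C++ equivalent (using C++17's <filesystem>; the paths and count here are placeholders) would be to write the template file once and copy it N times:

#include <filesystem>
#include <string>

int main() {
    namespace fs = std::filesystem;
    const fs::path source = "MyTest.txt";                  // the file written once
    for (int i = 0; i < 100; ++i) {
        fs::path target = "copy" + std::to_string(i) + ".txt";
        fs::copy_file(source, target, fs::copy_options::overwrite_existing);
    }
    // The few characters that differ can then be patched into each copy afterwards.
}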
Without benchmarking your particular system, I would GUESS - and that is probably as good as you can get - that writing one file at a time is better than opening lots of files and writing the data to several files at once. After all, preparing the data in memory is a minor detail; writing to the file is the "long process".
I have done some testing now and it seems like, at least on my system, writing all files in parallel is about 60% slower than writing them one after the other (263s vs. 165s for 100 files times 100000000 characters).
I also tried to use ofstream instead of fputc, but fputc seems to be about twice as fast.
In the end, I will probably keep doing what I am doing at the moment, since the complexity of rewriting my code to write one file at a time is not worth the performance improvement.
I am trying to write a program to split a large collection of gene sequences into many files based on values inside a certain segment of each sequence. For example the sequences might look like
AGCATGAGAG...
GATCAGGTAA...
GATGCGATAG...
... 100 million more
The goal is then to split the reads into individual files based on the sequences from position 2 to 7 (6 bases). So we get something like
AAAAAA.txt.gz
AAAAAC.txt.gz
AAAAAG.txt.gz
...4000 more
Now naively I have implemented a C++ program that
reads in each sequence
opens the relevant file
writes in the sequence
closes the file
Something like
#include <zlib.h>
#include <string>

int main() {
    SeqFile seq_file("input.txt.gz");                   // my gzip-backed reader class
    std::string read;
    while (!(read = seq_file.get_read()).empty()) {     // empty string signals end of input
        std::string tag = read.substr(1, 6);            // bases 2 to 7
        std::string output_path = tag + ".txt.gz";
        gzFile output = gzopen(output_path.c_str(), "a");  // append to that tag's file
        gzprintf(output, "%s", read.c_str());
        gzclose(output);
    }
}
This is unbearably slow compared to just writing the whole contents into a single other file.
What is the bottleneck in this situation, and how might I improve performance given that I can't keep all the files open simultaneously due to system limits?
Since opening a file is slow, you need to reduce the number of files you open. One way to accomplish this is to make multiple passes over your input. Open a subset of your output files, make a pass over the input and only write data to those files. When you're done, close all those files, reset the input, open a new subset, and repeat.
The bottleneck is the opening and closing of the output file. If you can move this out of the loop somehow, e.g. by keeping multiple output files open simultaneously, your program should speed up significantly. In the best-case scenario it is possible to keep all 4096 files open at the same time, but if you hit some system limit, even keeping a smaller number of files open and doing multiple passes through the input should be faster than opening and closing files in the tight loop.
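To make the multi-pass idea concrete (this sketch is mine, reusing the question's hypothetical SeqFile reader): split the work by the first base of the tag, so each pass keeps at most 4^5 = 1024 output files open; use a longer prefix if that is still over your descriptor limit.

#include <zlib.h>
#include <string>
#include <unordered_map>

int main() {
    const std::string bases = "ACGT";
    for (char first : bases) {                          // one pass per leading base
        SeqFile seq_file("input.txt.gz");               // re-read the input on each pass
        std::unordered_map<std::string, gzFile> open_files;
        std::string read;
        while (!(read = seq_file.get_read()).empty()) {
            std::string tag = read.substr(1, 6);
            if (tag[0] != first) continue;              // not this pass's subset
            gzFile& out = open_files[tag];
            if (out == nullptr)                         // first time we see this tag in this pass
                out = gzopen((tag + ".txt.gz").c_str(), "w");
            gzprintf(out, "%s\n", read.c_str());        // one read per line (assumed format)
        }
        for (auto& kv : open_files)
            gzclose(kv.second);
    }
}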
The compression might be slowing the writing down; writing plain text files first and compressing them afterwards could be worth a try.
Opening the file is the bottleneck. You could store some of the data in a container and, when it reaches a certain size, write the largest set to the corresponding file (see the sketch below).
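A rough sketch of that buffering idea (mine, not the answerer's, again assuming the question's hypothetical SeqFile reader): accumulate reads per tag in memory and append a whole batch once a tag's buffer grows large, so each gzopen is amortized over many reads. The 1 MB threshold is arbitrary.

#include <zlib.h>
#include <string>
#include <unordered_map>

static void flush_tag(const std::string& tag, std::string& buf) {
    gzFile out = gzopen((tag + ".txt.gz").c_str(), "a");   // append a whole batch at once
    gzwrite(out, buf.data(), static_cast<unsigned>(buf.size()));
    gzclose(out);
    buf.clear();
}

int main() {
    const std::size_t kFlushBytes = 1 << 20;               // flush a tag at ~1 MB
    std::unordered_map<std::string, std::string> buffers;
    SeqFile seq_file("input.txt.gz");
    std::string read;
    while (!(read = seq_file.get_read()).empty()) {
        std::string tag = read.substr(1, 6);
        std::string& buf = buffers[tag];
        buf += read;
        buf += '\n';
        if (buf.size() >= kFlushBytes)
            flush_tag(tag, buf);
    }
    for (auto& kv : buffers)                               // write out whatever is left
        if (!kv.second.empty()) flush_tag(kv.first, kv.second);
}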
I can't actually answer the question - because to do that, I would need to have access to YOUR system (or a reasonably precise replica). The type of disk and how it is connected, how much and type of memory and model/number of CPU will matter.
However, there are a few different things to consider, and that may well help (or at least tell you that "you can't do better than this").
First find out what takes up the time: CPU or disk-I/O?
Use top or system monitor or some such to measure what CPU-usage your application uses.
Write a simple program that writes a single value (zero?) to a file, without zipping it, for a similar size to what you get in your files. Compare this to the time it takes to write your gzip-file. If the time is about the same, then you are I/O-bound, and it probably doesn't matter much what you do.
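Such a baseline test could be as small as the following (mine, not the answerer's; the total volume is a made-up stand-in for whatever your run produces), timed with something like time ./a.out:

#include <cstdio>
#include <vector>

int main() {
    const unsigned long long total = 2ull << 30;        // ~2 GB, stand-in for your output volume
    std::vector<char> block(1 << 20, 0);                 // 1 MB of zeros
    FILE* f = std::fopen("io_baseline.bin", "wb");
    for (unsigned long long written = 0; written < total; written += block.size())
        std::fwrite(block.data(), 1, block.size(), f);
    std::fclose(f);
}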
If you have lots of CPU usage, you may want to split the writing work into multiple threads - you obviously can't really do that with the reading, as it has to be sequential (reading gzip in multiple threads is not easy, if at all possible, so let's not try that). Use one thread per CPU core: if you have 4 cores, use one to read and three to write. You may not get 4 times the performance, but you should get a good improvement (a rough sketch of the handoff is below).
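This is not from the original answer, just a bare-bones single-producer / multi-consumer sketch. It assumes each queued job carries the complete batch destined for one output file, so no two workers ever append to the same file at the same time.

#include <zlib.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct WriteJob { std::string path; std::string data; };

std::queue<WriteJob> jobs;
std::mutex m;
std::condition_variable cv;
bool done = false;

void writer() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !jobs.empty() || done; });
        if (jobs.empty()) return;                        // finished and nothing left to write
        WriteJob job = std::move(jobs.front());
        jobs.pop();
        lock.unlock();
        gzFile out = gzopen(job.path.c_str(), "a");      // compression happens in this thread
        gzwrite(out, job.data.data(), static_cast<unsigned>(job.data.size()));
        gzclose(out);
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 3; ++i) workers.emplace_back(writer);   // three writers, one reader (this thread)

    // The sequential reader batches reads per tag and enqueues them, e.g.:
    // { std::lock_guard<std::mutex> lock(m);
    //   jobs.push({"AAAAAA.txt.gz", batch_for_that_tag}); }
    // cv.notify_one();

    { std::lock_guard<std::mutex> lock(m); done = true; }       // reader is finished
    cv.notify_all();
    for (auto& w : workers) w.join();
}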
Quite certainly, at some point, you will be bound by the speed of the disk. Then the only option is to buy a better disk (if you haven't already got that!)
I'm writing code that occasionally needs to write data to a file, then send that file to another program for analysis, and repeat the process.
The format of the file is very rigid; headers are required, but they are unchanging and only about 10 lines. So I have two options:
1. Write a function to delete lines from the end of a file until I reach the header section.
2. Remove the old file and create a new file with the same name in its place, rewriting the header part every time.
So my question is this: are there significant efficiency issues in file creation and deletion? It seems easier to write that than to have to write a dynamic deleteLines() function, but I'm curious about the overhead involved. If it matters, I'm working in C++.
The question is, what actions do the different methods entail? Here are some answers:
Truncating a file means
Updating the inode controlling the file
Updating the filesystem's information on free blocks
Deleting a file means
Updating the directory that contains the link to the file
Decrementing the file's reference count and updating the filesystem's information on free blocks as necessary
Creating a file means
Creating an inode for it
Updating the directory that is to contain the file
Updating the filesystem's information on free blocks
Adding data to an empty file means
Allocating a block for the data and updating the filesystem's information on free blocks
Updating the inode controlling the file
I think it is clear that deleting, creating, and appending to a file entail quite a few more operations than simply truncating the file after the header.
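For reference, a minimal sketch of the truncate-after-header approach (mine; it assumes the header always occupies the same, known number of bytes, and the 420 below is made up), using C++17's <filesystem>:

#include <cstdint>
#include <filesystem>

int main() {
    constexpr std::uintmax_t kHeaderBytes = 420;               // hypothetical fixed header size
    std::filesystem::resize_file("data.txt", kHeaderBytes);    // drop everything after the header
    // Then reopen "data.txt" in append mode and write the new body.
}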
However, as others have noted, if you want speed, use pipes or shared memory regions (for details look at the documentation of mmap()) or similar mechanisms. Disks are among the slowest things ever built into a computer...
PS: Ignoring performance while designing/choosing the algorithms is the evil root of all slow code... in this respect you had better listen to Torvalds than to Knuth.
Performance in this case depends on many things, on the underlying file system etc. So, benchmark it. It will be quite easy to write and will give you the best answer.
And keep in mind Donald Knuth's statement:
We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil.
Deleting the old file and writing a new one is probably faster, since you would only be keeping a few bytes anyway. If you modify the existing file, it has to read the data first and then write the new data; if you just go ahead and write, there is only the write operation.
But the main point is that just writing the new file is probably far easier to implement and understand, so it should be your default choice unless and until you find that the application is not fast enough and profiling shows this particular piece to be a bottleneck.
I have a piece of software that performs a set of experiments (C++).
Without storing the outcomes, all experiments take a little over a minute.
The total amount of data generated is equal to 2.5 Gbyte, which is too large to store in memory till the end of the experiment and write to file afterwards.
Therefore I write them in chunks.
for (int i = 0; i < chunkSize; i++) {
    outfile << results_experiments[i] << endl;
}
where
ofstream outfile("data");
and outfile is only closed at the end.
However, when I write them in chunks of 4700 kbytes (actually 4700 / chunkSize = the size of a results_experiments element), the experiments take about 50 times longer (over an hour...). This is unacceptable and makes my prior optimization attempts look rather silly, especially since these experiments again need to be performed with many different parameter settings etc. (at least 100 times, but preferably more).
Concretely, my questions are:
What would be the ideal chunksize to write at?
Is there a more efficient way than (or something very inefficient in) the way I write data currently?
Basically: help me keep the file I/O overhead I am introducing as small as possible.
I think it should be possible to do this a lot faster, as copying (writing & reading!) the resulting file (same size) takes me under a minute.
The code should be fairly platform independent and not use any non-standard libraries (I can provide separate versions for separate platforms and more complicated install instructions, but it is a hassle..).
If it is not feasible to get the total experiment time under 5 minutes without platform/library dependencies (and possibly with them), I will seriously consider introducing these. (The platform is Windows, but a trivial Linux port should at least be possible.)
Thank you for your effort.
For starters, not flushing the buffer for every value you write (i.e. dropping endl) seems like a good idea. It also seems possible to do the I/O asynchronously, as it is completely independent of the computation. You can also use mmap to improve the performance of file I/O.
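For example (my illustration, not the answerer's code), replacing endl with a plain newline lets the stream buffer batch the writes and flush only when the buffer fills or the file is closed:

#include <fstream>

int main() {
    std::ofstream outfile("data");
    const int chunkSize = 1000;
    double results_experiments[1000] = {};             // placeholder for the real results
    for (int i = 0; i < chunkSize; i++) {
        outfile << results_experiments[i] << '\n';     // '\n' does not flush; endl does
    }
}                                                      // one final flush when outfile is destroyed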
If the output doesn't have to be human-readable, then you could investigate a binary format. Storing data in binary format occupies less space than text format and therefore needs less disk i/o. But there'll be little difference if the data is all strings. So if you write out as much as possible as numbers and not formatted text you could get a big gain.
However I'm not sure if/how this is done with STL iostreams. The C-style way is using fopen(..., "wb") and fwrite(&object, ...).
I think Boost.Serialization can do binary output using the << operator.
Also, can you reduce the amount you write? e.g. no formatting or redundant text, just the bare minimum.
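As a rough illustration of the binary route (assuming the results are plain doubles, which may not match your real data): one bulk, unformatted write replaces thousands of formatted insertions.

#include <fstream>

int main() {
    const int chunkSize = 1000;
    double results_experiments[1000] = {};             // placeholder data
    std::ofstream outfile("data.bin", std::ios::binary);
    outfile.write(reinterpret_cast<const char*>(results_experiments),
                  chunkSize * sizeof(double));         // one bulk write, no text formatting
}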
Note that std::endl always inserts '\n' and then flushes the stream, so writing '\n' instead avoids a flush on every value.
You might also try increasing the buffer size of your ofstream
char *biggerbuffer = new char[512000];
outfile.rdbuf()->pubsetbuf(biggerbuffer, 512000);   // call this before the file is opened for it to take effect
The effect of pubsetbuf may vary depending on your iostream implementation.
My application continuously calculates strings and outputs them into a file. This runs for almost an entire day, but writing to the file is slowing my application down. Is there a way I can improve the speed? Also, I want to extend the application so that I can send the results to another system after some particular amount of time.
Thanks & Regards,
Mousey
There are several things that may or may not help you, depending on your scenario:
Consider using asynchronous I/O, for instance by using Boost.Asio. This way your application does not have to wait for expensive I/O-operations to finish. However, you will have to buffer your generated data in memory, so make sure there is enough available.
Consider buffering your strings to a certain size, and then write them to disk (or the network) in big batches. Few big writes are usually faster than many small ones.
If you want to make it really good C++, meaning STL-compliant, make your algorithm a template function that takes an output iterator as an argument. This way you can easily have it write to files, the network, memory or the console by providing an appropriate iterator (see the sketch after this list).
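A small sketch of that output-iterator idea (the function and file names are made up for illustration):

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

template <typename OutputIt>
void compute_strings(int n, OutputIt out) {
    for (int i = 0; i < n; ++i)
        *out++ = "result " + std::to_string(i);        // stand-in for the real calculation
}

int main() {
    std::ofstream file("results.txt");
    compute_strings(10, std::ostream_iterator<std::string>(file, "\n"));       // to a file
    compute_strings(3, std::ostream_iterator<std::string>(std::cout, "\n"));   // or to the console
}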
How about writing the results to a socket instead of a file? Another program, Y, would read from the socket, open a file, write to it and close it, and after the specified time would transfer the results to the other system.
I mean that the file handling is done by the other program; the original program X just sends the output to the socket and does not concern itself with flushing the file stream.
Also I want to extend the application so that I can send the results to another system after some particular amount of time.
If you just want to transfer the file to other system, then I think a simple script will be enough for that.
Use more than one file for the logging. Say, after your file reaches a size of 1 MB, rename it to something containing the date and time and start writing to a new one named as the original file.
then you have:
results.txt
results2010-1-2-1-12-30.txt (January 2 2010, 1:12:30)
and so on.
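A bare-bones sketch of that rotation scheme (mine; the 1 MB limit and file names follow the suggestion above, the timestamp format is just an example, and byte counting stands in for the real output):

#include <cstdio>
#include <ctime>
#include <fstream>
#include <string>

int main() {
    const std::string path = "results.txt";
    const std::size_t kMaxBytes = 1 << 20;                  // rotate after ~1 MB
    std::ofstream out(path);
    std::size_t written = 0;

    for (int i = 0; i < 200000; ++i) {
        std::string line = "some computed string\n";        // stand-in for the real output
        out << line;
        written += line.size();

        if (written >= kMaxBytes) {                         // archive and start over
            out.close();
            std::time_t t = std::time(nullptr);
            std::tm tm = *std::localtime(&t);
            char archived[64];
            std::snprintf(archived, sizeof(archived), "results%d-%d-%d-%d-%d-%d.txt",
                          tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
                          tm.tm_hour, tm.tm_min, tm.tm_sec);
            std::rename(path.c_str(), archived);            // results.txt -> timestamped name
            out.open(path);
            written = 0;
        }
    }
}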
You can buffer the results of different computations in memory and only write to the file when the buffer is full. For example, you can design your application so that it computes the results of 100 calculations and writes all 100 results at once to the file, then computes another 100, and so on.
Writing to a file is obviously slow, but you can buffer the data and start a separate thread for writing to the file. This can improve the speed of your application.
Secondly, you can use FTP to transfer the files to the other system.
I think there are some red herrings here.
On an older computer system, I would recommend caching the strings and doing a small number of large writes instead of a large number of small writes. On modern systems, the default disk-caching is more than adequate and doing additional buffering is unlikely to help.
I presume that you aren't disabling caching or opening the file for every write.
It is possible that there is some issue with writing very large files, but that would not be my first guess.
How big is the output file when you finish?
What causes you to think that the file is the bottleneck? Do you have profiling data?
Is it possible that there is a memory leak?
Any code or statistics you can post would help in the diagnosis.
I'm writing an external merge sort. It works like this: read k chunks from the big file, sort them in memory, perform a k-way merge, done. So I need to read sequentially from different portions of the file during the k-way merge phase. What's the best way to do that: several ifstreams, or one ifstream and seeking? Also, is there a library for easy async IO?
Use one ifstream at a time on the same file. More than one wastes resources, and you'd have to seek anyway (because by default the ifstream's file pointer starts at the beginning of the file).
As for a C++ async IO library, check out this question.
EDIT: I originally misunderstood what you are trying to do (this Wikipedia article filled me in). I don't know how much ifstream buffers by default, but you can turn off buffering by using the pubsetbuf(0, 0); method described here, and then do your own buffering. This may be slower, however, than using multiple ifstreams with automatic buffering. Some benchmarking is in order.
Definitely try the multiple streams. Seeking probably throws away internally buffered data (at least within the process, even if the OS retains it in cache), and if the items you're sorting are small that could be very costly indeed.
Anyway, it shouldn't be too hard to compare the performance of your two fstream strategies. Do a simple experiment with k = 2.
Note that there may be a limit on the number of simultaneously open files one process can have (ulimit -n). If you reach that, then you might want to consider using a single stream, but buffering data from each of your k chunks manually.
It might be worth mmapping the file and using multiple pointers, if the file is small enough (equivalently: your address space is large enough).
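For what it's worth, a rough sketch of the multiple-ifstream variant (mine, under invented assumptions: the sorted runs sit back to back in one binary file as fixed-size 64-bit integers, every run holds run_len values, and the file names are placeholders):

#include <cstdint>
#include <fstream>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
    const int k = 4;                                   // number of sorted runs in the file
    const std::size_t run_len = 1 << 20;               // values per run (hypothetical)
    std::vector<std::ifstream> runs;
    runs.reserve(k);
    std::vector<std::size_t> remaining(k, run_len);
    using Entry = std::pair<std::int64_t, int>;        // (value, index of the run it came from)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    auto pull = [&](int i) {                           // push run i's next value, if any remain
        if (remaining[i] == 0) return;
        std::int64_t v;
        runs[i].read(reinterpret_cast<char*>(&v), sizeof v);
        --remaining[i];
        heap.push({v, i});
    };

    for (int i = 0; i < k; ++i) {
        runs.emplace_back("runs.bin", std::ios::binary);
        runs[i].seekg(static_cast<std::streamoff>(i) * run_len * sizeof(std::int64_t));
        pull(i);                                       // seed the heap with each run's head
    }

    std::ofstream out("merged.bin", std::ios::binary);
    while (!heap.empty()) {
        auto [v, i] = heap.top();                      // smallest value across all runs
        heap.pop();
        out.write(reinterpret_cast<const char*>(&v), sizeof v);
        pull(i);                                       // refill from the run it came from
    }
}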