How to efficiently write to a large number of files - c++

I am trying to write a program to split a large collection of gene sequences into many files based on values inside a certain segment of each sequence. For example, the sequences might look like:
AGCATGAGAG...
GATCAGGTAA...
GATGCGATAG...
... 100 million more
The goal is then to split the reads into individual files based on the sequences from position 2 to 7 (6 bases). So we get something like
AAAAAA.txt.gz
AAAAAC.txt.gz
AAAAAG.txt.gz
...4000 more
Now naively I have implemented a C++ program that
reads in each sequence
opens the relevant file
writes in the sequence
closes the file
Something like
#include <zlib.h>
#include <string>
using std::string;

int main() {
    SeqFile seq_file("input.txt.gz");   // wrapper that yields one read per call
    string read;
    while (!(read = seq_file.get_read()).empty()) {
        string tag = read.substr(1, 6);                      // bases 2 to 7
        string output_path = tag + ".txt.gz";
        gzFile output = gzopen(output_path.c_str(), "ab");   // append to that tag's file
        gzprintf(output, "%s", read.c_str());
        gzclose(output);
    }
    return 0;
}
This is unbearably slow compared to just writing the whole contents into a single other file.
What is the bottleneck in this situation, and how might I improve performance given that I can't keep all the files open simultaneously due to system limits?

Since opening a file is slow, you need to reduce the number of files you open. One way to accomplish this is to make multiple passes over your input. Open a subset of your output files, make a pass over the input and only write data to those files. When you're done, close all those files, reset the input, open a new subset, and repeat.
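A minimal sketch of that multi-pass idea, assuming line-oriented gzipped input with the tag at positions 2-7 as in the question; the group size of 256 open files per pass is just a placeholder you would tune against your system's open-file limit:

#include <zlib.h>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Generate all 4^6 = 4096 possible 6-base tags.
static std::vector<std::string> all_tags() {
    const char bases[] = { 'A', 'C', 'G', 'T' };
    std::vector<std::string> tags;
    std::string t(6, 'A');
    for (int i = 0; i < 4096; ++i) {
        int x = i;
        for (int p = 0; p < 6; ++p) { t[p] = bases[x % 4]; x /= 4; }
        tags.push_back(t);
    }
    return tags;
}

int main() {
    const std::size_t kOpenAtOnce = 256;          // files kept open per pass (tune this)
    std::vector<std::string> tags = all_tags();

    for (std::size_t start = 0; start < tags.size(); start += kOpenAtOnce) {
        // Open this pass's subset of output files in append mode.
        std::map<std::string, gzFile> out;
        for (std::size_t i = start; i < start + kOpenAtOnce && i < tags.size(); ++i)
            out[tags[i]] = gzopen((tags[i] + ".txt.gz").c_str(), "ab");

        // Re-read the whole input, writing only the reads that belong to this subset.
        gzFile in = gzopen("input.txt.gz", "rb");
        char line[1024];
        while (gzgets(in, line, static_cast<int>(sizeof line)) != NULL) {
            std::string read(line);
            if (read.size() < 7) continue;        // skip malformed lines
            std::map<std::string, gzFile>::iterator it = out.find(read.substr(1, 6));
            if (it != out.end())
                gzputs(it->second, read.c_str()); // the read still carries its newline
        }
        gzclose(in);

        for (std::map<std::string, gzFile>::iterator it = out.begin(); it != out.end(); ++it)
            gzclose(it->second);
    }
    return 0;
}

With 4096 tags and 256 files per pass this reads the input 16 times, but each output file is opened once per pass instead of once per read.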

The bottleneck is the opening and closing of the output file. If you can move this out of the loop somehow, e.g. by keeping multiple output files open simultaneously, your program should speed up significantly. In the best case it is possible to keep all 4096 files open at the same time, but even if you hit some system limit, keeping a smaller number of files open and doing multiple passes through the input should be faster than opening and closing files in the tight loop.
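If the per-process limit can be raised high enough (for example with ulimit -n on Linux), a rough sketch of the keep-everything-open variant might look like this; handles are opened lazily on first use and closed only at the end:

#include <zlib.h>
#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<std::string, gzFile> outputs;        // tag -> open handle, opened lazily

    gzFile in = gzopen("input.txt.gz", "rb");
    char line[1024];
    while (gzgets(in, line, static_cast<int>(sizeof line)) != NULL) {
        std::string read(line);
        if (read.size() < 7) continue;            // skip malformed lines

        std::string tag = read.substr(1, 6);
        gzFile& out = outputs[tag];               // value-initialized to NULL on first use
        if (out == NULL)
            out = gzopen((tag + ".txt.gz").c_str(), "wb");
        gzputs(out, read.c_str());
    }
    gzclose(in);

    for (std::map<std::string, gzFile>::iterator it = outputs.begin(); it != outputs.end(); ++it)
        gzclose(it->second);
    return 0;
}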

The compression might be slowing the writing down; writing plain text files and compressing them afterwards could be worth a try.
Opening the file is the bottleneck. The data could be buffered in a container per tag, and when a bucket reaches a certain size the largest set could be written out to the corresponding file.
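That buffering idea could look roughly like the sketch below: reads accumulate in memory per tag, and a bucket is only flushed (open, write, close) once it has grown to a worthwhile size. The class name and the 1 MB threshold are made up.

#include <zlib.h>
#include <map>
#include <string>

// Accumulate reads per tag and flush a bucket once it exceeds a threshold,
// so each output file is opened far less often than once per read.
class TagBuffer {
public:
    explicit TagBuffer(std::size_t flush_bytes = 1 << 20) : flush_bytes_(flush_bytes) {}
    ~TagBuffer() { flush_all(); }

    void add(const std::string& tag, const std::string& read) {
        std::string& buf = buffers_[tag];
        buf += read;                              // reads are assumed to end with '\n'
        if (buf.size() >= flush_bytes_) flush(tag, buf);
    }

    void flush_all() {
        for (std::map<std::string, std::string>::iterator it = buffers_.begin();
             it != buffers_.end(); ++it)
            flush(it->first, it->second);
    }

private:
    void flush(const std::string& tag, std::string& buf) {
        if (buf.empty()) return;
        gzFile out = gzopen((tag + ".txt.gz").c_str(), "ab");
        if (out != NULL) {
            gzwrite(out, buf.data(), static_cast<unsigned>(buf.size()));
            gzclose(out);
        }
        buf.clear();
    }

    std::size_t flush_bytes_;
    std::map<std::string, std::string> buffers_;
};

The main loop then just calls buffer.add(tag, read); the threshold trades memory (up to roughly 4096 times the threshold) against the number of file opens.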

I can't actually answer the question, because to do that I would need access to YOUR system (or a reasonably precise replica). The type of disk and how it is connected, the amount and type of memory, and the CPU model and core count will all matter.
However, there are a few different things to consider that may well help (or at least tell you that "you can't do better than this").
First find out what takes up the time: CPU or disk-I/O?
Use top, a system monitor, or something similar to measure how much CPU your application uses.
Write a simple program that writes a single value (zero?) to a file, without zipping it, for a similar size to what you get in your files. Compare this to the time it takes to write your gzip-file. If the time is about the same, then you are I/O-bound, and it probably doesn't matter much what you do.
If you have high CPU usage, you may want to split the writing work into multiple threads. You obviously can't really do that with the reading, as it has to be sequential (reading gzip in multiple threads is not easy, if possible at all, so let's not try that). Use one thread per CPU core: if you have 4 cores, use one to read and three to write. You may not get 4 times the performance, but you should get a good improvement.
Quite certainly, at some point, you will be bound by the speed of the disk. Then the only option is to buy a better disk (if you haven't already got one!).

Related

Writing similar contents to many files at once in C++

I am working on a C++ program that needs to write several hundred ASCII files. These files will be almost identical. In particular, the size of the files is always exactly the same, with only a few characters differing between them.
For this I am currently opening up N files with a for-loop over fopen and then calling fputc/fwrite on each of them for every chunk of data (every few characters). This seems to work, but it feels like there should be some more efficient way.
Is there something I can do to decrease the load on the file system and/or improve the speed of this? For example, how taxing is it on the file system to keep hundreds of files open and write to all of them bit by bit? Would it be better to open one file, write that one entirely, close it and only then move on to the next?
If you consider the cost of the context switches usually involved in any of those syscalls, then yes, you should piggyback as much data as possible, taking into account the write times and the length of your buffers.
Given also that this is primarily an I/O-driven problem, a pub/sub architecture, where the publisher buffers the data and hands it to a subscriber that does the I/O work (and waits for the underlying storage to be ready), could be a good choice.
You can write just once to one file and then make copies of that file. You can read about how to make copies here.
This is the sample code from the linked page showing how to do it (note that it uses Managed C++ / .NET syntax):
int main() {
    String* path = S"c:\\temp\\MyTest.txt";
    String* path2 = String::Concat(path, S"temp");
    // Ensure that the target does not exist.
    File::Delete(path2);
    // Copy the file.
    File::Copy(path, path2);
    Console::WriteLine(S"{0} copied to {1}", path, path2);
    return 0;
}
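For standard (non-.NET) C++, a rough equivalent using std::filesystem from C++17 might look like this; the file names are hypothetical:

#include <filesystem>
#include <iostream>
#include <system_error>

int main() {
    namespace fs = std::filesystem;
    const fs::path src = "master.txt";            // the file written once
    const fs::path dst = "copy_001.txt";          // one of the many near-identical copies

    std::error_code ec;
    fs::copy_file(src, dst, fs::copy_options::overwrite_existing, ec);
    if (ec)
        std::cerr << "copy failed: " << ec.message() << '\n';
    else
        std::cout << src << " copied to " << dst << '\n';
    return 0;
}

After copying, the few differing characters could then be patched in place with an std::fstream opened for reading and writing, which touches far less data than rewriting each file from scratch.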
Without benchmarking your particular system, I would GUESS - and that is probably as good as you can get - that writing one file at a time is better than opening lots of files and writing the data to several files. After all, preparing the data in memory is a minor detail; the writing to the file is the "long process".
I have done some testing now and it seems like, at least on my system, writing all files in parallel is about 60% slower than writing them one after the other (263s vs. 165s for 100 files times 100000000 characters).
I also tried to use ofstream instead of fputc, but fputc seems to be about twice as fast.
In the end, I will probably keep doing what I am doing at the moment, since the complexity of rewriting my code to write one file at a time is not worth the performance improvement.

method to read multiple lines from a file at once without partial lines

I'm reading in from a CSV file, parsing it, and storing the data, pretty simple.
Right now we're using the standard readLine() method to do that, and I'm trying to squeeze some extra efficiency out of this processing loop. I don't know how much it hides behind the scenes, but I assume each call to readLine() is a new OS call with all the pain that entails. I don't want to pay for OS calls on each line of input; I would rather provide a huge buffer and have it filled with many lines at once.
However, I only care about full lines. I don't want to have to handle maintaining partial lines from one buffer read to append to the second buffer read to make a full line, that's just ugly and annoying.
So, is there a method out there that does this for me? It seems like there almost has to be. Any method which I can instruct to read in x number of lines, or x bytes but don't output the last partial line, or even an easy way for me to manage the memory buffer so I minimize the amount of code for handling partial strings would be appreciated. I can use Boost, though if there is a method in standard C++ I would prefer that.
Thanks.
It's very unlikely that you'll be able to do better than the built-in C++ streams. They're quite fast. In general, the fastest way to completely read a file is to use a single thread to read the entire file from start to end, especially if the file is contiguous on disk. Furthermore, it's likely that the disk is much more of a bottleneck during reading than the OS. If you need to improve the performance of your app, I have a few recommendations.
Use a profiler. If your app is reading a line then parsing it or processing it in some way, it's possible that the parsing or processing is something that can be optimized. This can be determined in profiling. If parsing or processing takes up substantial CPU resources, then optimization may be worth the effort.
If you determine that parsing or processing is responsible for a slow application, and that it can't be easily optimized, consider multiprogramming. If the processing of individual lines does not depend on the results of previous lines being processed, then use multiple threads or CPUs to do the processing.
Use pipelining if you have to process multiple files. For example, suppose you have four stages in your app: reading, parsing, processing, saving. It may be more efficient to read one file at a time rather than all of them at once. However, while reading the second file, you can still parse the first one. While reading the third file, you can parse the second file and process the first one, and so on. One way to implement this is a staged, multi-threaded application design.
Use RAID to improve disk reads. Certain RAID modes can provide faster reads and writes.
I am a Java programmer, but I still have a hint: read the data in a stream, meaning for example 4 or 5 times 2048 bytes (or much more) at a time. You can iterate over the buffer (and convert it) and search for your line ends (or some other character), but I think readLine is doing the same thing anyway.
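There is no standard call that returns "as many whole lines as fit in a buffer", but a thin wrapper can hide the partial-line bookkeeping. A rough sketch (the class name and file name are made up):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read roughly chunk_size bytes at a time but only hand back complete lines;
// the trailing partial line is kept internally and prepended to the next chunk.
class LineChunkReader {
public:
    LineChunkReader(std::istream& in, std::size_t chunk_size)
        : in_(in), chunk_(chunk_size) {}

    // Fills 'lines' with the complete lines found in the next chunk.
    // Returns false once the stream is exhausted and nothing remains.
    bool next(std::vector<std::string>& lines) {
        lines.clear();
        std::vector<char> buf(chunk_);
        in_.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in_.gcount();
        if (got == 0) {
            if (!tail_.empty()) { lines.push_back(tail_); tail_.clear(); }
            return !lines.empty();
        }
        std::string data = tail_;
        data.append(buf.data(), static_cast<std::size_t>(got));
        std::size_t start = 0, nl;
        while ((nl = data.find('\n', start)) != std::string::npos) {
            lines.push_back(data.substr(start, nl - start));
            start = nl + 1;
        }
        tail_ = data.substr(start);               // carry the partial line over
        return true;
    }

private:
    std::istream& in_;
    std::size_t chunk_;
    std::string tail_;
};

int main() {
    std::ifstream in("data.csv");                 // hypothetical input file
    LineChunkReader reader(in, 1 << 20);          // ~1 MB per read() call
    std::vector<std::string> lines;
    while (reader.next(lines))
        for (std::size_t i = 0; i < lines.size(); ++i)
            std::cout << lines[i] << '\n';        // parse/store each CSV line here
    return 0;
}

Note that std::ifstream already buffers internally, so measure before assuming this beats a plain getline loop.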

Optimizing IO in C++

I'm having trouble optimizing a C++ program for the fastest runtime possible.
The requirement is to output the absolute value of the difference of two long integers, fed to the program through a file, i.e.:
./myprogram < unkownfilenamefullofdata
The file name is unknown, and it has two numbers per line, separated by a space. There is an unknown amount of test data. I created two files of test data: one has the extreme cases and is 5 runs long; for the other, I used a Java program to generate 2,000,000 random numbers and output them to a timedrun file - 18 MB worth of tests.
The massive file runs at 3.4 seconds. I need to break that down to 1.1 seconds.
This is my code:
#include <cstdio>

int main() {
    long int a, b;
    while (scanf("%li %li", &a, &b) > -1) {
        if (b >= a)
            printf("%li\n", b - a);
        else
            printf("%li\n", a - b);
    } // end while
    return 0;
} // end main
I ran Valgrind on my program, and it showed that a lot of hold-up was in the read and write portion. How would I rewrite print/scan to the most raw form of C++ if I know that I'm only going to be receiving a number? Is there a way that I can scan the number in as a binary number, and manipulate the data with logical operations to compute the difference? I was also told to consider writing a buffer, but after ~6 hours of searching the web, and attempting the code, I was unsuccessful.
Any help would be greatly appreciated.
What you need to do is load the whole string into memory, and then extract the numbers from there, rather than making repeated I/O calls. However, what you may well find is that it simply takes a lot of time to load 18MB off the hard drive.
You can improve greatly on scanf because you can guarantee the format of your file. Since you know exactly what the format is, you don't need as many error checks. Also, printf does a conversion on the new line to the appropriate line break for your platform.
I have used code similar to that found in this SPOJ forum post (see nosy's post half-way down the page) to obtain quite large speed-ups in the reading integers area. You will need to modify it to deal with negative numbers. Hopefully it will give you some ideas about how to write a faster printf function as well, but I would start with replacing scanf and see how far that gets you.
As you suggest the problem is reading all these numbers in and converting from text to binary.
The best improvement would be to write the numbers out from whatever program generates them as binary. This will significantly reduce the amount of data that has to be read from the disk, and slightly reduce the time needed to convert from text to binary.
You say that 2,000,000 numbers occupy 18MB = 9 bytes per number. This includes the spaces and the end of line markers, so sounds reasonable.
Storing the numbers as 4-byte integers will halve the amount of data that must be read from the disk. Along with the saving on format conversion, it would be reasonable to expect a doubling of performance.
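To illustrate the binary-format suggestion, a rough sketch of both sides (the file name is hypothetical, and it assumes the generator and the solver run on the same machine so integer size and endianness match, and that the values fit in 32 bits):

#include <cstdint>
#include <cstdio>

int main() {
    // Writer side: this would live in the program that generates the test data.
    {
        std::FILE* out = std::fopen("pairs.bin", "wb");
        if (!out) return 1;
        for (std::int32_t i = 0; i < 1000; ++i) {
            std::int32_t pair[2] = { i, 2 * i + 1 };   // stand-in for the real numbers
            std::fwrite(pair, sizeof pair[0], 2, out);
        }
        std::fclose(out);
    }

    // Reader side: no text-to-binary conversion at all.
    std::FILE* in = std::fopen("pairs.bin", "rb");
    if (!in) return 1;
    std::int32_t pair[2];
    while (std::fread(pair, sizeof pair[0], 2, in) == 2) {
        std::int32_t diff = pair[1] >= pair[0] ? pair[1] - pair[0] : pair[0] - pair[1];
        std::printf("%ld\n", static_cast<long>(diff));
    }
    std::fclose(in);
    return 0;
}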
Since you need even more, something more radical is required. You should consider splitting the data across separate files, each on its own disk, and then processing each file in its own process. If you have 4 cores, split the processing into 4 separate processes and connect 4 high-performance disks, and you might hope for another doubling of performance. The bottleneck then becomes the OS's disk management, and it is impossible to guess how well the OS will manage the four disks in parallel.
I assume that this is a grossly simplified model of the processing you need to do. If your description is all there is to it, the real solution would be to do the subtraction in the program that writes the test files!
Even better than opening the file in your program and reading it all at once, would be memory-mapping it. ~18MB is no problem for the ~2GB address space available to your program.
Then use strtol to read each number and advance the pointer.
I'd expect a 5-10x speedup compared to input redirection and scanf.
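A POSIX-only sketch of that memory-mapping approach, assuming stdin is redirected from a regular file as in the example invocation (error handling kept minimal):

#include <cstdio>
#include <cstdlib>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    // Works when stdin is redirected from a regular file, as in
    //   ./myprogram < unkownfilenamefullofdata
    struct stat st;
    if (fstat(STDIN_FILENO, &st) != 0 || st.st_size == 0)
        return 1;

    char* data = static_cast<char*>(
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, STDIN_FILENO, 0));
    if (data == MAP_FAILED)
        return 1;

    const char* p = data;
    const char* end = data + st.st_size;
    while (p < end) {
        // Skip whitespace ourselves so the loop terminates cleanly at the end.
        while (p < end && (*p == ' ' || *p == '\n' || *p == '\r')) ++p;
        if (p == end) break;

        char* next;
        long a = std::strtol(p, &next, 10);
        if (next == p) break;                     // no more numbers
        long b = std::strtol(next, &next, 10);
        p = next;
        std::printf("%li\n", a > b ? a - b : b - a);
    }

    munmap(data, st.st_size);
    return 0;
}

The printf per line is still relatively expensive; building the output in one large buffer and writing it with a single fwrite at the end would remove that cost as well.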

Need some help writing my results to a file

My application continuously calculates strings and outputs them into a file. This has been running for almost an entire day, but writing to the file is slowing my application down. Is there a way I can improve the speed? Also, I want to extend the application so that I can send the results to another system after some particular amount of time.
Thanks & Regards,
Mousey
There are several things that may or may not help you, depending on your scenario:
Consider using asynchronous I/O, for instance by using Boost.Asio. This way your application does not have to wait for expensive I/O-operations to finish. However, you will have to buffer your generated data in memory, so make sure there is enough available.
Consider buffering your strings to a certain size, and then write them to disk (or the network) in big batches. Few big writes are usually faster than many small ones.
If you want to make it really good C++, meaning STL-compliant, make your algorithm a template function that takes an output iterator as an argument. This way you can easily have it write to files, the network, memory or the console by providing an appropriate iterator.
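A minimal sketch of that output-iterator idea (the function and file names are made up):

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// The algorithm writes to any destination; the caller decides whether that is
// a file, the console, or a container in memory.
template <typename OutputIt>
void compute_strings(std::size_t n, OutputIt out) {
    for (std::size_t i = 0; i < n; ++i)
        *out++ = "result " + std::to_string(i);   // stand-in for the real calculation
}

int main() {
    // Write straight to a file...
    std::ofstream file("results.txt");
    compute_strings(100, std::ostream_iterator<std::string>(file, "\n"));

    // ...or buffer in memory and decide later what to do with it.
    std::vector<std::string> buffered;
    compute_strings(100, std::back_inserter(buffered));
    return 0;
}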
What if you write the results to a socket instead of a file? Another program, Y, will read from the socket, open a file, write to it and close it, and after the specified time will transfer the results to another system.
I mean that the file handling is done by the other program; the original program X just sends its output to the socket and does not concern itself with flushing the file stream.
Also I want to extend the application so that I can send the results to another system after some particular amount of time.
If you just want to transfer the file to other system, then I think a simple script will be enough for that.
Use more than one file for the logging. Say, after your file reaches a size of 1 MB, rename it to something that contains the date and time, and start writing to a new one under the original file name.
then you have:
results.txt
results2010-1-2-1-12-30.txt (January 2 2010, 1:12:30)
and so on.
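A rough sketch of that rotation scheme (the file name, the threshold, and the timestamp format are just placeholders):

#include <cstdio>      // std::rename
#include <ctime>
#include <fstream>
#include <string>

// When results.txt grows past max_bytes, rename it with a timestamp and
// reopen a fresh results.txt so logging continues under the original name.
void rotate_if_needed(std::ofstream& out, std::size_t max_bytes = 1024 * 1024) {
    if (static_cast<std::size_t>(out.tellp()) < max_bytes) return;
    out.close();

    std::time_t now = std::time(NULL);
    char stamp[32];
    std::strftime(stamp, sizeof stamp, "%Y-%m-%d-%H-%M-%S", std::localtime(&now));
    std::rename("results.txt", (std::string("results") + stamp + ".txt").c_str());

    out.open("results.txt");
}

The writer would call rotate_if_needed(out) every so often, e.g. after each batch of results.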
You can buffer the results of different computations in memory and only write to the file when the buffer is full. For example, you can design your application so that it computes the results of 100 calculations and writes all 100 results to the file at once, then computes another 100, and so on.
Writing to a file is obviously slow, but you can buffer the data and start a separate thread for writing to the file. This can improve the speed of your application.
Secondly, you can use FTP to transfer the files to the other system.
I think there are some red herrings here.
On an older computer system, I would recommend caching the strings and doing a small number of large writes instead of a large number of small writes. On modern systems, the default disk-caching is more than adequate and doing additional buffering is unlikely to help.
I presume that you aren't disabling caching or opening the file for every write.
It is possible that there is some issue with writing very large files, but that would not be my first guess.
How big is the output file when you finish?
What causes you to think that the file is the bottleneck? Do you have profiling data?
Is it possible that there is a memory leak?
Any code or statistics you can post would help in the diagnosis.

Many small files or one big file? (Or, Overhead of opening and closing file handles) (C++)

I have created an application that does the following:
Make some calculations and write the calculated data to a file; repeat 500,000 times (overall, 500,000 files are written one after the other); repeat 2 more times (overall, 1.5 million files are written).
Read data from a file and make some intense calculations with it; repeat for 1,500,000 iterations (iterating over all the files written in step 1).
Repeat step 2 for 200 iterations.
Each file is ~212 KB, so overall I have ~300 GB of data. It looks like the entire process takes ~40 days on a 2.8 GHz Core 2 Duo CPU.
My problem is (as you can probably guess) the time it takes to complete the entire process. All the calculations are serial (each calculation depends on the one before), so I can't parallelize this process across different CPUs or PCs. I'm trying to think how to make the process more efficient, and I'm pretty sure most of the overhead goes to file system access (duh...). Every time I access a file I open a handle to it and then close it once I finish reading the data.
One of my ideas to improve the run time was to use one big file of 300 GB (or several big files of 50 GB each); then I would only use one open file handle and simply seek to each piece of relevant data and read it, but I'm not sure what the overhead of opening and closing file handles is. Can someone shed some light on this?
Another idea I had was to group the files into bigger ~100 MB files and then read 100 MB each time instead of doing many 212 KB reads, but this is much more complicated to implement than the idea above.
Anyway, if anyone can give me some advice on this or has any idea how to improve the run time, I would appreciate it!
Thanks.
Profiler update:
I ran a profiler on the process, and it looks like the calculations take 62% of the runtime and the file reads take 34%. That means even if I miraculously cut the file I/O cost entirely, I'm still left with ~24 days, which is quite an improvement, but still a long time :)
Opening a file handle isn't likely to be the bottleneck; actual disk I/O is. If you can parallelize disk access (e.g. by using multiple disks, faster disks, a RAM disk, ...) you may benefit far more. Also, make sure I/O doesn't block the application: read from disk and process while waiting for I/O, e.g. with a reader thread and a processor thread.
Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk? Maybe with another view on the process' dependencies you can rework the data flow and get rid of a lot of IO.
Oh yes, and measure it :)
Each file is ~212 KB, so overall I have ~300 GB of data. It looks like the entire process takes ~40 days ... all the calculations are serial (each calculation is dependent on the one before), so I can't parallelize this process across different CPUs or PCs. ... pretty sure most of the overhead goes to file system access ... Every time I access a file I open a handle to it and then close it once I finish reading the data.
Writing 300 GB of data serially might take 40 minutes, only a tiny fraction of 40 days. Disk write performance shouldn't be an issue here.
Your idea of opening the file only once is spot-on. Probably closing the file after every operation is causing your processing to block until the disk has completely written out all the data, negating the benefits of disk caching.
My bet is the fastest implementation of this application will use a memory-mapped file; all modern operating systems have this capability. It can end up being the simplest code, too. You'll need a 64-bit processor and operating system, but you should not need 300 GB of RAM. Map the whole file into the address space at once and just read and write your data with pointers.
From your brief explanation, it sounds like xtofl's suggestion of threads is the correct way to go. I would recommend you profile your application first, though, to ensure that the time really is divided between I/O and CPU.
Then I would consider three threads joined by two queues.
Thread 1 reads files and loads them into RAM, then places the data/pointers in the queue. If the queue goes over a certain size the thread sleeps; if it goes below a certain size it starts again.
Thread 2 reads the data off the queue, does the calculations, then writes the results to the second queue.
Thread 3 reads from the second queue and writes the data to disk.
You could consider merging threads 1 and 3, which might reduce contention on the disk as your app would only do one disk operation at a time.
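A hedged sketch of that reader, worker, writer layout using a small bounded queue (the class name, capacities, output file, and the empty-string stop marker are all made up):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Minimal bounded queue: push blocks when full, pop blocks when empty.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < cap_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::size_t cap_;
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

int main() {
    BoundedQueue<std::string> to_worker(16), to_writer(16);

    std::thread reader([&] {
        for (int i = 0; i < 100; ++i)             // stand-in for reading the 1.5M files
            to_worker.push("file contents " + std::to_string(i));
        to_worker.push("");                       // empty string used as a stop marker
    });
    std::thread worker([&] {
        for (std::string s; !(s = to_worker.pop()).empty();)
            to_writer.push(s + " (processed)");   // stand-in for the heavy calculation
        to_writer.push("");
    });
    std::thread writer([&] {
        std::ofstream out("results.dat");         // hypothetical output file
        for (std::string s; !(s = to_writer.pop()).empty();)
            out << s << '\n';
    });

    reader.join();
    worker.join();
    writer.join();
    return 0;
}

Since the calculations themselves are serial, only the reading and writing can be overlapped with them, which is exactly what the two queues do.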
Also, how does the operating system handle all the files? Are they all in one directory? What is performance like when you browse the directory (GUI file manager / dir / ls)? If this performance is bad, you might be working outside your file system's comfort zone. Although you can only change this easily on Unix, some file systems are optimised for different types of file usage, e.g. large files, lots of small files, etc. You could also consider splitting the files across different directories.
Before making any changes it might be useful to run a profiler trace to figure out where most of the time is spent to make sure you actually optimize the real problem.
What about using SQLite? I think you can get away with a single table.
Using memory mapped files should be investigated as it will reduce the number of system calls.