I've been struggling with the problem of reading from standard input as fast as possible. My task is to perform a few ad hoc operations on input integers. The input is a sequence of integers separated by spaces, with EOF marking the end of the input. I've already used a loop like
while (cin >> a) {
    // here are the operations
}
but that's not fast enough. I've heard that writing my own function that reads only integers might be the way to go, but I have no idea how to implement that kind of function. Do you know anything about that?
So what is a faster way of reading standard input in C++? Please share, because a fast input routine is useful for anyone who needs a really fast program...
Thank you in advance; I'm looking forward to your response!
There are at least two bottlenecks; one is out of the control of your program.
Bottleneck 1 - Input into your program.
The first bottleneck is obtaining the data from the input source. Whether it is a disk drive, keyboard, or mouse, it's going to be slower than your program runs. The OS is in control of this bottleneck.
You can make this more efficient by performing block reads, but more on this later.
Bottleneck 2 - Text to internal representation
The second bottleneck involves converting one or more characters into an internal number. When dealing with text, there is no way around this; it must be done.
Optimizing Text Numeric Input
The fastest method to process text numbers from the input is to read the input into a large buffer and convert the text in memory to a number.
So, using C++ streams as an example:
const unsigned int BUFFER_SIZE = 1024u;
unsigned char buffer[BUFFER_SIZE];
cin.read((char *)buffer, BUFFER_SIZE); // Read 1k at a time.
The next step, converting text to internal representation, has many possibilities.
I suggest using std::istringstream.
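For example, here is a minimal sketch that reads all of standard input into one string and then parses it with std::istringstream; the summation is just a placeholder for whatever ad hoc operations you need:
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

int main() {
    // Pull all of stdin into one string, then parse it in memory.
    std::string data((std::istreambuf_iterator<char>(std::cin)),
                     std::istreambuf_iterator<char>());
    std::istringstream in(data);

    long long sum = 0;   // placeholder for the real ad hoc operations
    long value;
    while (in >> value)
        sum += value;

    std::cout << sum << '\n';
    return 0;
}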
Reading from cin
When reading from cin, all optimizations can be thrown in the trash. The time spent waiting for the user to enter a number is so huge compared to the processor's execution time that optimization won't gain you anything noticeable. For example, if I optimize your program to save you 100 milliseconds, the user won't see any effect, because entering a number takes the user thousands of milliseconds (even at a high typing speed of 60 words per minute). If you want faster processing, you'll have to put the text into a file.
Microoptimizations
This kind of optimization is called a micro-optimization. You may be optimizing an area of code that is not executed as often as another area. Also, if you make changes to your program, the optimization could be wrecked.
Concentrate on correctness and robustness before you optimize.
Related
I am trying to write a program to split a large collection of gene sequences into many files based on values inside a certain segment of each sequence. For example, the sequences might look like
AGCATGAGAG...
GATCAGGTAA...
GATGCGATAG...
... 100 million more
The goal is then to split the reads into individual files based on the bases at positions 2 to 7 (6 bases). So we get something like
AAAAAA.txt.gz
AAAAAC.txt.gz
AAAAAG.txt.gz
...4000 more
Naively, I have implemented a C++ program that
reads in each sequence
opens the relevant file
writes in the sequence
closes the file
Something like
#include <zlib.h>
#include <string>
using namespace std;

int main() {
    SeqFile seq_file("input.txt.gz");   // SeqFile: my own gzip-backed reader (not shown)
    string read;
    while (seq_file.get_read(read)) {
        string tag = read.substr(1, 6);
        string output_path = tag + ".txt.gz";
        gzFile output = gzopen(output_path.c_str(), "a");
        gzprintf(output, "%s\n", read.c_str());
        gzclose(output);
    }
    return 0;
}
This is unbearably slow compared to just writing the whole contents into a single other file.
What is the bottleneck in this situation, and how might I improve performance given that I can't keep all the files open simultaneously due to system limits?
Since opening a file is slow, you need to reduce the number of files you open. One way to accomplish this is to make multiple passes over your input. Open a subset of your output files, make a pass over the input and only write data to those files. When you're done, close all those files, reset the input, open a new subset, and repeat.
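For illustration, here is a rough sketch of that multi-pass idea. It assumes plain-text input and output (the real code would go through zlib as in the question), well-formed reads of at least 7 bases per line, and one pass per leading base so that at most 4^5 = 1024 output files are open at a time:
#include <fstream>
#include <map>
#include <string>

int main() {
    const std::string bases = "ACGT";
    for (char first : bases) {                      // one pass per leading base
        std::ifstream input("input.txt");           // reopen (reset) the input each pass
        std::map<std::string, std::ofstream> outputs;
        std::string read;
        while (std::getline(input, read)) {
            std::string tag = read.substr(1, 6);    // bases at positions 2-7
            if (tag[0] != first)                    // not in this pass's subset
                continue;
            std::ofstream &out = outputs[tag];      // opened lazily, then kept open
            if (!out.is_open())
                out.open(tag + ".txt", std::ios::app);
            out << read << '\n';
        }
    }                                               // all files close when the pass ends
    return 0;
}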
The bottleneck is the opening and closing of the output file. If you can move this out of the loop somehow, e.g. by keeping multiple output files open simultaneously, your program should speed up significantly. In the best-case scenario it is possible to keep all 4096 files open at the same time, but if you hit some system limit, even keeping a smaller number of files open and doing multiple passes through the input should be faster than opening and closing files in the tight loop.
The compression might be slowing the writing down; writing plain text files and compressing them afterwards could be worth a try.
Opening the file is a bottleneck. Some of the data could be stored in a container, and when it reaches a certain size, the largest set could be written to its corresponding file.
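A sketch of that buffering idea, again assuming plain-text files for brevity (the real code would use zlib) and an arbitrary flush threshold of 1000 reads per tag:
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Append one tag's buffered reads to its file, then clear the buffer.
static void flush_tag(const std::string &tag, std::vector<std::string> &reads) {
    std::ofstream out(tag + ".txt", std::ios::app);   // one open per flush, not per read
    for (const std::string &r : reads)
        out << r << '\n';
    reads.clear();
}

int main() {
    std::unordered_map<std::string, std::vector<std::string>> buffers;
    std::ifstream input("input.txt");
    std::string read;
    while (std::getline(input, read)) {
        std::string tag = read.substr(1, 6);
        std::vector<std::string> &buf = buffers[tag];
        buf.push_back(read);
        if (buf.size() >= 1000)                        // arbitrary flush threshold
            flush_tag(tag, buf);
    }
    for (auto &entry : buffers)                        // flush the leftovers
        flush_tag(entry.first, entry.second);
    return 0;
}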
I can't actually answer the question, because to do that I would need access to YOUR system (or a reasonably precise replica). The type of disk and how it is connected, the amount and type of memory, and the CPU model and core count will all matter.
However, there are a few different things to consider, and that may well help (or at least tell you that "you can't do better than this").
First find out what takes up the time: CPU or disk I/O?
Use top, a system monitor, or some such tool to measure how much CPU your application uses.
Write a simple program that writes a single value (zero?) to a file, without zipping it, producing roughly the size you get in your real files. Compare this to the time it takes to write your gzip file. If the times are about the same, then you are I/O-bound, and it probably doesn't matter much what you do.
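A sketch of such a test program; the target size and output file name are placeholders, and you would time it externally (e.g. with time ./a.out):
#include <cstddef>
#include <fstream>
#include <vector>

int main() {
    const std::size_t target_size = 100u * 1024 * 1024;   // placeholder: ~100 MB
    std::vector<char> zeros(1024 * 1024, 0);               // 1 MB of zero bytes
    std::ofstream out("zeros.bin", std::ios::binary);
    for (std::size_t written = 0; written < target_size; written += zeros.size())
        out.write(zeros.data(), zeros.size());             // no compression at all
    return 0;
}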
If you have a lot of CPU usage, you may want to split the writing work into multiple threads. You obviously can't really do that with the reading, as it has to be sequential (reading gzip in multiple threads is not easy, if at all possible, so let's not try that). Use one thread per CPU core: if you have 4 cores, use one to read and three to write. You may not get 4 times the performance, but you should get a good improvement.
Quite certainly, at some point, you will be bound by the speed of the disk. Then the only option is to buy a better disk (if you haven't already got that!)
I was working on a C++ tutorial exercise that asked to count the number of words in a file. It got me thinking about the most efficient way to read the inputs. How much more efficient is it really to read the entire file at once than it is to read small chunks (line by line or character by character)?
The answer changes depending on how you're doing the I/O.
If you're using the POSIX open/read/close family, reading one byte at a time will be excruciating since each byte will cost one system call.
If you're using the C fopen/fread/fclose family or the C++ iostream library, reading one byte at a time still isn't great, but it's much better. These libraries keep an internal buffer and only call read when it runs dry. However, since you're doing something very trivial for each byte, the per-call overhead will still likely dwarf the per-byte processing you actually have to do. But measure it and see for yourself.
Another option is to simply mmap the entire file and just do your logic on that. You might, or might not, notice a performance difference between mmap with and without the MAP_POPULATE flag. Again, you'll have to measure it and see.
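For illustration, a POSIX-only sketch of the mmap approach applied to the word-count exercise; the file name is a placeholder and MAP_POPULATE is Linux-specific (drop it on other systems):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cctype>
#include <cstdio>

int main() {
    int fd = open("input.txt", O_RDONLY);        // placeholder file name
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only; MAP_POPULATE asks the kernel to prefault it.
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char *data = static_cast<const char *>(p);

    long words = 0;
    bool in_word = false;
    for (off_t i = 0; i < st.st_size; ++i) {
        bool space = std::isspace(static_cast<unsigned char>(data[i])) != 0;
        if (!space && !in_word)
            ++words;                              // a new word starts here
        in_word = !space;
    }
    std::printf("%ld words\n", words);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}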
The most efficient method for I/O is to keep the data flowing.
That said, reading one block of 512 characters is faster than 512 reads of 1 character. Your system may have optimizations, such as caches, that make reading faster, but you still have the overhead of all those function calls.
There are different methods to keep the I/O flowing:
Memory mapped file I/O
Double buffering
Platform Specific API
Some simple experiments should suffice for demonstration (a sketch follows the list below).
Create a vector or array of 1 megabyte.
Start a timer.
Repeat 1000 times:
Read data into container using 1 read instruction.
End the timer.
Repeat, using a for loop, reading 1,000,000 characters, with 1 read instruction each.
Compare your data.
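A rough sketch of that experiment using C stdio; "data.bin" is a placeholder and must be at least 1 MB, and the single-character pass is run once rather than 1000 times to keep the runtime reasonable:
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t MB = 1024 * 1024;
    std::vector<char> container(MB);                  // the 1 megabyte container

    std::FILE *f = std::fopen("data.bin", "rb");
    if (!f) return 1;

    // Experiment 1: 1000 iterations, one 1 MB read instruction each.
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) {
        std::rewind(f);
        std::fread(container.data(), 1, MB, f);
    }
    auto block_time = std::chrono::steady_clock::now() - start;

    // Experiment 2: 1,000,000 characters, one read instruction per character.
    start = std::chrono::steady_clock::now();
    std::rewind(f);
    for (std::size_t j = 0; j < MB; ++j)
        std::fread(&container[j], 1, 1, f);
    auto char_time = std::chrono::steady_clock::now() - start;

    using ms = std::chrono::milliseconds;
    std::printf("block reads: %lld ms, single-char reads: %lld ms\n",
                (long long)std::chrono::duration_cast<ms>(block_time).count(),
                (long long)std::chrono::duration_cast<ms>(char_time).count());
    std::fclose(f);
    return 0;
}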
Details
For each request from the hard drive, the following steps are performed (depending on platform optimizations):
Start hard drive spinning.
Read filesystem directory.
Search directory for the filename.
Get logical position of the byte requested.
Seek to the given track & sector.
Read 1 or more sectors of data into hard drive memory.
Return the requested portion of hard drive memory to the platform.
Spin down the hard drive.
This is called overhead (everything except the step that actually reads the sectors).
The objective is to transfer as much data as possible while the hard drive is spinning. Starting a hard drive takes more time than keeping it spinning.
I have a piece of software that performs a set of experiments (C++).
Without storing the outcomes, all experiments take a little over a minute.
The total amount of data generated is equal to 2.5 Gbyte, which is too large to store in memory till the end of the experiment and write to file afterwards.
Therefore I write them in chunks.
for (int i = 0; i < chunkSize; i++) {
    outfile << results_experiments[i] << endl;
}
where
ofstream outfile("data");
and outfile is only closed at the end.
However, when I write them out in chunks of 4700 kbytes (actually 4700/chunkSize = size of one results_experiments element), the experiments take about 50 times longer (over an hour...). This is unacceptable and makes my prior optimization attempts look rather silly, especially since these experiments again need to be performed with many different parameter settings etc. (at least 100 times, but preferably more).
Concretely, my questions are:
What would be the ideal chunksize to write at?
Is there a more efficient way than (or something very inefficient in) the way I write data currently?
Basically: help me get the file I/O overhead as small as possible.
I think it should be possible to do this a lot faster, as copying (writing & reading!) the resulting file (same size) takes me under a minute...
The code should be fairly platform independent and not use any (non-standard) libraries (I can provide separate versions for separate platforms & more complicated install instructions, but it is a hassle...).
If it is not feasible to get the total experiment time under 5 minutes without platform/library dependencies (and possibly even with them), I will seriously consider introducing these. (The platform is Windows, but a trivial Linux port should at least be possible.)
Thank you for your effort.
For starters, not flushing the buffer for every chunk seems like a good idea. It also seems possible to do the I/O asynchronously, as it is completely independent of the computation. You can also use mmap to improve the performance of file I/O.
If the output doesn't have to be human-readable, then you could investigate a binary format. Storing data in binary format occupies less space than text format and therefore needs less disk I/O. But there'll be little difference if the data is all strings. So if you write out as much as possible as numbers and not formatted text, you could get a big gain.
However I'm not sure if/how this is done with STL iostreams. The C-style way is using fopen(..., "wb") and fwrite(&object, ...).
I think Boost.Serialization can do binary output using the << operator.
Also, can you reduce the amount you write? e.g. no formatting or redundant text, just the bare minimum.
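For illustration, a minimal sketch of the C-style binary output mentioned above, assuming the results are plain doubles and using a placeholder file name:
#include <cstdio>
#include <vector>

// Write a whole chunk with a single unformatted call instead of one
// formatted write per value.
static void write_chunk_binary(const std::vector<double> &results, std::FILE *out) {
    std::fwrite(results.data(), sizeof(double), results.size(), out);
}

int main() {
    std::FILE *out = std::fopen("data.bin", "wb");
    if (!out) return 1;
    std::vector<double> chunk = {1.0, 2.5, 3.75};   // stand-in for one chunk of results
    write_chunk_binary(chunk, out);
    std::fclose(out);
    return 0;
}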
std::endl flushes the stream after every line; writing '\n' instead avoids that per-line flush, which can make a big difference when writing to an ofstream.
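A minimal sketch of the difference, with variable names mirroring the question and stand-in data:
#include <cstddef>
#include <fstream>
#include <vector>

int main() {
    std::ofstream outfile("data");
    std::vector<double> results_experiments(1000, 3.14);   // stand-in results
    std::size_t chunkSize = results_experiments.size();
    for (std::size_t i = 0; i < chunkSize; i++)
        outfile << results_experiments[i] << '\n';          // '\n' does not flush; endl would
    return 0;
}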
You might also try increasing the buffer size of your ofstream:
static char biggerbuffer[512000];
outfile.rdbuf()->pubsetbuf(biggerbuffer, sizeof biggerbuffer); // on some implementations this only takes effect before the first I/O
The availability of pubsetbuf may vary depending on your iostream implementation
I'm having trouble optimizing a C++ program for the fastest runtime possible.
The requirement is to output the absolute value of the difference of 2 long integers, fed into the program from a file, i.e.:
./myprogram < unkownfilenamefullofdata
The file name is unknown and has 2 numbers per line, separated by a space. There is an unknown amount of test data. I created 2 files of test data. One has the extreme cases and is 5 runs long. For the other, I used a Java program to generate 2,000,000 random numbers and output them to a timedrun file -- 18 MB worth of tests.
The massive file runs in 3.4 seconds. I need to get that down to 1.1 seconds.
This is my code:
#include <cstdio>

int main() {
    long int a, b;
    while (scanf("%li %li", &a, &b) == 2) {
        if (b >= a)
            printf("%li\n", b - a);
        else
            printf("%li\n", a - b);
    } // end while
    return 0;
} // end main
I ran Valgrind on my program, and it showed that a lot of the hold-up was in the read and write portions. How would I rewrite printf/scanf in the most raw form of C++, given that I know I'm only going to be receiving numbers? Is there a way I can read the numbers in as binary and manipulate the data with logical operations to compute the difference? I was also told to consider writing a buffer, but after ~6 hours of searching the web and attempting the code, I was unsuccessful.
Any help would be greatly appreciated.
What you need to do is load the whole string into memory, and then extract the numbers from there, rather than making repeated I/O calls. However, what you may well find is that it simply takes a lot of time to load 18MB off the hard drive.
You can improve greatly on scanf because you can guarantee the format of your file. Since you know exactly what the format is, you don't need as many error checks. Also, printf does a conversion on the new line to the appropriate line break for your platform.
I have used code similar to that found in this SPOJ forum post (see nosy's post half-way down the page) to obtain quite large speed-ups in the reading integers area. You will need to modify it to deal with negative numbers. Hopefully it will give you some ideas about how to write a faster printf function as well, but I would start with replacing scanf and see how far that gets you.
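For illustration, here is a hand-rolled reader in that spirit (a sketch, not the linked code): it parses long integers straight from getchar() and handles an optional leading minus sign:
#include <cstdio>

// Parse the next long integer straight from stdin; returns false at EOF.
static bool read_long(long &out) {
    int c = std::getchar();
    while (c == ' ' || c == '\n' || c == '\r' || c == '\t')   // skip whitespace
        c = std::getchar();
    if (c == EOF) return false;
    bool negative = false;
    if (c == '-') { negative = true; c = std::getchar(); }
    long value = 0;
    while (c >= '0' && c <= '9') {
        value = value * 10 + (c - '0');
        c = std::getchar();
    }
    out = negative ? -value : value;
    return true;
}

int main() {
    long a, b;
    while (read_long(a) && read_long(b))
        std::printf("%li\n", a >= b ? a - b : b - a);         // absolute difference
    return 0;
}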
As you suggest, the problem is reading all these numbers in and converting them from text to binary.
The best improvement would be to have whatever program generates the numbers write them out as binary. This will significantly reduce the amount of data that has to be read from the disk, and slightly reduce the time spent converting from text to binary.
You say that 2,000,000 numbers occupy 18MB = 9 bytes per number. This includes the spaces and the end of line markers, so sounds reasonable.
Storing the numbers as 4-byte integers will halve the amount of data that must be read from the disk. Along with the saving on format conversion, it would be reasonable to expect a doubling of performance.
Since you need even more, something more radical is required. You should consider splitting the data across separate files, each on its own disk, and then processing each file in its own process. If you have 4 cores, split the processing into 4 separate processes and connect 4 high-performance disks; then you might hope for another doubling of performance. The bottleneck is now the OS disk management, and it is impossible to guess how well the OS will manage the four disks in parallel.
I assume that this is a grossly simplified model of the processing you need to do. If your description is all there is to it, the real solution would be to do the subtraction in the program that writes the test files!
Even better than opening the file in your program and reading it all at once, would be memory-mapping it. ~18MB is no problem for the ~2GB address space available to your program.
Then use strtol to read each number and advance the pointer.
I'd expect a 5-10x speedup compared to input redirection and scanf.
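A POSIX sketch of that combination; the file name is a placeholder, and it assumes the file does not end exactly on a page boundary, so the zero-filled tail of the last mapped page acts as a terminator for strtol:
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("numbers.txt", O_RDONLY);       // placeholder file name
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    const char *cursor = static_cast<const char *>(p);
    const char *end = cursor + st.st_size;
    while (cursor < end) {
        char *next;
        long a = std::strtol(cursor, &next, 10);  // parse a number...
        if (next == cursor) break;                // ...or stop if none is left
        cursor = next;                            // advance the pointer past it
        long b = std::strtol(cursor, &next, 10);
        if (next == cursor) break;
        cursor = next;
        std::printf("%li\n", a >= b ? a - b : b - a);
    }
    munmap(p, st.st_size);
    close(fd);
    return 0;
}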
I was trying to solve a problem on InterviewStreet. After some time I determined that I was actually spending the bulk of my time reading the input. This particular question had a lot of input, so that makes some amount of sense. What doesn't make sense is why the varying methods of input had such different performance:
Initially I had:
std::string command;
std::cin >> command;
Replacing it with this made it noticeably faster:
char command[5];
cin.ignore();
cin.read(command, 5);
Rewriting everything to use scanf made it even faster:
char command;
scanf("get_%c", &command);
All told, I cut the time spent reading the input by about a third.
I'm wondering why there is such a variation in performance between these different methods. Additionally, I'm wondering why gprof didn't highlight the time I was spending in I/O, instead seeming to place the blame on my algorithm.
There is a big variation in these routines because console input speed almost never matters.
And where it does matter (the Unix shell), the code is written in C, reads directly from the stdin device, and is efficient.
At the risk of being downvoted: I/O streams are, in general, slower and bulkier than their C counterparts. That's not a reason to avoid them for many purposes, though, as they are safer (ever run into a scanf or printf bug? Not very pleasant) and more general (e.g. an overloaded insertion operator lets you output user-defined types). But I'd also say that's not a reason to use them dogmatically in very performance-critical code.
I do find the results a bit surprising, though. Of the three you listed, I would have suspected this to be the fastest:
char command[5];
cin.ignore();
cin.read(command, 5);
Reason: no memory allocations are needed, and it's a straightforward read into a character buffer. That's also true of your C example, but calling scanf repeatedly to read a single character isn't anywhere close to optimal either, even at the conceptual level, since scanf must parse the format string you passed in on every call. I'd be interested in the details of your I/O code, because there's a reasonable chance something else is going wrong if scanf calls that read a single character turn out to be the fastest. I have to ask, without meaning to offend: is the code truly compiled and linked with optimizations on?
Now as to your first example:
std::string command;
std::cin >> command;
We can expect this to be quite a bit slower than optimal, because you're working with a variable-sized container (std::string), which may have to make heap allocations to read in the desired buffer. When it comes to stack vs. heap, the stack is significantly faster, so if you can anticipate the maximum buffer size needed in a particular case, a simple character buffer on the stack will beat std::string for input (even if you use reserve). The same is true of an array on the stack as opposed to std::vector, but these containers are best used for cases where you cannot anticipate the size in advance. Where std::string can be faster is in cases where people would otherwise be tempted to call strlen repeatedly, since it stores and maintains its own size.
As to the details of gprof, it should be highlighting those issues. Are you looking at the full call graph as opposed to a flat profile? Naturally the flat profile could be misleading in this case. I'd have to know some further details on how you are using gprof to give a better answer.
gprof only samples during CPU time, not during blocked time.
So, a program may spend an hour doing I/O, and a microsecond doing computation, and gprof will only see the microsecond.
For some reason, this isn't well known.
By default, the standard iostreams are configured to work together and with the C stdio library; in practice this means that using cin and cout for things other than interactive input and output tends to be slow.
To get good performance using cin and cout, you need to disable the synchronization with stdio. For high performance input, you might even want to untie the streams.
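A minimal sketch (after these two calls, avoid mixing scanf/printf with cin/cout on the same streams):
#include <iostream>

int main() {
    std::ios::sync_with_stdio(false);   // stop synchronizing iostreams with C stdio
    std::cin.tie(nullptr);              // untie cin from cout: no flush before each read

    long value, sum = 0;
    while (std::cin >> value)           // placeholder workload
        sum += value;
    std::cout << sum << '\n';
    return 0;
}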
See the following stackoverflow question for more details.
How to get IOStream to perform better?