I was helping someone with a question about outputting in C, and I couldn't answer a seemingly simple question whose answer I wanted to use in my own answer:
What's the fastest way to output to a file in C / C++?
I've done a lot of work with prime number generation and mathematical algorithm optimization, using C++ and Java, and this was sometimes the biggest holdup for me - I sometimes need to move a lot of data to a file, fast.
Forgive me if this has been answered, but I've been looking on google and SO for some time to no avail.
I'm not expecting someone to do the work of benchmarking - but there are several ways to write to a file and I doubt I know them all.
So to summarize,
What ways are there to output to a file in C and C++?
And which of these is/are the faster ones?
Obviously redirecting from the console is terrible.
Any brief comparison of printf, cout, fputc, etc. would help.
Edit:
From the comments,
There's a great baseline test of cout and printf in:
mixing cout and printf for faster output
This is a great start, but not the best answer to what I'm asking.
For example, it doesn't cover std::ostreambuf_iterator<>, mentioned in the comments, if that's a possibility. Nor does it cover fputc, or mention console redirection (how bad is that in comparison? not that it needs to).
Edit 2:
Also, for the sake of arguing my historical case, you can assume a near-infinite amount of data being output (programs literally running for days on a newer Intel i7, producing gigabytes of text).
Temporary storage is only so helpful here - you can't easily buffer gigabytes of data, as far as I'm aware.
Functions such as fwrite, fprintf, etc. are in fact doing a write syscall under the hood. The only difference from calling write directly is that these functions use a buffer to reduce the number of syscalls.
So, if I had to choose between fwrite, fprintf and write, I would avoid fprintf because it's a nice but complicated function that does a lot of things. If I really needed something fast, I would reimplement the formatting part myself, down to the bare minimum required. Between fwrite and write, I would pick fwrite if I need to write a lot of small pieces of data; otherwise write could be faster because it doesn't require the whole buffering system.
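For illustration, a minimal sketch of that idea (the helper name and file name are mine, not from the answer): format an integer by hand and hand the bytes to fwrite, skipping fprintf's format-string machinery entirely.

#include <cstdio>
#include <initializer_list>

// Hypothetical helper: write one unsigned integer plus '\n', doing only
// the formatting actually needed instead of going through fprintf.
static void write_uint(FILE* f, unsigned long long v) {
    char buf[24];                     // plenty for a 64-bit value and '\n'
    char* p = buf + sizeof(buf);
    *--p = '\n';
    do {
        *--p = static_cast<char>('0' + v % 10);
        v /= 10;
    } while (v != 0);
    fwrite(p, 1, static_cast<size_t>(buf + sizeof(buf) - p), f);
}

int main() {
    FILE* f = fopen("primes.txt", "w");   // illustrative file name
    if (!f) return 1;
    for (unsigned long long n : {2ULL, 3ULL, 5ULL, 7ULL, 11ULL})
        write_uint(f, n);                 // fwrite batches these into few syscalls
    fclose(f);
}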
As far as I'm aware, the biggest bottleneck would be to write a character at a time (for example, using fputc). This is compared to building up a buffer in memory and dumping the whole lot (using fwrite). Experience has shown me that using fputc and writing individual characters is considerably slower.
This is probably because of hardware factors, rather than any one function being faster.
The bottleneck in performance of output is formatting the characters.
In embedded systems, I improved performance by formatting text into a buffer (an array of characters), then sending the entire buffer to output using block write commands such as cout.write or fwrite. These functions bypass formatting and pass the data almost straight through.
You may encounter buffering by the OS along the way.
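A sketch of that pattern (the buffer size, record format, and file name are my own choices): format records into a local array with snprintf, then push the whole block out with fwrite.

#include <cstdio>

int main() {
    FILE* out = fopen("results.txt", "w");    // illustrative file name
    if (!out) return 1;

    char buf[64 * 1024];                      // staging buffer for formatted text
    size_t used = 0;

    for (int i = 0; i < 1000000; ++i) {
        int n = snprintf(buf + used, sizeof(buf) - used, "%d\n", i);
        if (n < 0) break;                     // encoding error, bail out
        if (static_cast<size_t>(n) >= sizeof(buf) - used) {
            fwrite(buf, 1, used, out);        // buffer full: one block write
            used = 0;
            n = snprintf(buf, sizeof(buf), "%d\n", i);   // retry this record
        }
        used += static_cast<size_t>(n);
    }
    fwrite(buf, 1, used, out);                // flush whatever is left
    fclose(out);
}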
The bottleneck isn't due to the process of formatting the characters, but the multiple calls to the function.
If the text is constant, don't call the formatted output functions; write it directly:
static const char Message[] = "Hello there\n";
cout.write(&Message[0], sizeof(Message) - 1); // -1 because the '\0' doesn't need to be written
cout is actually slightly faster than printf because the operator<< overload for each type is resolved at compile time, so no format string needs to be parsed at run time - although the difference in speed is negligible. I think your real bottleneck isn't the call the language makes, but your hard drive's write rate. If you really want to go all the way with this, you could create a multi-threaded or networked solution that stores the data in a buffer and then slowly writes it to a hard drive, separate from the processing of the data.
Related
I'm reading multi-gigabyte files and processing them from stdin. I'm reading from stdin like this.
string line;
foreach (line1; stdin.byLine) {
    line = to!string(line1);
    ...
}
Is there a faster way to do this? I tried a threading approach with
auto childTid = spawn(&fn, thisTid);
string line;
foreach (line1; stdin.byLine) {
    line = to!string(line1);
    receiveOnly!(int);
    send(childTid, line);
}
int x = 0;
send(childTid, x);
That allows it to load at least one more line from disk while my process is running, at the cost of a copy operation, but this is still silly. What I need is fgets, or a way to combine stdio.byChunk(4096) with readline. I tried fgets:
char[] buf = new char[4096];
fgets(buf.ptr, 4096, stdio)
but it always fails, complaining that stdio is a file and not a stream. I'm not sure how to make it a stream. Any help would be appreciated, whichever approach you think best. I'm not very good at D, so apologies for any noob mistakes.
There are actually already two layers of buffering under the hood (excluding the hardware itself): the C runtime library and the kernel both do a layer of buffering to minimize I/O costs.
First, the kernel keeps data from disk in its own buffer and will look ahead, loading beyond what you request in a single call if you are following a predictable pattern. This mitigates the low-level costs associated with seeking the device, and the cache is shared across processes - if you read a file with one program and then again with a second, the second will probably get it from the kernel memory cache instead of the physical disk and may be noticeably faster.
Second, the C library, on which D's std.stdio is built, also keeps a buffer. readln ultimately calls C file I/O functions which read a chunk from the kernel at a time. (Fun fact: writes are also buffered by the C library too, line-buffered by default when interactive and block-buffered otherwise. Writing is quite slow, and doing it by chunk makes a big difference, but sometimes the C lib thinks a pipe isn't interactive when it is, which leads to a FAQ: Simple D program Output order is wrong.)
These C lib buffers also mitigate the costs of many small reads and writes by batching them up before even sending to the kernel. In the case of readln, it will likely read several kilobytes at once, even if you ask for just one line or one byte, and the rest stays in the buffer for next time.
So your readln loop is already going to be automatically buffered and should get decent I/O performance.
You might be able to do better yourself with a few techniques though. For instance, you may try std.mmfile for a memory-mapped file and read it as if it were an array, but your files are too big to map entirely in a 32-bit address space. It might work on 64-bit though. (Note that a memory-mapped file is NOT loaded all at once; it is just mapped to a memory address, and when you actually touch part of it, the operating system will load/save on demand.)
Or, of course, you can use the lower level operating system functions like write from import core.sys.posix.unistd or WriteFile from import core.sys.windows.windows, which will bypass the C lib's layers (but, of course, keep the kernel layers, which you want, don't try to bypass them.)
You can look for any win32 or posix system call C tutorials if you want to know more about using those functions. It is the same in D as in C, with minor caveats like the import instead of #include.
Once you load the chunk, you will want to scan it for the newline and slice it in all probability to form the range to pass to the loop or other algorithms. The std.range and std.algorithm modules also have searching, splitting, and chunking functions that might help, but you need to be careful with lines that span the edges of your buffers to keep correctness and efficiency.
But if your performance is good enough as it is, I'd say just leave it - the C lib+kernel's buffering do a pretty good job in most cases.
I have a piece of software that performs a set of experiments (C++).
Without storing the outcomes, all experiments take a little over a minute.
The total amount of data generated is equal to 2.5 Gbyte, which is too large to store in memory till the end of the experiment and write to file afterwards.
Therefore I write them in chunks.
for (int i = 0; i < chunkSize; i++) {
    outfile << results_experiments[i] << endl;
}
where
ofstream outfile("data");
and outfile is only closed at the end.
However, when I write them out in chunks of 4700 kbytes (actually 4700/chunkSize = the size of one results_experiments element), the experiments take about 50 times longer (over an hour...). This is unacceptable and makes my prior optimization attempts look rather silly, especially since these experiments need to be performed with many different parameter settings etc. (at least 100 times, but preferably more).
Concretely, my questions are:
What would be the ideal chunksize to write at?
Is there a more efficient way than (or something very inefficient in) the way I write data currently?
Basically: help me get the file I/O overhead as small as possible.
I think it should be possible to do this a lot faster, since copying (writing & reading!) the resulting file (same size) takes me under a minute.
The code should be fairly platform independent and not use any (non-standard) libraries. (I can provide separate versions for separate platforms with more complicated install instructions, but it is a hassle.)
If it is not feasible to get the total experiment time under 5 minutes without platform/library dependencies (and possibly with them), I will seriously consider introducing these. (The platform is Windows, but a trivial Linux port should at least be possible.)
Thank you for your effort.
For starters, not flushing the buffer for every chunk seems like a good idea (i.e. drop the endl, which forces a flush). It also seems possible to do the I/O asynchronously, as it is completely independent of the computation. You can also use mmap to improve the performance of file I/O.
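Concretely, a minimal change to the loop from the question (assuming the flush from endl is the culprit):

for (int i = 0; i < chunkSize; i++) {
    outfile << results_experiments[i] << '\n';  // '\n' does not flush; endl does
}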
If the output doesn't have to be human-readable, you could investigate a binary format. Storing data in binary format occupies less space than text and therefore needs less disk I/O. But there'll be little difference if the data is all strings, so if you write out as much as possible as numbers rather than formatted text, you could see a big gain.
However I'm not sure if/how this is done with STL iostreams. The C-style way is using fopen(..., "wb") and fwrite(&object, ...).
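For example, a minimal sketch of that C-style approach (the struct layout is illustrative only):

#include <cstdio>

struct Sample {
    int    key;
    double value;
};

int main() {
    Sample s = {42, 3.14};
    FILE* f = fopen("data.bin", "wb");    // "b" = binary mode
    if (!f) return 1;
    fwrite(&s, sizeof(Sample), 1, f);     // raw bytes, no text formatting
    fclose(f);
    // Caveat: the file is not portable across machines with different
    // endianness, padding, or type sizes.
}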
I think Boost.Serialization can do binary output using the << operator.
Also, can you reduce the amount you write? e.g. no formatting or redundant text, just the bare minimum.
Note that endl flushes the stream every time it is inserted; when writing to an ofstream, use '\n' instead if you don't need the flush.
You might also try increasing the buffer size of your ofstream
char *biggerbuffer = new char[512000];
outfile.rdbuf()->pubsetbuf(biggerbuffer, 512000); // set before any I/O on the stream, or it may have no effect
The effect of pubsetbuf may vary depending on your iostream implementation.
I was trying to solve a problem on InterviewStreet. After some time I determined that I was actually spending the bulk of my time reading the input. This particular question had a lot of input, so that makes some amount of sense. What doesn't make sense is why the various methods of input had such different performance:
Initially I had:
std::string command;
std::cin >> command;
Replacing it made it noticeably faster:
char command[5];
cin.ignore();
cin.read(command, 5);
Rewriting everything to use scanf made it even faster:
char command;
scanf("get_%c", &command);
All told, I cut the time spent reading the input down by about a third.
I'm wondering why there is such a variation in performance between these different methods. Additionally, I'm wondering why using gprof didn't highlight the time I was spending in I/O, instead seeming to point the blame at my algorithm.
There is a big variation in these routines because console input speed almost never matters.
And where it does (Unix shell) the code is written in C, reads directly from the stdin device and is efficient.
At the risk of being downvoted: I/O streams are, in general, slower and bulkier than their C counterparts. That's not a reason to avoid them for many purposes, as they are safer (ever run into a scanf or printf bug? Not very pleasant) and more general (e.g. an overloaded insertion operator lets you output user-defined types). But I'd also say that's not a reason to use them dogmatically in very performance-critical code.
I do find the results a bit surprising though. Out of the three you listed, I would have suspected this to be fastest:
char command[5];
cin.ignore();
cin.read(command, 5);
Reason: no memory allocations are needed, and it's a straightforward read into a character buffer. That's also true of your C example below, but calling scanf repeatedly to read a single character isn't anywhere close to optimal, even at the conceptual level, as scanf must parse the format string you passed in each time. I'd be interested in the details of your I/O code, as it seems there's a reasonable possibility of something wrong happening when scanf calls reading a single character turn out to be the fastest. I have to ask, without meaning to offend: is the code truly compiled and linked with optimizations on?
Now as to your first example:
std::string command;
std::cin >> command;
We can expect this to be quite a bit slower than optimal because you're working with a variable-sized container (std::string), which may have to make heap allocations to read in the desired buffer. When it comes to stack vs. heap, the stack is significantly faster, so if you can anticipate the maximum buffer size needed in a particular case, a simple character buffer on the stack will beat std::string for input (even if you use reserve). This is likewise true of an array on the stack as opposed to std::vector, but these containers are best used for cases where you cannot anticipate the size in advance. Where std::string can be faster is in cases where people might otherwise call strlen repeatedly; storing and maintaining a size is better.
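For illustration, the difference at the call site (the buffer size is arbitrary):

char command[64];                     // fixed-size stack buffer: no heap allocation
std::cin.getline(command, sizeof(command));

std::string s;                        // may allocate from the heap as it grows
std::cin >> s;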
As to the details of gprof, it should be highlighting those issues. Are you looking at the full call graph as opposed to a flat profile? Naturally the flat profile could be misleading in this case. I'd have to know some further details on how you are using gprof to give a better answer.
gprof only samples during CPU time, not during blocked time.
So, a program may spend an hour doing I/O, and a microsecond doing computation, and gprof will only see the microsecond.
For some reason, this isn't well known.
By default, the standard iostreams are configured to work together and with the C stdio library - in practice this means using cin and cout for things other than interactive input and output tends to be slow.
To get good performance using cin and cout, you need to disable the synchronization with stdio. For high performance input, you might even want to untie the streams.
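A typical setup at the top of main looks like this (a sketch of the usual idiom, not code from the linked question):

#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false);  // stop syncing cin/cout with C stdio
    std::cin.tie(nullptr);                  // untie: no cout flush before each cin read

    int n;
    while (std::cin >> n) {
        // ... process n ...
    }
}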
See the following stackoverflow question for more details.
How to get IOStream to perform better?
I am participating in some programming competitions, and on many problems there's the need to read strings from an input file. Obviously performance is a big issue on those competitions, and strings might be huge, so I am trying to understand the most efficient way to read those strings.
My guess is that reading the strings char by char, with getchar(), is as fast as you can go. That's because even if you use other functions, say fgets() or getline(), those functions still need to read every char anyway.
Update: I know that I/O won't be a bottleneck on most algorithmic problems. That being said I would still very much like to know what's the fastest way you can use to read strings, should this become an issue on any future problem.
You can use the std::istream::read() function to read a chunk of unformatted data. It is relatively fast precisely because the data is unformatted. All overloads of operator>> read formatted data, which makes reading from a stream slower compared to read().
Similarly, you can use std::ostream::write() function to write a chunk of data to output stream at once.
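A sketch of chunked, unformatted reading (the file name and chunk size are arbitrary):

#include <fstream>

int main() {
    std::ifstream in("input.txt", std::ios::binary);  // illustrative file name
    char buf[64 * 1024];                              // arbitrary chunk size
    while (in) {
        in.read(buf, sizeof(buf));                    // unformatted block read
        std::streamsize got = in.gcount();            // bytes actually obtained
        // ... scan buf[0 .. got) for lines/tokens here ...
        (void)got;
    }
}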
The reverse is true: reading larger chunks of data into memory in one go is far faster than reading one character at a time. The OS and/or hard drive will likely cache the data in any case, but the function-call overhead alone of repeatedly cycling through the standard library, OS, file system and device driver for each character is significant for large data sets.
When handling strings there are some more important performance issues you might consider: Back to Basics by Joel Spolsky
Either way, the most convincing way to answer the question for yourself is to write test code that investigates the difference between different I/O methods.
I am running simulation code that is largely bound by CPU speed. I am not interested in pushing data in/out to a user interface, simply saving it to disk as it is computed.
What would be the fastest solution that would reduce overhead? iostreams? printf? I have previously read that printf is faster. Will this depend on my code and is it impossible to get an answer without profiling?
This will be running in Windows and the output data needs to be in text format, tab/comma separated, with formatting/precision options for mostly floating point values.
Construct (large-ish) blocks of data which can be sequentially written and use asynchronous IO.
Accurate profiling will be painful; read some papers on the subject: scholar.google.com.
I haven't used them myself, but I've heard memory mapped files offer the best optimisation opportunities to the OS.
Edit: related question, and Wikipedia article on memory-mapped files - both mention performance benefits.
My thought is that you are tackling the wrong problem. Why are you writing out vast quantities of text-formatted data? If it is because you want it to be human readable, write a quick browser program that reads the data in binary format on the fly - that way the simulation application can quickly write out binary data and the browser can do the grunt work of formatting it as and when needed. If it is because you are using some stats package to read and analyse text data, write one that inputs binary data.
Scott Meyers' More Effective C++, Item 23 ("Consider alternative libraries"), suggests using stdio over iostreams if you prefer speed over safety and extensibility. It's worth checking.
The fastest way is whatever is fastest for your particular application running on its typical target OS and hardware. The only sensible thing to do is to try several approaches and time them. You probably don't need a complete profile, and the exercise should only take a few hours. I would test, in this order:
normal C++ stream I/O
normal stream I/O using ostream::write()
use of the C I/O library
use of system calls such as write()
asynch I/O
And I would stop when I found a solution that was fast enough.
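As a starting point, a rough timing harness for the first few candidates might look like this (the sizes, file names, and what is measured are my own assumptions; adapt it to your real output):

#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>

template <typename F>
static double time_it(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const int N = 1000000;                     // number of lines, arbitrary

    double t_stream = time_it([&] {            // normal C++ stream I/O
        std::ofstream out("bench_stream.txt");
        for (int i = 0; i < N; ++i) out << i << '\n';
    });

    double t_stdio = time_it([&] {             // C I/O library
        FILE* f = fopen("bench_stdio.txt", "w");
        for (int i = 0; i < N; ++i) fprintf(f, "%d\n", i);
        fclose(f);
    });

    std::cout << "iostream: " << t_stream << " s, stdio: " << t_stdio << " s\n";
}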
Text format means it's for human consumption. The speed at which humans can read is far, far lower than the speed of any reasonable output method. There's a contradiction somewhere, and I suspect it's the "output must be text format" requirement.
Therefore, I believe the correct approach is to output binary and provide a separate viewer to convert individual entries to readable text. Formatting in the viewer need only be as fast as people can read.
Mapping the file to memory (i.e. using a memory-mapped file) and then just memcpy'ing data there is a really fast way of reading/writing.
You can use several threads/cores to write to the data, and the OS/kernel will sync the pages to disk, using the same kind of routines used for virtual memory, which one can expect to be optimized to hell and back, more or less.
Chiefly, there should be few extra copies/buffers in memory when doing this. The writes are caught by interrupts and added to the disk queue once a page has been written.
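A minimal POSIX sketch of that idea (error handling trimmed; on Windows the equivalents are CreateFileMapping and MapViewOfFile):

#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const size_t size = 1 << 20;                     // 1 MiB, arbitrary
    int fd = open("out.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, size) != 0) return 1;          // file must have its final length

    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    std::memcpy(p, "hello", 5);   // writing the memory writes the file
    munmap(p, size);              // the kernel flushes dirty pages to disk
    close(fd);
}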
Open the file in binary mode, and write "unformatted" data to the disc:
fstream myFile;
...
myFile.open("mydata.bin", ios::in | ios::out | ios::binary);
...
class Data {
    int key;
    double value;
    char desc[10];
};
Data x;
myFile.seekp(location1);
myFile.write((char*)&x, sizeof(Data));
EDIT: The OP added the "output data needs to be in text format, tab or comma separated" constraint.
If your application is CPU bound, the formatting of output is overhead that you do not need. Binary data is much faster to write and read than ASCII, is smaller on the disc (there are simply fewer total bytes written), and, because it is smaller, is faster to move around a network (including a network-mounted file system). All indicators point to binary as a good overall optimization.
Viewing the binary data can be done after the run with a simple utility that dumps it to ASCII in whatever format is needed. I would encourage adding some version information to the resulting binary data so that changes in the format can be handled by the dump utility.
Moving from binary to ASCII, and then quibbling over the relative performance of printf versus iostreams, is likely not the best use of your time.
The fastest way is completion-based asynchronous IO.
By giving the OS a set of data to write, which it hasn't actually written when the call returns, the OS can reorder it to optimise write performance.
The API for doing this is OS-specific: on Linux it's called AIO; on Windows it's called I/O completion ports.
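A bare-bones POSIX AIO sketch (the file name and sizes are illustrative; Windows completion-port code looks quite different):

#include <aio.h>
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("out.bin", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;

    static char data[4096];               // must outlive the in-flight write
    std::memset(data, 'x', sizeof(data));

    aiocb cb{};                           // zero-initialized control block
    cb.aio_fildes = fd;
    cb.aio_buf    = data;
    cb.aio_nbytes = sizeof(data);
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) return 1;    // queues the write and returns at once

    // ... do other useful work here while the kernel writes ...

    while (aio_error(&cb) == EINPROGRESS) {}  // poll; aio_suspend() avoids spinning
    ssize_t written = aio_return(&cb);        // result of the completed write
    close(fd);
    return written == static_cast<ssize_t>(sizeof(data)) ? 0 : 1;
}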
A fast method is to use double buffering and multiple threads (at least two).
One thread is in charge of writing data to the hard drive. This task checks the buffer and, if it is not empty (or some other rule is met), begins writing to the hard drive.
The other thread writes formatted text to the buffer.
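A minimal sketch of that scheme with standard threads (the buffer size, chunk threshold, file name, and hand-off policy are my own simplifications):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

int main() {
    std::string buffers[2];   // double buffer: one filling, one being written
    int ready = -1;           // index of a buffer ready for the writer, or -1
    bool done = false;
    std::mutex m;
    std::condition_variable cv;

    // Writer thread: wait for a full buffer, write it out as one block.
    std::thread writer([&] {
        std::ofstream out("out.txt");   // illustrative file name
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return ready >= 0 || done; });
            if (ready < 0) break;       // done and nothing left to write
            std::string block = std::move(buffers[ready]);
            buffers[ready].clear();     // guarantee the buffer is empty for reuse
            ready = -1;
            cv.notify_one();            // producer may hand off again now
            lk.unlock();
            out.write(block.data(), static_cast<std::streamsize>(block.size()));
        }
    });

    // Hand a filled buffer to the writer, waiting if it is still busy.
    auto hand_off = [&](int idx) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return ready < 0; });
        ready = idx;
        cv.notify_one();
    };

    // Producer: format into one buffer while the other is being written.
    int fill = 0;
    for (int i = 0; i < 1000000; ++i) {
        buffers[fill] += std::to_string(i);
        buffers[fill] += '\n';
        if (buffers[fill].size() >= (1u << 16)) {  // 64 KiB chunks, arbitrary
            hand_off(fill);
            fill ^= 1;                             // switch to the other buffer
        }
    }
    if (!buffers[fill].empty()) hand_off(fill);    // flush the remainder

    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_one();
    writer.join();
}

While the writer thread is blocked in the write call, the producer keeps filling the other buffer, so formatting and disk I/O overlap.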
One performance issue with hard drives is the amount of time required to get up to speed and position the head at the correct location. To avoid this, the objective is to write to the hard drive continually so that it doesn't stop. This is tricky and may involve things outside your program's scope (such as other programs running at the same time). The larger the chunk of data written to the hard drive, the better.
Another thorn is finding empty slots on the hard drive for the data. A fragmented hard drive will be slower than a freshly formatted or defragmented one.
If portability is not an issue, you can check your OS for some APIs that perform block writes to the hard drive. Or you can go down lower and use the API that writes directly to the drive.
You may also want your program to change its priority so that it is one of the most important tasks running.