I have the following question: in parts of my software (mostly C++) I rely on reading precalculated binary data from a file to be used in a numerical simulation. These files can be quite big (32 GB and upwards).
At the moment I read in the data with fread and store/navigate some important file positions with fgetpos/fsetpos.
Mostly my software runs on HPC clusters, and at the moment I'm trying to implement a restart feature in case I run out of wall-clock time, so that I can resume my calculations. To that end I dump a few key parameters into a binary file, and I would also need to store the position of my file pointer before my code is stopped.
So I checked around the forum and I'm not quite sure what's the best solution for this.
I guess I can't just write the whole fpos_t struct to disk, as this can produce nonsense when I read it back in. And ftell is limited to 2 GB files if I'm correct?
Would ftello be a better option? Is it compatible with different compilers and OSes (Intel, Cray and so on)?
Thanks in advance
Related
I'm building an I/O framework for a custom file format, for Windows only. The files are anywhere between 100 MB and 100 GB. The reads/writes come in sequences of a few hundred KB to a couple of MB, at unpredictable locations. Read speed is most critical, though CPU use might trump that, since I hear fstream can put a real dent in it when working with SSDs.
Originally I was planning to use fstream, but as I read a bit more into file I/O, I discovered a bunch of other options. Since I have little experience with the topic, I'm torn as to which method to use. The options I've scoped out are fstream, FILE and memory-mapped files.
In my research so far, all I've found is a lot of contradictory benchmark results depending on chunk sizes, buffer sizes and other bottlenecks I don't understand. It would be helpful if somebody could clarify the advantages/drawbacks of those options.
In this case the bottleneck is mainly your hardware and not the library you use, since the blocks you are reading are relatively big (200 KB to 5 MB, compared to the sector size) and each block is read sequentially.
With hard disks (high seek time) it can make sense to read more data than needed to get more out of the cache. With SSDs I would not use big buffers but read only the exact required data, since seek time is not a big issue.
Memory-mapped files are convenient for completely random access to your data, especially in small chunks (even a few bytes). But it takes more code to set up a memory-mapped file. On 64-bit systems you could map the whole file (virtually) and let the OS caching system optimize the reads (multiple accesses to the same data). You can then return just a pointer to the required memory location, without even needing temporary buffers or memcpy. It would be very simple.
fstream gives you extra features compared to FILE which I think aren't of much use in your case.
I am developing an application which basically stores 2D matrices in memory and performs mathematical operations on them. When I benchmarked my software, I found that file reading and file saving were performing very badly. So I multi-threaded the file reading, and this resulted in a tremendous boost in performance. The reason for this boost may not be the I/O itself but rather the translation of string data from the file into doubles being distributed among threads.
Now I want to improve my file-saving performance. Simply speaking, it is not possible to multi-thread saving data to a single file. So what if the data is broken up into different files (one per core)? Is this the correct way to solve this problem? Also, how do I make all these files look like a single file in Windows Explorer, so as to hide this complexity from the user?
To summarize my comments:
generally, matrix computation is much slower than matrix printing.
on my Linux/Debian/Sid/x86-64 system (i3770K, GCC 4.8 with gcc -O2), a tiny C program looping a million times over for (long i=0; i<cnt; i++) printf("%.15f\n", log(1.0+i*sqrt((double)i))); takes (when redirecting stdout to /tmp/my.out) 0.79s user 0.02s system 99% cpu 0.810 total. The output contains a million numbers, totaling 18999214 bytes; so you might blame your file system, operating system, or library (perhaps using C <stdio.h> functions like printf might be a bit faster than the C++ << operator).
you could serialize your output in some binary format if you really wanted to, and provide e.g. a .dll plugin for Excel to read it; but I don't think it is worth your effort.
Notice that I updated my sample code to output 15 digits per double-precision number!
BTW, I suggest making your sequalator software free software, e.g. publishing its source code (under a GPLv3+ license, say) on some repository like GitHub. You could probably get more help if you published your source code (under a friendly free-software license).
You might consider switching to Linux (and using a recent GCC, e.g. 4.8); it is probably faster for such applications (but then, I agree that using Excel could be an issue; there are several free-software alternatives, e.g. gnumeric; scilab might also interest you).
BTW, nothing in my answer above refers to multi-threading, because you cannot easily multi-thread the output of a textual file.
"The reason for boost in this performance may not be due to I/O but rather due to the translation of string data from file into double being distributed among threads."
If this is the case, consider storing binary data instead of text. Given that you are dealing with 2D matrices, a useful format might be HDF5. You can then read and write at full speed, and it supports compression too, if you need that for even more disk-space savings. I doubt you'll need threads at all if you do this.
I need to read from a file very often, and I load the file into a vector of unsigned char using fread. Subsequent freads are really fast, even if the vector of unsigned char is destroyed right after reading.
It seems to me that something (Windows or the disk) caches the file, and thus the freads are very fast. I have not read anything about this behaviour, so I am unsure what really causes it.
If I don't use my application for 1 hour or so and then do an fread again, the fread is slow.
It seems to me that the cache got emptied.
Can somebody explain this behaviour to me? I would like to actively use it.
It is a problem for me when the freads are slow.
Memory-mapping the file works in theory, but the file itself is too big, so I cannot use it.
90/10 law
90% of the execution time of a computer program is spent executing 10% of the code
It is not a strict rule, but it usually holds, so many programs try to keep recently used data around, because it is very likely that the same data will be accessed again soon.
The Windows OS is no exception: after receiving a command to read a file, it keeps some data about that file. It stores in memory the addresses of the pages where the file's data is cached, and if possible it even keeps part (or all) of the binary data in memory, which makes the next read of the file much faster if it happens soon after the first one.
All in all, you are right that there is caching, but I can't say exactly what is going on, as I don't work at Microsoft...
Also, answering the next part of the question: mapping the file into memory may be a solution, but if the file is very large the machine may not have that much memory, so it wouldn't be an option. However, you can use the 90/10 law: map just the most important part of the file into memory, and while reading, build a table of the overall parameters.
I don't know your exact situation, but it may help.
I have a program written in C++ that opens a binary file (test.bin), reads it object by object, and puts each object into a new file (it opens the new file, writes into it (append), and closes it).
I use fopen/fclose, fread and fwrite.
test.bin contains 20,000 objects.
This program runs under Linux with g++ in 1 second, but with VS2008 in debug/release mode it takes 1 minute!
There are reasons why I don't do them in batches or don't keep them in memory or any other kind of optimizations.
I just wonder why it is so much slower under Windows.
Thanks,
I believe that when you close a file in Windows, it flushes the contents to disk each time. In Linux, I don't think that is the case. The flush on each operation would be very expensive.
Unfortunately file access on Windows isn't renowned for its brilliant speed, particularly if you're opening lots of files and only reading and writing small amounts of data. For better results, the (not particularly helpful) solution would be to read large amounts of data from a small number of files. (Or switch to Linux entirely for this program?!)
Other random suggestions to try:
turn off the virus checker if you have one (I've got Kaspersky on my PC, and writing 20,000 files quickly drove it bananas)
use an NTFS disk if you have one (FAT32 will be even worse)
make sure you're not accidentally using text mode with fopen (easily done)
use setvbuf to increase the buffer size for each FILE
try CreateFile/ReadFile/etc. instead of fopen and friends, which won't solve your problem but may shave a few seconds off the running time (since the stdio functions do a bit of extra work that you probably don't need)
I think it is not a matter of VS2008. It is a matter of the differences between the Linux and Windows file systems, and of how C++ works with files on each system.
I'm seeing a lot of guessing here.
You're running under VS2008 IDE. You can always use the "poor man's profiler" and find out exactly what's going on.
During that minute, hit the "pause" button and look at what it's doing, including the call stack. Do this several times. Every single pause is almost certain (probability 59/60) to catch it doing precisely what it doesn't do under Linux.
I am running simulation code that is largely bound by CPU speed. I am not interested in pushing data in/out to a user interface, simply saving it to disk as it is computed.
What would be the fastest solution that would reduce overhead? iostreams? printf? I have previously read that printf is faster. Will this depend on my code and is it impossible to get an answer without profiling?
This will be running in Windows and the output data needs to be in text format, tab/comma separated, with formatting/precision options for mostly floating point values.
Construct (large-ish) blocks of data which can be sequentially written and use asynchronous IO.
Profiling accurately will be painful; read some papers on the subject: scholar.google.com.
I haven't used them myself, but I've heard memory mapped files offer the best optimisation opportunities to the OS.
Edit: a related question and the Wikipedia article on memory-mapped files both mention performance benefits.
My thought is that you are tackling the wrong problem. Why are you writing out vast quantities of text-formatted data? If it is because you want it to be human-readable, write a quick browser program that reads the binary data and formats it on the fly; this way the simulation application can quickly write out binary data, and the browser can do the grunt work of formatting the data as and when needed. If it is because you are using some stats package to read and analyse text data, then write one that inputs binary data.
Scott Meyers' More Effective C++ point 23 "Consider alternate libraries" suggests using stdio over iostream if you prefer speed over safety and extensibility. It's worth checking.
The fastest way is whatever is fastest for your particular application running on its typical target OS and hardware. The only sensible thing to do is to try several approaches and time them. You probably don't need a complete profile, and the exercise should only take a few hours. I would test, in this order:
normal C++ stream I/O
normal stream I/O using ostream::write()
use of the C I/O library
use of system calls such as write()
asynch I/O
And I would stop when I found a solution that was fast enough.
Text format means it's for human consumption. The speed at which humans can read is far, far lower than the speed of any reasonable output method. There's a contradiction somewhere, and I suspect it is the "output must be text format" requirement.
Therefore, I believe the correct way is to output binary and provide a separate viewer that converts individual entries to readable text. Formatting in the viewer need only be as fast as people can read.
Mapping the file to memory (i.e. using a memory-mapped file) and then just memcpy-ing data there is a really fast way of reading/writing.
You can use several threads/cores to write to the data, and the OS/kernel will sync the pages to disk, using the same kind of routines used for virtual memory, which one can expect to be optimized to hell and back, more or less.
Chiefly, there should be few extra copies/buffers in memory when doing this. Once a page has been written to, the kernel notices and queues it for write-back to disk.
Open the file in binary mode, and write "unformatted" data to the disc.
fstream myFile;
...
myFile.open ("mydata.bin", ios::in | ios::out | ios::binary);
...
class Data {
public:
    int key;
    double value;
    char desc[10];
};
Data x;
myFile.seekp (location1);
myFile.write ((char*)&x, sizeof (Data));
EDIT: The OP added the "Output data needs to be in text format, whether tab or comma separated." constraint.
If your application is CPU bound, the formatting of output is an overhead that you do not need. Binary data is much faster to write and read than ascii, is smaller on the disc (e.g. there are fewer total bytes written with binary than with ascii), and because it is smaller it is faster to move around a network (including a network mounted file system). All indicators point to binary as a good overall optimization.
Viewing the binary data can be done after the run with a simple utility that will dump the data to ascii in whatever format is needed. I would encourage some version information be added to the resulting binary data to ensure that changes in the format of the data can be handled in the dump utility.
Moving from binary to ascii, and then quibbling over the relative performance of printf versus iostreams is likely not the best use of your time.
The fastest way is completion-based asynchronous I/O.
By giving the OS a set of data to write, which it hasn't actually written when the call returns, the OS can reorder the writes to optimise write performance.
The API for this is OS-specific: on Linux it's called AIO; on Windows it's I/O completion ports.
A fast method is to use double buffering and multiple threads (at least two).
One thread is in charge of writing data to the hard drive. It checks the buffer, and if the buffer is not empty (or some other rule is met), it begins writing to the hard drive.
The other thread writes formatted text to the buffer.
One performance issue with hard drives is the amount of time required to get up to speed and position the head to the correct location. To avoid this from happening, the objective is to continually write to the hard drive so that it doesn't stop. This is tricky and may involve stuff outside of your program's scope (such as other programs running at the same time). The larger the chunk of data written to the hard drive, the better.
Another thorn is finding empty slots on the hard drive to put the data. A fragmented hard drive would be slower than a formatted or defragmented drive.
If portability is not an issue, you can check your OS for some APIs that perform block writes to the hard drive. Or you can go down lower and use the API that writes directly to the drive.
You may also want your program to change its priority so that it is one of the most important tasks running.