I am developing an application that basically stores 2D matrices in memory and performs mathematical operations on them. When I benchmarked my software I found that file reading and file saving were performing very badly. So I multi-threaded file reading, and this resulted in a tremendous boost in performance. The boost may not be due to I/O itself but rather to the translation of string data from the file into doubles being distributed among threads.
Now I want to improve my file-saving performance. Simply put, it is not possible to multi-thread saving data to a single file. So what if the data were broken up into several files (one per core)? Is this the correct way to solve this problem? Also, how do I make all these files appear as a single file in Windows Explorer, so as to hide this complexity from the user?
To summarize my comments:
generally, matrix computations are much slower than matrix printing.
on my Linux/Debian/Sid/x86-64 system (i3770K, GCC 4.8 with gcc -O2), a tiny C program looping a million times over printf("%.15f\n", log(1.0+i*sqrt((double)i))) (shown in full after this list) takes, when redirecting stdout to /tmp/my.out, 0.79s user, 0.02s system, 99% CPU, 0.810s total. The output contains a million numbers, totaling 18999214 bytes; so you might blame your file system, operating system, or library (perhaps using C <stdio.h> functions like printf might be a bit faster than the C++ operator <<).
you could serialize your output in some binary format if you really wanted to, and provide e.g. a .dll plugin for Excel to read it; but I don't think it is worth your effort.
Notice that I updated my sample code to output 15 digits per double-precision number!
BTW, I suggest you make your sequalator software free software, e.g. publish its source code (e.g. under a GPLv3+ license) on some repository like GitHub. You would probably get more help if you published your source code under a friendly free-software license.
You might consider switching to Linux (and using a recent GCC, e.g. 4.8); it is probably faster for such applications. I agree that using Excel could then be an issue; there are several free-software alternatives, e.g. Gnumeric, and Scilab might also interest you.
BTW, nothing in my answer above refers to multi-threading, because you cannot easily multi-thread the output of a textual file.
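Here is that micro-benchmark in full; the loop is verbatim from my comment above, and the surrounding main and the iteration count are the obvious reconstruction:
#include <stdio.h>
#include <math.h>

int main(void) {
    long cnt = 1000000;               /* one million iterations, as timed above */
    for (long i = 0; i < cnt; i++)
        printf("%.15f\n", log(1.0 + i * sqrt((double)i)));
    return 0;
}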
The boost may not be due to I/O itself but rather to the translation of string data from the file into doubles being distributed among threads.
If this is the case, consider storing binary data instead of text. Given that you are dealing with 2D matrices, a useful format might be HDF5. You can then read and write at full speed, and it supports compression too if you need even more disk-space savings. I doubt you'll need threads at all if you do this.
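For illustration, a minimal sketch of writing a matrix with the HDF5 C API (usable from C++; the file name and dataset name here are made up, and error checking is omitted):
#include <hdf5.h>

void save_matrix(const double* data, hsize_t rows, hsize_t cols) {
    // Create the file, a 2D dataspace, and a double-precision dataset, then write in one call.
    hid_t file = H5Fcreate("matrix.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims[2] = {rows, cols};
    hid_t space = H5Screate_simple(2, dims, nullptr);
    hid_t dset = H5Dcreate2(file, "/matrix", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
}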
I have the following question: in parts of my software (mostly C++) I rely on reading precalculated binary data from a file for use in a numerical simulation. These files can be quite big (32GB and upwards).
At the moment I read the data with fread and store/navigate to some important file positions with fgetpos/fsetpos.
My software mostly runs on HPC clusters, and at the moment I'm trying to implement a restart feature in case I run out of wall-clock time, so that I can resume my calculations. To that end I dump a few key parameters into a binary file, and I would also need to store the position of my file pointer before my code is aborted.
So I checked around the forum, and I'm not quite sure what the best solution for this is.
I guess I can't just write the whole fpos_t struct to disk, as this can produce nonsense when I read it back in. ftell is limited to 2GB files, if I'm correct?
Would ftello be a better option? Is it compatible across different compilers and OSes (Intel, Cray and so on)?
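For illustration, here is the kind of checkpointing I have in mind (a minimal sketch, assuming POSIX ftello/fseeko with a 64-bit off_t, i.e. compiling with -D_FILE_OFFSET_BITS=64 on 32-bit systems; the function names are mine):
#include <stdio.h>
#include <stdint.h>

bool save_position(FILE* data, FILE* checkpoint) {
    off_t pos = ftello(data);            // 64-bit on LP64, or with _FILE_OFFSET_BITS=64
    if (pos < 0) return false;
    int64_t fixed = (int64_t)pos;        // store a fixed-width integer, not fpos_t/off_t
    return fwrite(&fixed, sizeof fixed, 1, checkpoint) == 1;
}

bool restore_position(FILE* data, FILE* checkpoint) {
    int64_t fixed;
    if (fread(&fixed, sizeof fixed, 1, checkpoint) != 1) return false;
    return fseeko(data, (off_t)fixed, SEEK_SET) == 0;
}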
Thanks in advance
Is there a performance counter available for code written in the Halide language? I would like to know how many loads, stores, and ALU operations are performed by my code.
The Halide tutorial for scheduling multi-stage pipelines compares different schedules by reporting the amount of allocated memory, loads, stores, and calls to Halide Funcs, but I don't see how this information was collected. I suppose it might be possible to use trace_stores, trace_loads, and trace_realizations to print to the console every time one of these operations occurs. That isn't a great option, though, because it would greatly slow down the program's execution and would require some kind of counting script to compile the long list of console output into the desired counts of loads, stores, and ALU operations.
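For reference, enabling that tracing is a one-line scheduling call; a minimal sketch (the Func and sizes are made up, and realize's exact signature varies across Halide versions):
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y");
    Func f("f");
    f(x, y) = x + y;
    f.trace_stores();                       // emit one trace event per store to f
    Buffer<int> out = f.realize({64, 64});  // newer Halide; older versions use f.realize(64, 64)
}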
I'm pretty sure they just used the trace_xxx output and ran some scripts/programs on it.
If you're looking for real performance numbers on a X86 platform, I would go with Intel VTune Amplifier. It's pretty expensive, but may be free if you're in academia (student, teacher, researcher) or it's for an open source project.
Other than that, look at the lowered statement code by setting HL_DEBUG_CODEGEN=1 in the environment and you can get a better idea of the loop structure and data use. Note that this output goes to stderr, not stdout.
EDIT: For Linux, there's perf.
We do not have any perf-counter-based support at present. It is fairly difficult to make that portable. (And on mobile devices, the OS often simply doesn't allow access to the hardware.) The support in Profiling.cpp and src/profiling.cpp could likely be used to drive perf counter operation. The profiling lowering pass adds code that calls routines in the runtime which update information about Func and Pipeline execution. This information is collected and aggregated by another thread.
If tracing is run to a file (e.g. using HL_TRACE_FILE) a binary format is used and it is a bit more efficient. See utils/HalideTraceViz for a tool to work with the binary format. This is generally how analyses are done within the team.
There was a small amount of investigation of OProfile, which looked promising, but I don't think we ever got the code working.
I am using the STL fstream utilities to read from a file. However, what I would like to do is read a specified number of bytes and then seek back some bytes and read again from that position. So it is sort of an overlapped read. In code, this would look as follows:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
size_t read_num = 0;
size_t window_size = 200;   // overlap between consecutive reads
while (read_num < total_num)
{
    char buffer[1024];
    fileStream.read(buffer, sizeof(buffer));
    size_t num_bytes_read = fileStream.gcount();  // read() returns the stream, not a count
    read_num += num_bytes_read - window_size;     // step back by the overlap
    fileStream.seekg(read_num);
}
This is not the only way to solve my problem, but it would make multi-tasking a breeze (I have been looking at other data structures, like circular buffers, but those would make multi-tasking difficult). I was wondering if I could have your input on how much of a performance hit these seek operations might incur when processing very large files. I will only ever use one thread to read the data from the file.
The files contain long sequences of text: only characters from the set {A,D,C,G,F,T}. Would it also be advisable to open the file as a binary file rather than in text mode, as I am doing now?
Because the file is large, I am also reading it in chunks, with the chunk size set to a 32 MB block. Would this be too large to take advantage of any caching mechanism?
On POSIX systems (notably Linux, and probably MacOSX), the C++ streams are built on lower-level primitives (often system calls) such as read(2) and write(2). The implementation buffers the data (the standard C++ library typically calls read(2) on buffers of several kilobytes), and the kernel generally keeps recently accessed pages in its page cache. Hence, practically speaking, most not-too-big files (e.g. files of a few hundred megabytes on a laptop with several gigabytes of RAM) stay in RAM for a while once they have been read or written. See also sync(2).
As commented by Hans Passant, reading in the middle of a textual file could be error-prone (in particular, because a UTF-8 character may span several bytes) if not done very carefully.
Notice that from a C (fopen) or C++ point of view, textual files and binary files differ notably in how they handle end-of-line.
If performance matters a lot to you, you could directly use low-level system calls like read(2), write(2), and lseek(2), but then be careful to use wide enough buffers (typically several kilobytes, e.g. 4Kbytes to 512Kbytes, or even several megabytes). Don't forget to use the returned byte count (IO operations can be partial, or fail, etc.), and avoid, if possible (for performance reasons), repeatedly read(2)-ing only a dozen bytes at a time. You could instead memory-map the file (or a segment of it) using mmap(2); before mmap-ing, use stat(2) to get metadata, notably the file size. And you could give advice to the kernel using posix_fadvise(2) or, for files mapped into virtual memory, madvise(2). Performance details are heavily system-dependent (file system, hardware (SSDs and hard disks are different!), system load).
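As a concrete illustration of the mmap route (a minimal sketch, POSIX only, error handling omitted):
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int fd = open("file.txt", O_RDONLY);
struct stat st;
fstat(fd, &st);                                   // fstat(2): get the file size for mmap
void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
madvise(p, st.st_size, MADV_SEQUENTIAL);          // hint the kernel: sequential scan
const char* data = static_cast<const char*>(p);
// ... scan data[0 .. st.st_size - 1], with whatever overlap scheme you need ...
munmap(p, st.st_size);
close(fd);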
Finally, you could consider using some higher-level library for binary files, such as indexed files à la GDBM or the sqlite library, or consider using real databases such as PostgreSQL, MongoDB, etc.
Apparently, your files contain genomics information. You probably don't care about end-of-line processing and could open them as binary streams (or directly as low-level Unix file descriptors); there may already be free software libraries to parse them. Otherwise, you might consider a two-pass approach: a first pass reads the entire file sequentially and remembers (in C++ containers like std::map) the interesting parts and their offsets; a second pass then uses direct access. You might even have some preprocessor convert your genomics files into SQLite or GDBM files and have your application work on those.
On a 64-bit system, if you handle only a few files (not thousands at once) of several dozen gigabytes each, memory-mapping them with mmap should make sense; then use madvise. (On a 32-bit system, you won't be able to mmap an entire file that large.)
Plausibly, yes. Whenever you seek, the buffered file data for that stream is (likely to be) discarded, causing the extra overhead of, at least, a system call to fetch the data again.
Assuming the file isn't enormous, it MAY be a better choice to read the entire file into memory (or, if you don't need portability, use a memory mapped file, at which point caching of the file content is trivial - again assuming the entire file fits in (virtual) memory).
However, all this is implementation dependent, so measuring performance of each method would be essential - it's only possible to KNOW these things for a given system by measuring, it's not something you can read about and get precise information on the internet (not even here on SO), because there are a whole bunch of factors that affect the behaviour.
I have a 1GB binary file on another system.
Requirement: ftp/download and convert binary to CSV on main system.
The converted file will be several times larger, around 8GB.
What is the most common way of doing something similar to this?
Should this be a two step independent process, download - then convert?
Should I download small chunks at a time and convert while downloading?
I don't know the most efficient way to do this... also, what should I be cautious of with files this size?
Any advice is appreciated.
Thank You.
(Visual Studio C++)
I would write a program that converts the binary format and outputs to CSV format. This program would read from stdin and write to stdout.
Then I would call
wget URL_of_remote_binary_file --output-document=- | my_converter_program > output_file.csv
That way you can start converting immediately (without downloading the entire file first), and your program doesn't deal with networking. You could also run the converter program on the remote side, assuming it's portable enough.
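A sketch of what such a converter's inner loop might look like (the record layout here is entirely hypothetical; on Windows, stdin must be switched to binary mode first):
#include <cstdio>
#include <cstdint>
#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#endif

int main() {
#ifdef _WIN32
    _setmode(_fileno(stdin), _O_BINARY);           // Windows opens stdin in text mode by default
#endif
    struct Record { int32_t id; double value; };   // hypothetical binary record layout
    Record r;
    while (std::fread(&r, sizeof r, 1, stdin) == 1)
        std::printf("%d,%.15g\n", r.id, r.value);  // one CSV line per record, to stdout
    return std::feof(stdin) ? 0 : 1;
}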
Without knowing any specifics, I would go with a binary FTP download and then post-process with a separate conversion program. This breaks the process into two distinct, independent parts, which aids building and debugging the overall system. There is no need to reinvent an FTP system, and there is lots of potential to optimize the post-processing.
To reduce traffic, I would first compress the file and then transfer it. If something goes wrong during conversion, or you want a different output, the conversion can then be redone locally without re-fetching the data.
The only precaution is not to load the whole thing into memory and then convert, but to do it chunk-wise, as you said. You can prevent some unpleasant surprises for users of your program by pre-allocating a file of the maximum expected size, to avoid running out of disk space during the conversion phase. Also, some filesystems do not like files bigger than 2GB or 4GB; that would also be caught by the pre-allocation trick.
It depends on your data and your requirements. What performance requirements do you have? Do you need to finish such a task in X amount of time (where speed is critical), or is this something that will just be done periodically (in which case speed is not essential)?
That said, you will certainly get a cleaner implementation if you separate the work out into two tasks - a downloader and a converter. That way each component can be simple and just focus on the task at hand. All things being equal, I recommend this approach.
Otherwise if you try to download/convert at the same time you may get into situations where your downloader has data ready, but the converter needs more data before it can proceed. Again, there is no reason why your code cannot handle this, but it will make the implementation more complicated and that much more difficult to debug / test / validate.
It's usually better to do it as separate processes with no interdependency. If your requirements change in the future you can reuse the pieces, or use them for other projects.
Here are even more guesses about your requirements and possible solutions:
Concerned about file integrity? Implement something that includes integrity checks such as sequence numbers, size fields, and checksums/hashes, plus just enough transaction semantics so that the system knows whether a transfer completed or didn't (see the sketch after this list).
Are uploads happening on slow/congested links, and may they be interrupted? Implement a protocol that allows the transfer to resume after interruption.
Are uploads recurring, with much of the data unchanged? Implement something amenable to incremental update, so you upload only the differences.
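For the integrity-check point, a hypothetical per-chunk header might look like this (the field choices are illustrative, not from the original answer):
#include <cstdint>

struct ChunkHeader {
    uint64_t sequence;   // detects loss and reordering
    uint32_t size;       // payload bytes that follow this header
    uint32_t crc32;      // checksum of the payload
};
// A trailing "commit" record (e.g. total size plus a whole-file hash) lets the
// receiver decide whether the transfer completed.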
I am running simulation code that is largely bound by CPU speed. I am not interested in pushing data in/out to a user interface, simply saving it to disk as it is computed.
What would be the fastest solution that would reduce overhead? iostreams? printf? I have previously read that printf is faster. Will this depend on my code and is it impossible to get an answer without profiling?
This will be running in Windows and the output data needs to be in text format, tab/comma separated, with formatting/precision options for mostly floating point values.
Construct (large-ish) blocks of data which can be sequentially written and use asynchronous IO.
Profiling accurately will be painful; read some papers on the subject: scholar.google.com.
I haven't used them myself, but I've heard memory mapped files offer the best optimisation opportunities to the OS.
Edit: related question, and Wikipedia article on memory mapped files — both mention performance benefits.
My thought is that you are tackling the wrong problem. Why are you writing out vast quantities of text-formatted data? If it is because you want it to be human-readable, write a quick browser program that reads the data in binary format on the fly; this way the simulation application can quickly write out binary data, and the browser can do the grunt work of formatting the data as and when needed. If it is because you are using some stats package to read and analyse text data, then write one that inputs binary data.
Scott Meyers' More Effective C++, Item 23 ("Consider alternative libraries"), suggests using stdio over iostreams if you prefer speed over safety and extensibility. It's worth checking.
The fastest way is whatever is fastest for your particular application running on its typical target OS and hardware. The only sensible thing to do is to try several approaches and time them. You probably don't need a complete profile, and the exercise should only take a few hours. I would test, in this order:
normal C++ stream I/O
normal stream I/O using ostream::write()
use of the C I/O library
use of system calls such as write()
asynch I/O
And I would stop when I found a solution that was fast enough.
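As a minimal sketch of the kind of quick timing meant above (using std::chrono; the loop body and iteration count are arbitrary stand-ins for your real output code):
#include <chrono>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 1000000; ++i)          // stand-in for one candidate output method
        std::printf("%.15f\t%.15f\n", i * 0.5, i * 0.25);
    auto t1 = std::chrono::steady_clock::now();
    std::fprintf(stderr, "elapsed: %.3f s\n",   // report on stderr so stdout stays clean
                 std::chrono::duration<double>(t1 - t0).count());
}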
Text format means it's for human consumption. The speed at which humans can read is far, far lower than the speed of any reasonable output method. There's a contradiction somewhere; I suspect the "output must be text format" requirement.
Therefore, I believe the correct way is to output binary and provide a separate viewer to convert individual entries to readable text. Formatting in the viewer need only be as fast as people can read.
Mapping the file to memory (i.e. using a memory-mapped file) and then just memcpy-ing data into it is a really fast way of reading/writing.
You can use several threads/cores to write the data, and the OS/kernel will sync the pages to disk using the same kind of routines used for virtual memory, which one can expect to be optimized to hell and back, more or less.
Chiefly, there should be few extra copies/buffers in memory when doing this. Writes to the mapping are caught via page faults, and dirty pages are queued for writeback to disk by the kernel.
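A sketch of the write side (POSIX calls shown, error handling omitted; on Windows the equivalents are CreateFileMapping/MapViewOfFile):
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int fd = open("out.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
size_t size = 1 << 20;                     // pick the final file size up front
ftruncate(fd, size);                       // the file must be sized before mapping
char* p = static_cast<char*>(
    mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
std::memcpy(p, "hello", 5);                // ordinary memory writes become file writes
msync(p, size, MS_ASYNC);                  // ask the kernel to flush pages in the background
munmap(p, size);
close(fd);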
Open the file in binary mode, and write "unformatted" data to the disc.
fstream myFile;
...
myFile.open("mydata.bin", ios::in | ios::out | ios::binary);
...
struct Data {        // was "class", whose members default to private
    int key;
    double value;
    char desc[10];   // was "char[10] desc;", which is not valid C++
};
Data x;
myFile.seekp(location1);                 // location1: some byte offset into the file
myFile.write((char*)&x, sizeof(Data));
EDIT: The OP added the "Output data needs to be in text format, whether tab or comma separated." constraint.
If your application is CPU-bound, the formatting of output is overhead that you do not need. Binary data is much faster to write and read than ASCII, is smaller on the disc (fewer total bytes written), and because it is smaller it is faster to move around a network (including a network-mounted file system). All indicators point to binary as a good overall optimization.
Viewing the binary data can be done after the run with a simple utility that dumps it to ASCII in whatever format is needed. I would encourage adding some version information to the binary data so that changes in its format can be handled by the dump utility.
Moving from binary to ASCII, and then quibbling over the relative performance of printf versus iostreams, is likely not the best use of your time.
The fastest way is completion-based asynchronous IO.
By giving the OS a set of data to write, which it hasn't actually written when the call returns, the OS can reorder it to optimise write performance.
The API for doing this is OS-specific: on Linux it's called AIO; on Windows it's called I/O Completion Ports.
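A sketch with POSIX AIO (<aio.h>), one completion-style API available on Linux; Windows would instead use WriteFile with an OVERLAPPED structure and a completion port. Error handling is omitted:
#include <aio.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    static char block[1 << 20];              // one large, sequentially written block

    aiocb cb{};
    cb.aio_fildes = fd;
    cb.aio_buf    = block;
    cb.aio_nbytes = sizeof block;
    cb.aio_offset = 0;
    aio_write(&cb);                          // returns immediately; the write proceeds in the background

    // ... fill the next block while this one is in flight ...

    const aiocb* list[1] = {&cb};
    aio_suspend(list, 1, nullptr);           // block until the write completes
    aio_return(&cb);                         // collect the result (bytes written or error)
    close(fd);
}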
A fast method is to use double buffering and multiple threads (at least two).
One thread is in charge of writing data to the hard drive. This task checks the buffer and, if it is not empty (or per some other rule), begins writing to the hard drive.
The other thread writes formatted text to the buffer.
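Here is a minimal sketch of that scheme (a sketch, not a full implementation; the 1 MB hand-off threshold, the file name, and the sample data are arbitrary choices):
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

int main() {
    std::ofstream out("results.txt");
    std::string pending;               // full buffer handed off to the writer thread
    bool done = false;
    std::mutex m;
    std::condition_variable cv;

    std::thread writer([&] {
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [&] { return done || !pending.empty(); });
            if (pending.empty()) break;          // done and fully drained
            std::string chunk;
            chunk.swap(pending);                 // take the full buffer
            cv.notify_one();                     // producer may refill immediately
            lk.unlock();
            out << chunk;                        // slow disk write happens outside the lock
            lk.lock();
        }
    });

    std::string buf;                   // buffer the producer formats into
    for (long i = 0; i < 1000000; ++i) {
        buf += std::to_string(i * 0.5) + '\t' + std::to_string(i * 0.25) + '\n';
        if (buf.size() >= (1u << 20)) {          // hand off roughly 1 MB at a time
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return pending.empty(); });
            pending.swap(buf);                   // buf is now empty and reusable
            cv.notify_one();
        }
    }
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return pending.empty(); });
        pending.swap(buf);                       // flush whatever is left
        done = true;
    }
    cv.notify_one();
    writer.join();
}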
One performance issue with hard drives is the time required to get up to speed and position the head at the correct location. To avoid this, the objective is to write to the hard drive continually so that it doesn't stop. This is tricky and may involve things outside your program's scope (such as other programs running at the same time). The larger the chunk of data written to the hard drive, the better.
Another thorn is finding empty slots on the hard drive in which to put the data. A fragmented hard drive would be slower than a freshly formatted or defragmented drive.
If portability is not an issue, you can check your OS for some APIs that perform block writes to the hard drive. Or you can go down lower and use the API that writes directly to the drive.
You may also want your program to change its priority so that it is one of the most important tasks running.