I'm reading multi-gigabyte files from stdin and processing them. I'm currently reading like this:
import std.conv : to;
import std.stdio;

string line;
foreach (line1; stdin.byLine) {
    line = to!string(line1);
    ...
}
Is there a faster way to do this? I tried a threading approach with
import std.concurrency;

auto childTid = spawn(&fn, thisTid);
string line;
foreach (line1; stdin.byLine) {
    line = to!string(line1);
    receiveOnly!(int);       // wait until the worker asks for the next line
    send(childTid, line);
}
int x = 0;
send(childTid, x);           // sentinel telling the worker there is no more input
That lets it load at least one more line from disk while my processing runs, at the cost of a copy operation, but this still feels silly. What I really need is fgets, or a way to combine stdin.byChunk(4096) with readln. I tried fgets:
char[] buf = new char[4096];
fgets(buf.ptr, 4096, stdio);
but it always fails with "stdio is a file and not a stream", and I'm not sure how to make it a stream. Any help would be appreciated, whichever approach you think is best. I'm not very good at D, so apologies for any noob mistakes.
There are actually already two layers of buffering under the hood (excluding the hardware itself): the C runtime library and the kernel both do a layer of buffering to minimize I/O costs.
First, the kernel keeps data from disk in its own buffer and will read ahead, loading beyond what you request in a single call if you are following a predictable pattern. This mitigates the low-level costs of seeking the device, and the cache is shared across processes: if you read a file with one program and then again with a second, the second will probably get it from the kernel's memory cache instead of the physical disk and may be noticeably faster.
Second, the C library, on which D's std.stdio is built, also keeps a buffer. readln and byLine ultimately call C file I/O functions, which read a chunk from the kernel at a time. (Fun fact: writes are also buffered by the C library, by line if the output is interactive and by chunk otherwise. Writing is quite slow, and doing it by chunk makes a big difference, but sometimes the C lib thinks a pipe isn't interactive when it is, which leads to a FAQ: Simple D program Output order is wrong.)
These C lib buffers also mitigate the costs of many small reads and writes by batching them up before even sending to the kernel. In the case of readln, it will likely read several kilobytes at once, even if you ask for just one line or one byte, and the rest stays in the buffer for next time.
So your byLine loop is already automatically buffered and should get decent I/O performance.
You might be able to do better yourself with a few techniques, though. You could try std.mmfile for a memory-mapped file and read it as if it were an array, but your files are too big to fit in the address space on 32 bit; it might work on 64 bit though. (Note that a memory-mapped file is NOT loaded all at once; it is just mapped to a memory address, and when you actually touch part of it, the operating system loads/saves it on demand.)
Or, of course, you can use the lower-level operating system functions like read and write from import core.sys.posix.unistd, or ReadFile and WriteFile from import core.sys.windows.windows, which bypass the C lib's layer (but, of course, keep the kernel layers, which you want; don't try to bypass those).
You can look for any win32 or posix system call C tutorials if you want to know more about using those functions. It is the same in D as in C, with minor caveats like the import instead of #include.
Once you load a chunk, you will most likely want to scan it for the newline and slice it to form the range you pass to your loop or other algorithms. The std.range and std.algorithm modules also have searching, splitting, and chunking functions that might help, but you need to be careful with lines that span the edges of your buffers to keep correctness and efficiency.
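To make that concrete, here is a minimal sketch of the chunk-and-scan approach using the raw POSIX read() call, written in C-style C++; the same call is available from D via import core.sys.posix.unistd, so the structure carries over almost verbatim. The 64 KB chunk size is an arbitrary choice for illustration.

#include <string>
#include <unistd.h>   // POSIX read()

int main() {
    char buf[64 * 1024];          // arbitrary chunk size
    std::string carry;            // partial line spanning two chunks
    ssize_t n;
    while ((n = read(0, buf, sizeof(buf))) > 0) {   // fd 0 is stdin
        size_t start = 0;
        for (size_t i = 0; i < size_t(n); ++i) {
            if (buf[i] == '\n') {
                // a complete line is the carried-over prefix plus this chunk's slice
                carry.append(buf + start, i - start);
                // ... process `carry` as one line here ...
                carry.clear();
                start = i + 1;
            }
        }
        carry.append(buf + start, size_t(n) - start);  // keep the unfinished tail
    }
    // `carry` may still hold a final line without a trailing newline
}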
But if your performance is good enough as it is, I'd say just leave it - the C lib+kernel's buffering do a pretty good job in most cases.
Related
I was working on a C++ tutorial exercise that asked to count the number of words in a file. It got me thinking about the most efficient way to read the inputs. How much more efficient is it really to read the entire file at once than it is to read small chunks (line by line or character by character)?
The answer changes depending on how you're doing the I/O.
If you're using the POSIX open/read/close family, reading one byte at a time will be excruciating since each byte will cost one system call.
If you're using the C fopen/fread/fclose family or the C++ iostream library, reading one byte at a time still isn't great, but it's much better. These libraries keep an internal buffer and only call read when it runs dry. However, since you're doing something very trivial for each byte, the per-call overhead will still likely dwarf the per-byte processing you actually have to do. But measure it and see for yourself.
Another option is to simply mmap the entire file and just do your logic on that. You might, or might not, notice a performance difference between mmap with and without the MAP_POPULATE flag. Again, you'll have to measure it and see.
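For the word-count exercise specifically, the mmap variant might look roughly like this (POSIX; MAP_POPULATE is Linux-specific and shown only because it is mentioned above, and error handling is omitted for brevity):

#include <cctype>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // MAP_POPULATE (Linux) pre-faults the pages; drop it to compare timings.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0));

    long words = 0;
    bool in_word = false;
    for (off_t i = 0; i < st.st_size; ++i) {
        bool space = std::isspace(static_cast<unsigned char>(data[i]));
        if (!space && !in_word) ++words;      // count word starts
        in_word = !space;
    }
    std::printf("%ld words\n", words);

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}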
The most efficient method for I/O is to keep the data flowing.
That said, reading one block of 512 characters is faster than 512 reads of 1 character. Your system may have made optimizations, such as caches, to make reading faster, but you still have the overhead of all those function calls.
There are different methods to keep the I/O flowing:
Memory mapped file I/O
Double buffering
Platform Specific API
Some simple experiments should suffice for demonstration; a rough sketch of the code follows the steps below.
Create a vector or array of 1 megabyte.
Start a timer.
Repeat 1000 times:
Read data into container using 1 read instruction.
End the timer.
Repeat the timing, this time using a for loop that performs 1,000,000 reads of 1 character each.
Compare your data.
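A rough sketch of that experiment in C++ (the file name "bigfile.dat" is a placeholder, and <chrono> supplies the timer):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    using clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::vector<char> buf(1 << 20);                 // 1 MB container

    // 1000 reads of 1 MB each, one fread per iteration.
    std::FILE* f = std::fopen("bigfile.dat", "rb"); // placeholder file name
    auto t0 = clock::now();
    for (int i = 0; i < 1000; ++i)
        std::fread(buf.data(), 1, buf.size(), f);
    auto t1 = clock::now();
    std::fclose(f);

    // 1,000,000 reads of 1 character each.
    f = std::fopen("bigfile.dat", "rb");
    auto t2 = clock::now();
    for (int i = 0; i < 1000000; ++i)
        std::fread(buf.data(), 1, 1, f);
    auto t3 = clock::now();
    std::fclose(f);

    std::printf("block reads: %lld ms, single-char reads: %lld ms\n",
                (long long)duration_cast<milliseconds>(t1 - t0).count(),
                (long long)duration_cast<milliseconds>(t3 - t2).count());
}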
Details
For each request from the hard drive, the following steps are performed (depending on platform optimizations):
Start hard drive spinning.
Read filesystem directory.
Search directory for the filename.
Get logical position of the byte requested.
Seek to the given track & sector.
Read 1 or more sectors of data into hard drive memory.
Return the requested portion of hard drive memory to the platform.
Spin down the hard drive.
All of this is overhead (apart from the step that actually reads the sectors).
The objective is to transfer as much data as possible while the hard drive is spinning; starting a hard drive takes more time than keeping it spinning.
I'm looking at a piece of code that copies a file to a USB device.
The following part is the important one:
while ((bytesRead = fread(buf, 1, 16*1024, m_hSource)) && !bAbort) {
    // write to target
    long bytesWritten = fwrite(buf, 1, bytesRead, m_hTarget);
    m_lBytesCopied += bytesWritten;
}
The thing is, the customer says it's pretty slow in comparison to normal PC<->USB speed. I didn't write this code, so it's my job to optimize it.
So I was wondering whether it would be a better approach to first read the complete file and then write it in one step, but I don't know how error-prone that would be.
The code also checks after each copy step whether all bytes were written correctly, so that might also slow down the process.
I'm not much of a C++ or hardware guru, so I'm asking you guys how I could speed things up and keep the copying successful.
Try to read/write in big chunks; 16 MB or 32 MB are not bad for copying a file.
If you just want to copy the file, you can always invoke system(); it'll be faster.
The code also checks after each copy step whether all bytes were written correctly, so that might also slow down the process.
You can check it by hashing bigger chunks instead. For example, split the file into 64 MB chunks and compare the hashes of those chunks; the BitTorrent protocol has this feature.
If you have mmap or MapViewOfFile available, map the source file first, then write it to the USB device. That way the read operation is handled by the kernel.
Kerrek just commented about using memcpy on mmap; memcpy between two mmapped files seems great.
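A sketch of that mmap-both-sides idea on POSIX (the helper name is mine; the destination must be sized with ftruncate before it can be mapped for writing, and a real version would check every return value and handle empty files):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>

bool copy_via_mmap(const char* src_path, const char* dst_path) {
    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) return false;

    struct stat st;
    fstat(src, &st);
    if (st.st_size == 0) { close(src); close(dst); return true; }  // nothing to copy
    ftruncate(dst, st.st_size);            // destination must already have the final size

    void* in  = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, src, 0);
    void* out = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, dst, 0);

    std::memcpy(out, in, st.st_size);      // the kernel handles the actual I/O

    munmap(in, st.st_size);
    munmap(out, st.st_size);
    close(src);
    close(dst);
    return true;
}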
Also note that most recent operating systems only flush writes to a USB stick when it is being removed; before removal the data just sits in a cache, so a copy done by the OS may merely appear faster.
What about overlapping reads and writes?
In the current code, the total time is time(read original) + time(write copy). If you read the first block, then start reading the second block while writing the first, and so on, your total time would be max(time(read original), time(write copy)) (plus the time to read the first and write the last block, which can't be pipelined).
It could be almost half the time if reading and writing take more or less the same time.
You can do it with two threads or with asynchronous IO. Unfortunately, threads and async IO are platform dependent, so you'll have to check your system manual or choose appropriate portable libraries.
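For example, the two-thread version with plain C++ threads might be sketched like this: a reader thread keeps a small bounded queue of blocks filled while the main thread drains it with fwrite, so the next read overlaps the current write. The block size and queue depth are arbitrary choices, and argument checking is omitted.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Block { std::vector<char> data; };

int main(int argc, char** argv) {
    std::FILE* in  = std::fopen(argv[1], "rb");
    std::FILE* out = std::fopen(argv[2], "wb");

    std::mutex m;
    std::condition_variable cv;
    std::queue<Block> q;
    bool done = false;
    const size_t kBlock = 1 << 20;      // 1 MB blocks, an arbitrary choice
    const size_t kMaxQueued = 4;        // bound the read-ahead

    std::thread reader([&] {
        for (;;) {
            Block b;
            b.data.resize(kBlock);
            size_t n = std::fread(b.data.data(), 1, kBlock, in);
            b.data.resize(n);
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return q.size() < kMaxQueued; });
            if (n == 0) { done = true; cv.notify_all(); return; }
            q.push(std::move(b));
            cv.notify_all();
        }
    });

    for (;;) {
        Block b;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || done; });
            if (q.empty() && done) break;
            b = std::move(q.front());
            q.pop();
            cv.notify_all();            // wake the reader if it was blocked on a full queue
        }
        std::fwrite(b.data.data(), 1, b.data.size(), out);  // overlaps with the next read
    }

    reader.join();
    std::fclose(in);
    std::fclose(out);
}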
I would just go with OS-specific functions, which will almost certainly do this faster than anything written only with portable C/C++ functions.
On Linux this could be the sendfile function; on Windows, CopyFile will do the job.
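A sendfile sketch for Linux might look like this (the helper name is mine; file-to-file sendfile needs a reasonably recent kernel, and on Windows CopyFile is already a single call so no sketch is needed):

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

bool copy_with_sendfile(const char* src_path, const char* dst_path) {
    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) return false;

    struct stat st;
    fstat(src, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        // sendfile copies inside the kernel; no user-space buffer is needed.
        ssize_t n = sendfile(dst, src, &offset, st.st_size - offset);
        if (n <= 0) break;              // error or nothing left to copy
    }

    close(src);
    close(dst);
    return offset == st.st_size;
}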
I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems, both of which make heavy use of posix_fadvise: the first is an mmap-based read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for moderate cases. The read() method manages our overall file cache much better and can deal well with hundreds of GB of files, but is badly rate limited; mmap is able to pre-cache data, making a sustained data rate of over 200 MB/s easy to maintain, but it cannot deal with large total data set sizes.
So my question comes to these:
A: Can read()-type file I/O be optimized further on Linux beyond the posix_fadvise calls, or, having tuned the disk scheduler, VMM and posix_fadvise calls, is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks is a similar problem to the one I am working on, and it provided a good starting point on this problem, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
if ((pid = fork())) {
    waitpid(pid, ...);
} else {
    dup2(dest, 1);      // the pipe becomes the child's stdout
    dup2(source, 0);    // the current file becomes the child's stdin
    execlp("cat", "cat", (char *)NULL);
}
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read()-type file I/O be optimized further on Linux beyond the posix_fadvise calls, or, having tuned the disk scheduler, VMM and posix_fadvise calls, is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading and so on, but it's dangerous and probably unproductive guesswork. I'd just go with splicing the files into a pipe.
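As a rough sketch of that splice-into-a-pipe idea (Linux-specific; the helper name is mine, and pipe_fd is assumed to be the write end of the pipe your processing stage reads from):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Dump one whole file into a pipe with zero user-space copies.
bool splice_file_to_pipe(const char* path, int pipe_fd) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    loff_t off = 0;
    for (;;) {
        // Move up to 1 MB per call from the file straight into the pipe.
        ssize_t n = splice(fd, &off, pipe_fd, nullptr, 1 << 20, SPLICE_F_MORE);
        if (n < 0) { close(fd); return false; }   // error
        if (n == 0) break;                        // end of file
    }
    close(fd);
    return true;
}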
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you from running out of memory while keeping your implementation simple if you don't actually have enough physical memory plus swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups if you have sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
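Put together, requesting a mapping with those flags could look like the following sketch (the function name and fallback strategy are mine; MAP_HUGETLB for a regular file mapping may simply fail unless huge pages are set up, which is why the retry is there):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only with the flags discussed above.
void* map_big_file(const char* path, size_t* len_out) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    int flags = MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE;
    void* p = mmap(nullptr, st.st_size, PROT_READ, flags | MAP_HUGETLB, fd, 0);
    if (p == MAP_FAILED)
        p = mmap(nullptr, st.st_size, PROT_READ, flags, fd, 0);  // retry without huge pages

    close(fd);                 // the mapping stays valid after close
    *len_out = st.st_size;
    return p;                  // MAP_FAILED on error
}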
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Couldn't that be half a megabyte instead?
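If the fragments really are predictable, the readahead call is essentially a one-liner per fragment (Linux-specific; the helper name is mine, and the offsets and lengths are placeholders for whatever the application can predict):

#include <fcntl.h>   // readahead() (Linux; needs _GNU_SOURCE with some compilers)

// Ask the kernel to load a fragment into the page cache before the
// application actually read()s it. It is only a hint, so errors are ignored.
void prefetch_fragment(int fd, off64_t offset, size_t length) {
    readahead(fd, offset, length);
}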
The problem here doesn't seem to be which API is used. It doesn't matter whether you use mmap() or read(); the disk still has to seek to the specified point and read the data (although the OS does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes), because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise, like Basile did, reading more than 2 KB consecutively so the disk doesn't have to seek that often.
I'm writing an external merge sort. It works like this: read k chunks from the big file, sort them in memory, perform a k-way merge, done. So I need to read sequentially from different portions of the file during the k-way merge phase. What's the best way to do that: several ifstreams, or one ifstream and seeking? Also, is there a library for easy async IO?
Use one ifstream at a time on the same file. More than one wastes resources, and you'd have to seek anyway (because by default the ifstream's file pointer starts at the beginning of the file).
As for a C++ async IO library, check out this question.
EDIT: I originally misunderstood what you are trying to do (this Wikipedia article filled me in). I don't know how much ifstream buffers by default, but you can turn off buffering by using the pubsetbuf(0, 0); method described here, and then do your own buffering. This may be slower, however, than using multiple ifstreams with automatic buffering. Some benchmarking is in order.
Definitely try the multiple streams. Seeking probably throws away internally buffered data (at least within the process, even if the OS retains it in cache), and if the items you're sorting are small that could be very costly indeed.
Anyway, it shouldn't be too hard to compare the performance of your two fstream strategies. Do a simple experiment with k = 2.
Note that there may be a limit on the number of simultaneous open files one process can have (ulimit -n). If you reach that, then you might want to consider using a single stream, but buffering data from each of your k chunks manually, as sketched below.
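A sketch of that single-stream approach: one ifstream shared by all runs, with one seek per buffer refill rather than per record (the RunReader name is mine, the run boundaries are assumed to be known from the sort phase, and the chunk size is arbitrary):

#include <algorithm>
#include <fstream>
#include <vector>

// One buffered "cursor" per sorted run, all sharing a single ifstream.
struct RunReader {
    std::streamoff pos;        // next file offset of this run
    std::streamoff end;        // one past the last byte of this run
    std::vector<char> buf;     // refilled in large chunks
    size_t cursor = 0;         // read position inside buf

    bool refill(std::ifstream& file, size_t chunk = 1 << 20) {
        if (pos >= end) return false;
        size_t want = (size_t)std::min<std::streamoff>(chunk, end - pos);
        buf.resize(want);
        file.seekg(pos);                       // one seek per refill, not per record
        file.read(buf.data(), want);
        pos += file.gcount();
        buf.resize((size_t)file.gcount());
        cursor = 0;
        return !buf.empty();
    }
};

During the merge, whichever run's buffer runs dry calls refill() again before its next record is consumed.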
It might be worth mmapping the file and using multiple pointers, if the file is small enough (equivalently: your address space is large enough).
I am running simulation code that is largely bound by CPU speed. I am not interested in pushing data in/out to a user interface, simply saving it to disk as it is computed.
What would be the fastest solution that would reduce overhead? iostreams? printf? I have previously read that printf is faster. Will this depend on my code and is it impossible to get an answer without profiling?
This will be running in Windows and the output data needs to be in text format, tab/comma separated, with formatting/precision options for mostly floating point values.
Construct (large-ish) blocks of data which can be sequentially written and use asynchronous IO.
Accurate profiling will be painful; read some papers on the subject: scholar.google.com.
I haven't used them myself, but I've heard memory mapped files offer the best optimisation opportunities to the OS.
Edit: related question, and Wikipedia article on memory mapped files — both mention performance benefits.
My thought is that you are tackling the wrong problem. Why are you writing out vast quantities of text-formatted data? If it is because you want it to be human readable, write a quick browser program that reads the binary data and formats it on the fly - this way the simulation application can quickly write out binary data and the browser can do the grunt work of formatting the data as and when needed. If it is because you are using some stats package to read and analyse text data, then write one that inputs binary data.
Scott Meyers' More Effective C++, Item 23 "Consider alternative libraries", suggests using stdio over iostream if you prefer speed over safety and extensibility. It's worth checking.
The fastest way is whatever is fastest for your particular application running on its typical target OS and hardware. The only sensible thing to do is to try several approaches and time them. You probably don't need a complete profile, and the exercise should only take a few hours. I would test, in this order:
normal C++ stream I/O
normal stream I/O using ostream::write()
use of the C I/O library
use of system calls such as write()
asynch I/O
And I would stop when I found a solution that was fast enough.
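A minimal harness for that kind of comparison might look like this; each candidate is just a callable that writes the same data, only the fprintf variant is filled in as an example, and the output file name is a placeholder:

#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Time one candidate output strategy end to end (including close,
// which is where buffered approaches pay their remaining cost).
long long time_ms(const std::function<void()>& write_everything) {
    auto t0 = std::chrono::steady_clock::now();
    write_everything();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

int main() {
    std::vector<std::pair<std::string, std::function<void()>>> candidates = {
        {"fprintf", [] {
            std::FILE* f = std::fopen("out_printf.txt", "w");   // placeholder name
            for (int i = 0; i < 1000000; ++i)
                std::fprintf(f, "%d\t%.6f\n", i, i * 0.001);
            std::fclose(f);
        }},
        // ... add ofstream, ostream::write, raw write(), async variants here ...
    };
    for (auto& c : candidates)
        std::printf("%s: %lld ms\n", c.first.c_str(), time_ms(c.second));
}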
Text format means it's for human consumption. The speed at which humans can read is far, far lower than the speed of any reasonable output method. There's a contradiction somewhere, and I suspect it's the "output must be text format" requirement.
Therefore, I believe the correct way is to output binary and provide a separate viewer to convert individual entries to readable text. Formatting in the viewer need only be as fast as people can read.
Mapping the file to memory (i.e. using a memory-mapped file) and then just memcpy-ing data there is a really fast way of reading/writing.
You can use several threads/cores to write the data, and the OS/kernel will sync the pages to disk using the same kind of routines used for virtual memory, which one can expect to be optimized to hell and back, more or less.
Chiefly, there should be few extra copies/buffers in memory when doing this: once a page has been written, it is queued for the disk by the kernel's normal writeback machinery.
Open the file in binary mode, and write "unformatted" data to the disc.
#include <fstream>
using namespace std;

fstream myFile;
...
myFile.open("mydata.bin", ios::in | ios::out | ios::binary);
...
struct Data {
    int key;
    double value;
    char desc[10];
};

Data x;
myFile.seekp(location1);                 // location1: byte offset of this record
myFile.write((char*)&x, sizeof(Data));
EDIT: The OP added the "Output data needs to be in text format, whether tab or comma separated." constraint.
If your application is CPU bound, the formatting of output is overhead you do not need. Binary data is much faster to write and read than ASCII, is smaller on the disk (i.e. fewer total bytes are written with binary than with ASCII), and because it is smaller it is faster to move around a network (including a network-mounted file system). All indicators point to binary as a good overall optimization.
Viewing the binary data can be done after the run with a simple utility that dumps the data to ASCII in whatever format is needed. I would encourage adding some version information to the resulting binary data to ensure that changes in the format of the data can be handled by the dump utility.
Moving from binary to ASCII, and then quibbling over the relative performance of printf versus iostreams, is likely not the best use of your time.
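The dump utility mentioned above could be as small as this sketch, assuming the same Data layout and the mydata.bin name used earlier:

#include <cstdio>
#include <fstream>

// Must match the layout written by the simulation.
struct Data {
    int key;
    double value;
    char desc[10];
};

int main() {
    std::ifstream in("mydata.bin", std::ios::binary);
    Data x;
    while (in.read(reinterpret_cast<char*>(&x), sizeof(Data)))
        std::printf("%d\t%.6f\t%.10s\n", x.key, x.value, x.desc);
}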
The fastest way is completion-based asynchronous IO.
By giving the OS a set of data to write, which it hasn't actually written when the call returns, the OS can reorder it to optimise write performance.
The API for doing this is OS-specific: on Linux it's called AIO; on Windows it's called I/O completion ports.
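As one concrete flavour of the idea, here is a hedged POSIX AIO sketch using aio_write from <aio.h> (Linux additionally has io_submit/io_uring, and Windows completion ports look quite different; this only shows the portable POSIX API, helper names are mine, and it may need -lrt on older glibc):

#include <aio.h>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Queue one asynchronous write; the kernel is free to reorder/merge it.
// Returns true if the request was accepted.
bool queue_write(int fd, const char* data, size_t len, off_t offset, aiocb* cb) {
    std::memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = const_cast<char*>(data);  // must stay valid until completion
    cb->aio_nbytes = len;
    cb->aio_offset = offset;
    return aio_write(cb) == 0;
}

// Later: block until the request finishes and check its result.
bool wait_for(aiocb* cb) {
    const aiocb* list[1] = { cb };
    aio_suspend(list, 1, nullptr);             // block until done
    return aio_error(cb) == 0 && aio_return(cb) >= 0;
}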
A fast method is to use double buffering and multiple threads (at least two).
One thread is in charge of writing data to the hard drive. This task checks the buffer and, if it is not empty (or whatever other rule you choose), begins writing to the hard drive.
The other thread writes formatted text to the buffer.
One performance issue with hard drives is the amount of time required to get up to speed and position the head to the correct location. To avoid paying this cost repeatedly, the objective is to write to the hard drive continually so that it doesn't stop. This is tricky and may involve things outside of your program's scope (such as other programs running at the same time). The larger the chunk of data written to the hard drive, the better.
Another thorn is finding empty slots on the hard drive to put the data in. A fragmented hard drive will be slower than a freshly formatted or defragmented one.
If portability is not an issue, you can check your OS for some APIs that perform block writes to the hard drive. Or you can go down lower and use the API that writes directly to the drive.
You may also want your program to change its priority so that it is one of the most important tasks running.