Thread Optimization [duplicate] - c++

I have an input file in my application that contains a vast amount of information. Reading over it sequentially, and at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads that have separate and distinct ifstreams reading from two unique file offsets of the same file. I can't just start one ifstream up and then make a copy of it using its copy constructor (since it's uncopyable). So, how do I handle this?
Immediately I can think of two ways:
1. Construct a new ifstream for the second thread, open it on the same file.
2. Share a single instance of an open ifstream across both threads (using for instance boost::shared_ptr<>). Seek to the appropriate file offset that the current thread is interested in when the thread gets a time slice.
Is one of these two methods preferred?
Is there a third (or fourth) option that I have not yet thought of?
Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible), is some OS level disk caching at both file offsets simultaneously.
Thanks.

Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.
If you have a single std::ifstream you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.
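A minimal sketch of the two-stream approach, assuming C++11 threads are available. The names read_range and read_two_halves are illustrative and not from the question; each thread opens its own std::ifstream on the same path, so nothing is shared and no locking is needed.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <thread>
#include <vector>

// Read up to `count` bytes starting at `offset` through a stream private to the caller.
std::vector<char> read_range(const std::string& path, std::streamoff offset, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(count);
    in.seekg(offset);
    in.read(buf.data(), static_cast<std::streamsize>(count));
    buf.resize(static_cast<std::size_t>(in.gcount()));  // shrink on a short or failed read
    return buf;
}

// Read the two halves of a file concurrently, one independent ifstream per thread.
void read_two_halves(const std::string& path, std::size_t file_size,
                     std::vector<char>& first, std::vector<char>& second) {
    const std::size_t half = file_size / 2;
    std::thread t1([&] { first = read_range(path, 0, half); });
    std::thread t2([&] {
        second = read_range(path, static_cast<std::streamoff>(half), file_size - half);
    });
    t1.join();
    t2.join();
}
```

Whether this actually beats a single sequential reader depends on the disk and the OS, so it is worth timing both on the target machine.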

Between the two, I would prefer the second. Having two openings of the same file might cause an inconsistent view between the files, depending on the underlying OS.
For a third option, pass a reference or raw pointer into the other thread. So long as the semantics are that one thread "owns" the istream, the raw pointer or reference are fine.
Finally note that on the vast majority of hardware, the disk is the bottleneck, not CPU, when loading large files. Using two threads will make this worse because you're turning a sequential file access into a random access. Typical hard disks can do maybe 100MB/s sequentially, but top out at 3 or 4 MB/s random access.

Other option:
Memory-map the file, create as many memory istream objects as you want. (istrstream is good for this, istringstream is not).

It really depends on your system. A modern system will generally read
ahead; seeking within the file is likely to inhibit this, so should
definitely be avoided.
It might be worth experimenting how read-ahead works on your system:
open the file, then read the first half of it sequentially, and see how
long that takes. Then open it, seek to the middle, and read the second
half sequentially. (On some systems I've seen in the past, a simple
seek, at any time, will turn off read-ahead.) Finally, open it, then
read every other record; this will simulate two threads using the same
file descriptor. (For all of these tests, use fixed length records, and
open in binary mode. Also take whatever steps are necessary to ensure
that any data from the file is purged from the OS's cache before
starting the test—under Unix, copying a file of 10 or 20 Gigabytes
to /dev/null is usually sufficient for this.)
That will give you some ideas, but to be really certain, the best
solution would be to test the real cases. I'd be surprised if sharing a
single ifstream (and thus a single file descriptor), and constantly
seeking, won, but you never know.
I'd also recommend system specific solutions like mmap, but if you've
got that much data, there's a good chance you won't be able to map it
all in one go anyway. (You can still use mmap, mapping sections of it
at a time, but it becomes a lot more complicated.)
Finally, would it be possible to get the data already cut up into
smaller files? That might be the fastest solution of all. (Ideally,
this would be done where the data is generated or imported into the
system.)

My vote would be a single reader, which hands the data to multiple worker threads.
If your file is on a single disk, then multiple readers will kill your read performance. Yes, your kernel may have some fantastic caching or queuing capabilities, but it is going to be spending more time seeking than reading data.

Related

On a c++ efficient storage, flushing into file(s) strategy

Here is the situation: A c++ program is endlessly generating data in a regular fashion. The data needs to be stored in persistent storage very quickly so it does not impede the computing time. It is not possible to know the amount of data that will be stored in advance.
After reading a couple of related posts, I ended up following this naive strategy:
Creating one std::ofstream ofs
Opening a new file ofs.open("path/file", std::ofstream::out | std::ofstream::app)
Adding std::string using the operator <<
Closing the file once writing has terminated: ofs.close()
Nevertheless, I am still confused about the following:
Since the data will only be read afterwards, is it possible to use a binary (ios::binary) file storage? Would that be faster?
I have understood that flushing should be done automatically by std::ofstream. Am I safe to use it as such? Is there any impact on memory I should be aware of? Do I have to optimize the std::ofstream in some way (changing its buffer size?)?
Should I be concerned about the file getting bigger and bigger? Should I close it at some point and open a new one?
Does using std::string have some drawbacks? Are there hidden conversions that could be avoided?
Is using std::ofstream::write() more advantageous?
Thanks for your help.
1. Since the data will only be read afterwards, is it possible to use a binary (ios::binary) file storage? Would that be faster?
All data on any storage device is binary, so telling the stream to save it as such results in a more or less optimized saving of 0s and 1s. It depends on many things, including how you are going to use or read the data afterwards. Some of them are listed in Writing a binary file in C++ very fast.
When it comes to storing on an HD, the performance of your code is always limited by the speed of the particular HD (which is a well-known fact).
Try to give some "certainty/frames" to your questions; they are too general to be stated as a "problem".
I'm probably not answering your direct questions, but please excuse me if I take a step back.
If I understand the issue correctly, the concern is about staying too long writing to disk that would delay the endless data generation.
Perhaps you can allocate a thread just for writing, while processing continues on the main thread.
The writer thread could wake up at periodic intervals to write to disk what has been generated so far.
Communication between the two threads can be either:
two buffers (one active where the generation happens, one frozen, ready to be written to disk on the next batch)
or a queue of data, inserted by the producer and removed by the consumer/writer.
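The queue variant could be sketched roughly as follows, assuming C++11. AsyncWriter is a hypothetical helper name, not something from the question; the producer calls push() and continues computing while a background thread drains the queue to the file in batches.

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical helper: a background thread that drains queued records to a file.
class AsyncWriter {
public:
    explicit AsyncWriter(const std::string& path)
        : out_(path, std::ios::binary | std::ios::app),
          worker_([this] { run(); }) {}

    ~AsyncWriter() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        worker_.join();  // everything still queued is written before this returns
    }

    // Called by the producer; returns as soon as the record is queued.
    void push(std::string record) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(record));
        }
        ready_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) break;  // only reachable once done_ is set
            std::queue<std::string> batch;
            batch.swap(queue_);  // take everything queued so far
            lock.unlock();       // write without blocking the producer
            while (!batch.empty()) {
                out_.write(batch.front().data(),
                           static_cast<std::streamsize>(batch.front().size()));
                batch.pop();
            }
            lock.lock();
        }
    }

    std::ofstream out_;
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

The queue here is unbounded; if the disk cannot keep up with the producer for long periods, memory use will grow, so a real implementation would probably cap the queue size.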

Best way to read/write file in multithreaded environment (C++)

I have a multithreaded program that reads and writes files. One thread receives data and writes them in a file. Every 250 Mb of data, a new file is created. Multiple other threads can read into these files to retrieve data. I'm using C++ std file stream.
To prevent problems, my current implementation uses two file descriptors for the same file: one for readers and one for the writer. A mutex protects from multiple access at the same time, and the file descriptor position is moved each time the mutex owner needs it.
I really need to be able to read in the file as fast as possible, and the mutex doesn't really help me.
Firstly, I would like to know if it's safe to read and write the file or have multiple reads at the same time (on every platform).
Secondly, if yes, I would like to know whether it is safe for the hardware, such as the disk read-and-write head of an HDD. The software works on the disk all the time to save data, and I don't want my algorithm to shorten the hard disk's lifetime (already short) too much.
Thank you for your help
There is no problem regarding multiple threads reading the same file.
Now, if I understood your description correctly, you do not modify already-written data, you just continuously append data to your file until it reaches 250Mb, then you continue writing on a new file.
If this is the case, you may not need a mutex at all. For instance, you might be able to keep your whole "file" in memory until it reaches 250mb, and only then write it all to disk, so you know that any files already on disk aren't going to be written anymore and can be read freely with no worries. As for the file that is still being written, you can have a global integer that holds how many bytes (or strings or whatever you use) have already been written, and reading threads are limited by this integer, which does not need a lock as long as you only update the integer after you have already written the data (since you said there is only one thread writing data).
Simply reading the integer cannot corrupt it even when being done by multiple threads at the same time and being written by a single one, so this will ensure your reader threads will not read beyond the limit, and such limit will always be safe and consistent, while the writer-thread can peacefully write data in an area that is guaranteed to not be bothered by read-threads until it is finished.
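In C++11 and later, a plain integer that one thread writes while others read it is formally a data race, so the portable way to express the "update the counter only after the data is written" rule is a std::atomic with release/acquire ordering. A minimal sketch under that assumption follows; AppendLog and its members are hypothetical names, and the buffer is sized once up front so it is never reallocated under the readers.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical append-only buffer: one writer thread, any number of reader threads.
class AppendLog {
public:
    explicit AppendLog(std::size_t capacity) : data_(capacity), published_(0) {}

    // Writer thread only: copy the bytes first, then publish the new size.
    bool append(const char* bytes, std::size_t n) {
        const std::size_t old_size = published_.load(std::memory_order_relaxed);
        if (n > data_.size() - old_size) return false;  // buffer full
        std::memcpy(data_.data() + old_size, bytes, n);
        published_.store(old_size + n, std::memory_order_release);
        return true;
    }

    // Any reader thread: bytes in [0, readable()) are complete and never change.
    std::size_t readable() const { return published_.load(std::memory_order_acquire); }
    const char* data() const { return data_.data(); }

private:
    std::vector<char> data_;              // sized once, never reallocated
    std::atomic<std::size_t> published_;  // number of bytes readers may touch
};
```

Readers call readable() once and then stay below the returned limit; the release store paired with the acquire load is what guarantees they see the bytes that were copied before the counter moved.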
As for your second question, if you are indeed able to keep the currently-being-written file fully in memory, that will already save up some HDD usage, as well as time. Additionally, keep in mind most modern HDDs have 32Mb+ of cache, so it is not like every read and write will be directly hitting the HDD itself, unless you have a ton of threads reading random files and random parts of them all the time. If that is the case, there is probably not much you can do to help the HDD. And if that's not the case, there is not much to worry about, as the OS and the caches will do what they were meant to do :)

c++: how to optimize IO?

I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that uploading all of this information in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to upload every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?
Certainly the fastest (but the most fragile) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std::map with an allocator which will allocate in a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...
The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the i/o as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and you can convert back to the map easily and quickly. If, for example your keys are ints, and your values are of constant size then you could make an array of keys, and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().
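A rough sketch of that two-array convention for a std::map<int,int>, using std::fstream in binary mode rather than raw read()/write(). The function names save_map and load_map and the 64-bit count header are assumptions made for illustration; the file is only readable on a machine with the same int size and byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Flatten the map into two contiguous arrays and write them with three write() calls.
bool save_map(const std::map<int, int>& m, const std::string& path) {
    std::vector<int> keys, values;
    keys.reserve(m.size());
    values.reserve(m.size());
    for (const auto& kv : m) {
        keys.push_back(kv.first);
        values.push_back(kv.second);
    }
    const std::uint64_t n = keys.size();
    const std::streamsize bytes = static_cast<std::streamsize>(n * sizeof(int));
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(keys.data()), bytes);
    out.write(reinterpret_cast<const char*>(values.data()), bytes);
    return static_cast<bool>(out);
}

// Read the count, then both arrays, then rebuild the map.
bool load_map(const std::string& path, std::map<int, int>& m) {
    std::ifstream in(path, std::ios::binary);
    std::uint64_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;
    std::vector<int> keys(static_cast<std::size_t>(n));
    std::vector<int> values(static_cast<std::size_t>(n));
    const std::streamsize bytes = static_cast<std::streamsize>(n * sizeof(int));
    in.read(reinterpret_cast<char*>(keys.data()), bytes);
    in.read(reinterpret_cast<char*>(values.data()), bytes);
    if (!in) return false;
    m.clear();
    // Keys were written in sorted order, so hinting at end() keeps each insert cheap.
    for (std::size_t i = 0; i < keys.size(); ++i) m.emplace_hint(m.end(), keys[i], values[i]);
    return true;
}
```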
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.
Some guidelines:
multiple calls to read() are more expensive than a single call
binary files are faster than text files
single file is faster than multiple files for large values of "multiple"
use memory-mapped files if you can
use 64 bit OS to let OS manage the memory for you
Ideally, I'd try to put all long doubles into memory-mapped file, and all maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks in such a way that the chunks are never all used together, and an entire chunk is needed whenever any of it is needed. This way you can load chunks when they are needed and discard them when they are not.
These suggestions of uploading the whole data to the RAM are good when two conditions are met:
Sum of all I/O times during the run is much more than the cost of loading all data to RAM
Relatively large portion of all data is being accessed during application run
(they are usually met when some application is running for a long time processing different data)
However for other cases other options might be considered.
E.g. it is essential to understand whether the access pattern is truly random. If not, look into reordering data to ensure that items that are accessed together are close to each other. This will ensure that OS caching is performing at its best, and will also reduce HDD seek times (not an issue for SSDs, of course).
If accesses are truly random, and the application does not run long enough to amortize the one-time data loading cost, I would look into the architecture, e.g. by extracting this data manager into a separate module that keeps the data preloaded.
For Windows it might be system service, for other OSes other options are available.
Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.
As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping out the caches between memory and disk, the way it is often done when virtual memory pages need to be swapped to disk. It is essentially the same problem.
One common method is the Least Recently Used Algorithm for determining which page will be swapped.
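For illustration, a minimal least-recently-used cache can be built from a std::list plus a std::unordered_map. LruCache is a hypothetical name and this is only a sketch (single-threaded, fixed entry count rather than a byte budget), not a drop-in component.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    Value* get(const Key& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);  // mark as most recently used
        return &it->second->second;
    }

    // Inserts or updates an entry, evicting the least recently used one if full.
    void put(const Key& key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (items_.size() == capacity_) {
            index_.erase(items_.back().first);  // back of the list is the oldest entry
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

private:
    std::size_t capacity_;
    std::list<std::pair<Key, Value>> items_;  // front = most recently used
    std::unordered_map<Key, typename std::list<std::pair<Key, Value>>::iterator> index_;
};
```

On a miss, the caller would load the value from disk and put() it; list iterators stay valid across splice, which is why the index can hold them.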
It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory mapped files. This generally requires that the file has been laid out as if the objects were in memory, so you will need to only use POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS in better caching those values (ie, keeping them in memory for you, rather than always going to the disk to read them).
Another option would be to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that maps a range of values to the file that contains them.
Then, you can only access the set of files required.
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea will be to use serialization and compression. You write several files, among which an index, and compress all of them (zip). Then, at launch time, you start by loading the index and save it in memory.
Whenever you need to access a value, you first try your cache; if it is not there, you access the file that contains it, decompress it in memory, and dump its content in your cache. Note: if the cache is too small, you have to be picky about what you dump in... or reduce the size of the files.
The frequently accessed values will stay in cache, avoiding unnecessary round-trip, and because the file is zipped there will be less IO.
Structure your data in a way that caching can be effective. For instance, when you are reading "certain pieces," if those are all contiguous it won't have to seek around the disk to gather all of them.
Reading and writing in batches, instead of record by record will help if you are sharing disk access with another process.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understand, the std::map objects are also pre-calculated and there are no insert/remove operations, only searches. How about replacing the maps with something like std::unordered_map (hash_map in older libraries) or sparsehash? In theory it can give a performance gain.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as berkeley db: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot and keeping other parts on disk.

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems, both of which make heavy use of posix_advise. The first is a mmaped read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for the moderate cases. The read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited. mmap is able to pre-cache data, making the sustained data rate of over 200MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks is a similar problem to what I am working on and provided a good starting point, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
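pid_t pid = fork();
if (pid > 0) {
    // Parent: wait for the child running cat to finish.
    waitpid(pid, ...);
} else if (pid == 0) {
    // Child: make the pipe stdout and the source file stdin, then become cat.
    dup2(dest, 1);
    dup2(source, 0);
    execlp("cat", "cat", (char *)NULL);  // the argument list must end with a null pointer
    _exit(127);                          // only reached if exec fails
}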
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered, and lazy. I suspect this flag is redundant, the VFS likely does this better by default.
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?
The problem here doesn't seem to be which api is used. It doesn't matter if you use mmap() or read(), the disc still has to seek to the specified point and read the data (although the os does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise like Basile did to read more than 2kb consecutively so the disc doesn't have to seek that often.

Several ifstreams vs. ifstream + constant seeking

I'm writing an external merge sort. It works like this: read k chunks from a big file, sort them in memory, perform a k-way merge, done. So I need to sequentially read from different portions of the file during the k-way merge phase. What's the best way to do that: several ifstreams, or one ifstream and seeking? Also, is there a library for easy async IO?
Use one ifstream at a time on the same file. More than one wastes resources, and you'd have to seek anyway (because by default the ifstream's file pointer starts at the beginning of the file).
As for a C++ async IO library, check out this question.
EDIT: I originally misunderstood what you are trying to do (this Wikipedia article filled me in). I don't know how much ifstream buffers by default, but you can turn off buffering by using the pubsetbuf(0, 0); method described here, and then do your own buffering. This may be slower, however, than using multiple ifstreams with automatic buffering. Some benchmarking is in order.
Definitely try the multiple streams. Seeking probably throws away internally buffered data (at least within the process, even if the OS retains it in cache), and if the items you're sorting are small that could be very costly indeed.
Anyway, it shouldn't be too hard to compare the performance of your two fstream strategies. Do a simple experiment with k = 2.
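As a starting point for such an experiment, here is a sketch of the multiple-stream variant: one std::ifstream per sorted run, each seeked to the start of its run, merged through a min-heap. The Run struct, the merge_runs name and the fixed-size std::int32_t records are assumptions for illustration; the question does not say what the records look like.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// One sorted run inside the big file: byte offset of its start and number of records.
struct Run {
    std::streamoff offset;
    std::size_t count;
};

// Merge the sorted runs of `path` into one sorted vector, using one ifstream per run.
std::vector<std::int32_t> merge_runs(const std::string& path, const std::vector<Run>& runs) {
    using Item = std::pair<std::int32_t, std::size_t>;  // (value, index of its run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::ifstream> streams;
    std::vector<std::size_t> remaining;
    std::size_t total = 0;

    for (const Run& run : runs) {
        streams.emplace_back(path, std::ios::binary);
        streams.back().seekg(run.offset);  // the only seek this stream ever does
        remaining.push_back(run.count);
        total += run.count;
    }

    // Pull the next record of run r into the heap, if the run has any left.
    auto refill = [&](std::size_t r) {
        std::int32_t value = 0;
        if (remaining[r] == 0) return;
        if (streams[r].read(reinterpret_cast<char*>(&value), sizeof value)) {
            --remaining[r];
            heap.emplace(value, r);
        }
    };
    for (std::size_t r = 0; r < runs.size(); ++r) refill(r);

    std::vector<std::int32_t> merged;
    merged.reserve(total);
    while (!heap.empty()) {
        const Item smallest = heap.top();
        heap.pop();
        merged.push_back(smallest.first);
        refill(smallest.second);  // replace it with the next record from the same run
    }
    return merged;
}
```

A real external sort would write the merged output to a file instead of collecting it in memory; the vector is used here only to keep the sketch short.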
Note that there may be a limit on the number of simultaneous open files one process can have (ulimit -n). If you reach that, then you might want to consider using a single stream, but buffering data from each of your k chunks manually.
It might be worth mmapping the file and using multiple pointers, if the file is small enough (equivalently: your address space is large enough).