Accessing separate files from separate threads, is this efficient? - c++

I have an application that loads files and processes data. Let's assume I have like 10...20 files to process.
some requirements, to make the question clearer:
files are small, maybe a few MB max
there might be a dozen files, maybe a hundred
one example might be parsing CSV data, or JSON, loading game 3d models
One idea is to use some thread pool and process files in parallel. Is this efficient? Can my operating system handle file access from multiple threads?
I found this question:
Accessing a single file with multiple threads
But in my application one thread would access its "own" file, so there wouldn't be any collisions.
In my application, I'm using C++/STL, but I'd like to know the general opinion about filesystems on Linux and Windows.

You need to benchmark. (probably in your case it could be worth to use several threads; however in your case, the loading should be so quick, even done sequentially, that your average user won't notice)
In many cases, when you deal with medium sized files (e.g. less than a dozen of megabytes each, or perhaps even half a gigabyte each) which have been accessed recently, these files practically sit in the page cache. So you won't access the disk itself, and your program practically works in RAM (and then multithreading should be effective).
BTW, Linux has readahead(2), posix_fadvise(2), madvise(2) to hint the kernel virtual memory subsystem (that is, to give hints to the page cache).
If your common use case is accessing the disk itself (e.g. because the files are quite big, or because you have not accessed them recently before, so they are not in the page cache), then multi-threading won't help, because the bottleneck becomes the hardware disk.
Remember that a disk (even an SSD one) is many thousands times slower than RAM and it does IO operations sequentially.
Also, you may spend some amount of CPU time in parsing the files. If that takes a significant amount of CPU, it is worth to be run in several independent threads.

In my experience you get more performance if the processing of the data is heavy. In this case you really make parallel the execution of your program. You also need to know how many core your cpu have. It is not worth have more threads than cpu cores.
If your processing is "light", probably your threads are always waiting of disk to complete reading, with little, if ever, gain in performance.

Related

why multi-thread cant improve a mmap task?

I have a big task, which need to read 500 files (50G in total).
for every file, i should read it out, and do some calculation according to data from file. just calculate, nothing else. and i can ensure tasks are independent, just share some signleton object to read(i think that wont be the problem).
currently, i use mmap to get the file content's start pointer, and loop to calculate.
in single thread, i run the task, cost 30s,
i run it in a thread_pool, it cost me 35s(6 thread).
my machine is a 16G memory, 2.2G hz cpu with 8 thread.
I try a lot of setting, and carefully ensure the independent of tasks.
I am not so good at hardware, is there a hard limit about IO, that limit my speed? can anyone remind me is there anything i can read?
sorry, the code is too complex, i cant make a valid demo here.
You can try to use the MAP_POPULATE flag on mmap to read ahead if you want to load the whole file or use madvise.
The most important hardware detail here is not mentioned, if you read from SSD or HDD but i assume you use a SSD, otherwise the thread pool code would be much much slower.
I don't understand why you use mmaping here. There are only three valid reasons to mmap a file, first the data structure on disk is complex and you like to poke around, which is slow as it makes read ahead much less efficient. You need shared memory between processes. Or you work on huge files and need the OS functionality to swap out data to the file when your system comes under memory stress (all databases just do it for only this single reason).

Hard disk contention using multiple threads

I have not performed any profile testing of this yet, but what would the general consensus be on the advantages/disadvantages of resource loading from the hard disk using multiple threads vs one thread? Note. I am not talking about the main thread.
I would have thought that using more than one "other" thread to do the loading to be pointless because the HD cannot do 2 things at once, and therefore would surely only cause disk contention.
Not sure which way to go architecturally, appreciate any advice.
EDIT: Apologies, I meant to mean an SSD drive not a magnetic drive. Both are HD's to me, but I am more interested in the case of a system with a single SSD drive.
As pointed out in the comments one advantage of using multiple threads is that a large file load will not delay the presentation of a smaller for to the receiver of the thread loader. In my case, this is a big advantage, and so even if it costs a little perf to do it, having multiple threads is desirable.
I know there are no simple answers, but the real question I am asking is, what kind of performance % penalty would there be for making the parallel disk writes sequential (in the OS layer) as opposed to allowing only 1 resource loader thread? And what are the factors that drive this? I don't mean like platform, manufacturer etc. I mean technically, what aspects of the OS/HD interaction influence this penalty? (in theory).
FURTHER EDIT:
My exact use case are texture loading threads which only exist to load from HD and then "pass" them on to opengl, so there is minimal "computation in the threads (maybe some type conversion etc). In this case, the thread would spend most of its time waiting for the HD (I would of thought), and therefore how the OS-HD interaction is managed is important to understand. My OS is Windows 10.
Note. I am not talking about the main thread.
Main vs non-main thread makes zero difference to the speed of reading a disk.
I would have thought that using more than one "other" thread to do the loading to be pointless because the HD cannot do 2 things at once, and therefore would surely only cause disk contention.
Indeed. Not only are the attempted parallel reads forced to wait for each other (and thus not actually be parallel), but they will also make access pattern of the disk random as opposed to sequential, which is much much slower due to disk head seek time.
Of course, if you were to deal with multiple hard disks, then one thread dedicated for each drive would probably be optimal.
Now, if you were using a solid state drive instead of a hard drive, the situation isn't quite so clear cut. Multiple threads may be faster, slower, or comparable. There are probably many factors involved such as firmware, file system, operating system, speed of the drive relative to some other bottle neck, etc.
In either case, RAID might invalidate assumptions made here.
It depends on how much processing of the data you're going to do. This will determine whether the application is I/O you bound or compute bound.
For example, if all you are going to do to the data is some simple arithmetic, e.g. add 1, then you will end up being I/O bound. The CPU can add 1 to data far quicker than any I/O system can deliver flows of data.
However, if you're going to do a large amount of work on each batch of data, e.g. a FFT, then a filter, then a convolution (I'm picking random DSP routine names here), then it's likely that you will end up being compute bound; the CPU cannot keep up with the data being delivered by the I/O subsystem which owns your SSD.
It is quite an art to judge just how an algorithm should be structured to match the underlying capabilities of the underlying machine, and vice versa. There's profiling tools like FTRACE/Kernelshark, Intel's VTune, which are both useful in analysing exactly what is going on. Google does a lot to measure how many searches-per-Watt their hardware accomplishes, power being their biggest cost.
In general I/O of any sort, even a big array of SSDs, is painfully slow. Even the main memory in a PC (DDR4) is painfully slow in comparison to what the CPU can consume. Even the L3 and L2 caches are sluggards in comparison to the CPU cores. It's hard to design and multi-threadify an algorithm just right so that the right amount of work is done on each data item whilst it is in L1 cache so that the L2, L3 caches, DDR4 and I/O subsystems can deliver the next data item to the L1 caches just in time to keep the CPU cores busy. And the ideal software design for one machine is likely hopeless on another with a different CPU, or SSD, or memory SIMMs. Intel design for good general purpose computer performance, and actually extracting peak performance from a single program is a real challenge. Libraries like Intel's MKL and IPP are very big helps in doing this.
General Guidance
In general one should look at it in terms of data bandwidth required by any particular arrangement of threads and work those threads are doing.
This means benchmarking your program's inner processing loop and measuring how much data it processed and how quickly it managed to do it in, choosing an number of data items that makes sense but much more than the size of L3 cache. A single 'data item' is an amount of input data, the amount of corresponding output data, and any variables used processing the input to the output, the total size of which fits in L1 cache (with some room to spare). And no cheating - use the CPUs SSE/AVX instructions where appropriate, don't forego them by writing plain C or not using something like Intel's IPP/MKL. [Though if one is using IPP/MKL, it kinda does all this for you to the best of its ability.]
These days DDR4 memory is going to be good for anything between 20 to 100GByte/second (depending on what CPU, number of SIMMs, etc), so long as your not making random, scattered accesses to the data. By saturating the L3 your are forcing yourself into being bound by the DDR4 speed. Then you can start changing your code, increasing the work done by each thread on a single data item. Keep increasing the work per item and the speed will eventually start increasing; you've reached the point where you are no longer limited by the speed of DDR4, then L3, then L2.
If after this you can still see ways of increasing the work per data item, then keep going. You eventually get to a data bandwidth somewhere near that of the IO subsystems, and only then will you be getting the absolute most out of the machine.
It's an iterative process, and experience allows one to short cut it.
Of course, if one runs out of ideas for things to increase the work done per data item then that's the end of the design process. More performance can be achieved only by improving the bandwidth of whatever has ended up being the bottleneck (almost certainly the SSD).
For those of us who like doing this software of thing, the PS3's Cell processor was a dream. No need to second guess the cache, there was none. One had complete control over what data and code was where and when it was there.
A lot people will tell you that an HD can't do more than one thing at once. This isn't quite true because modern IO systems have a lot of indirection. Saturating them is difficult to do with one thread.
Here are three scenarios that I have experienced where multi-threading the IO helps.
Sometimes the IO reading library has a non-trivial amount of computation, think about reading compressed videos, or parity checking after the transfer has happened. One example is using robocopy with multiple threads. Its not unusual to launch robocopy with 128 threads!
Many operating systems are designed so that a single process can't saturate the IO, because this would lead to system unresponsiveness. In one case I got a 3% percent read speed improvement because I came closer to saturating the IO. This is doubly true if some system policy exists to stripe the data to different drives, as might be set on a Lustre drive in a HPC cluster. For my application, the optimal number of threads was two.
More complicated IO, like a RAID card, contains a substantial cache that keep the HD head constantly reading and writing. To get optimal throughput you need to be sure that whenever the head is spinning its constantly reading/writing and not just moving. The only way to do this is, in practice, is to saturate the card's on-board RAM.
So, many times you can overlap some minor amount of computation by using multiple threads, and stuff starts getting tricky with larger disk arrays.
Not sure which way to go architecturally, appreciate any advice.
Determining the amount of work per thread is the most common architectural optimization. Write code so that its easy to increase the IO worker count. You're going to need to benchmark.

Loading 128 files in parallel in C++

I have been working on a project and the new requirement is to load 128 files into the physical memory in parallel. All these 128 files reside in the same directory/folder. Is there an algorithm or solution that I can use to solve this problem? I need to code in C++.
The fastest way to load 128 files is sequentially. Parallelism doesn't work, since the disk heads can't exist in multiple places at a time. And even with random-access storage such as an SSD, or the disk's DRAM cache, they still have to cross the bus sequentially.
After you read them they certainly can exist in memory in parallel.
I suggest a for loop for checking file size, allocating memory, and reading each file. The loop will iterate 128 times. As you get each file, you can start data processing in parallel with subsequent reads.
Parallel computation speeds things up because you have a multi-core processor. Overlapped network requests speed things up because there's a long round-trip latency. Parallel disk I/O speeds things up only if you have multiple disks, with the data appropriately split among them. Yours isn't. (And if you use a RAID stripe set, the disk controller will issue parallel reads with no extra work by your application)
If your managers insist "it simply must be read in parallel, there's a requirement", start talking about an array of 128 disks, with a fancy overlay system to make files on 128 disks appear as if they are in the same directory.
The requirement should get more reasonable after that.
Whilst I completely agree with Ben Voigt's answer, if you really fancy still doing this (if nothing else, to prove to your management that it's not worth doing), then the solution is:
Create a list of the 128 files you want to load.
For each item in the list, create a thread, give the thread the name of the file and a location to store the data [that is big enough for the content of the file, or a dynamic storage such as a std::vector].
In each thread, open the given file, read the content into the given storage. Close the file, finish the thread.
Wait for all threads to finish.
I can pretty much guarantee that unless you have some really exotic hardware, even at 4 files in parallel, unless the files are tiny, this solution is slower than a sequential process.

When doing a parallel search, when will memory bandwidth become the limiting factor?

I have some large files (from several gigabytes to hundreds of gigabytes) that I'm searching and trying to find every occurrence of a given string.
I've been looking into making this operate in parallel and have some questions.
How should I be doing this? I can't copy the entire file into memory since its too big. Will multiple FILE* pointers work?
How many threads can I put on the file before the disk bandwidth becomes a limiting factor, rather than the CPU? How can I work around this?
Currently, what I was thinking is I would use 4 threads, task each with a FILE* at either 0%, 25%, 50%, and 75% way through the file, and have each save its results to a file or memory, and then collect the results as a final step. Though with this approach, depending on bandwidth, I could easily add more threads and possibly get a bigger speedup.
What do you think?
EDIT: When I said memory bandwidth, I actually meant disk I/O. Sorry about that.
With this new revised version of the question, the answer is "almost immediately". Hard disks aren't very good at reading from two places on the disk at the same time. :) If you had multiple hard drives and split your file across them, you could probably take advantage of some threading. To be fair, though, I would say that the disk speed is already the limiting factor. I highly doubt that your disk can read data faster than the processor can handle it.
I doubt memory bandwidth will be as big of a problem as disk IO limitations. With most hardware, you're going to be very restricted on how each thread can read from disk -
If you want to maximize throughput, you may need to do something like have one thread who's job is to handle disk IO (most hardware can only stream one chunk from disk at a time, so that'll be a limiting factor). It can then take this and push off chunks of memory to individual threads in some type of thread pool to process.
My guess is that your processing will be fast - probably much faster than the disk IO - but if it's slow, having multiple processing threads could speed up your entire operation.
Multiple FILE* pointers will work - but may actually be slower than just having a single one, since they'll end up time slicing to read the file, and you'll be jumping around on your disk more.
if you are using a SSD drive. you may overcome this problem with parallel searching through the file with multiple file pointers.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away in files, one at a time, which was the slowest option. I figured it gets a lot faster if I build a vector of a certain amount of the files and then write them all at once, then go back to processing while the hard disk is occupied in writing all that stuff that i poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints ? To me it seems to be a hard disk buffer thing, I have 16MB buffer on that hard disk and get these values (all for ~100000 files):
Buffer size time (minutes)
------------------------------
no Buffer ~ 8:30
1 MB ~ 6:15
10 MB ~ 5:45
50 MB ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center) WD 3,5 1TB/7200/16MB/USB2, HFS+ journalled, OS is MacOS 10.5. I'll soon give it a try on Ext3/Linux and internal disk rather than external).
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. OS's (at least windows) is already really good at helping write access via buffering "under the hood", but if your reading in serial there isn't too much it can do to help. If use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased perf.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. 100.000s of files to write is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100.000. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the buffer by X%. If it went slower, adjust by –X%. X can be small.