I am working on a C++ project that tracks particles through known force fields. The code generates a huge amount of data in the form of particle positions and momenta. I am already using OpenMP directives in the particle-tracking routines. However, overall performance is ultimately determined by the time taken to write the output files. I understand that using multiple threads to write to a single output file is not recommended (and I have tried that). I am curious whether there is any way to use multiple threads to write to multiple files (say I have 4 threads, each writing to one of 4 files at the same time). Can you suggest how to proceed? Any tips on how to stream data to files effectively?
Thanks in advance
Here is an interesting article on the topic of concurrent I/O:
http://www.drdobbs.com/parallel/multithreaded-file-io/220300055
There are a few basic roadblocks that limit the gains from this approach. The first is hardware capability: HDDs and SSDs have limited read and write speeds, and trying to read or write multiple files simultaneously may not yield a substantial speed-up. In fact, on a hard disk, trying to do several things at once can actually hurt performance in many situations, as the benchmarks at the link above show. With 2-4 threads a noticeable gain can be achieved for reading, but the write results were disappointing. Multithreading will definitely help speed up serialization of the data, though.
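For what it's worth, here is a minimal sketch of the pattern from the question, assuming each OpenMP thread serializes its own batch of particles to its own file (the `Particle` struct and file names are made up for illustration). Since no two threads ever share a stream, no locking is needed; whether it actually beats one sequential writer depends entirely on the drive:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical particle record; replace with your own layout.
struct Particle { double x, y, z, px, py, pz; };

// Each thread writes one batch to its own file, so streams are never shared.
void write_batches(const std::vector<std::vector<Particle>>& batches)
{
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(batches.size()); ++i) {
        const std::string name = "particles_" + std::to_string(i) + ".bin";
        if (std::FILE* f = std::fopen(name.c_str(), "wb")) {
            std::fwrite(batches[i].data(), sizeof(Particle),
                        batches[i].size(), f);
            std::fclose(f);
        }
    }
}
```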
Related
I have an application that loads files and processes data. Let's assume I have like 10...20 files to process.
some requirements, to make the question clearer:
files are small, maybe a few MB max
there might be a dozen files, maybe a hundred
one example might be parsing CSV or JSON data, or loading 3D game models
One idea is to use some thread pool and process files in parallel. Is this efficient? Can my operating system handle file access from multiple threads?
I found this question:
Accessing a single file with multiple threads
But in my application one thread would access its "own" file, so there wouldn't be any collisions.
In my application, I'm using C++/STL, but I'd like to know the general opinion about filesystems on Linux and Windows.
You need to benchmark. (In your case it could probably be worth using several threads; however, the loading should be so quick, even done sequentially, that the average user won't notice.)
In many cases, when you deal with medium-sized files (e.g. less than a dozen megabytes each, or perhaps even half a gigabyte each) that have been accessed recently, those files practically sit in the page cache. So you won't access the disk itself, and your program effectively works in RAM (and then multithreading should be effective).
BTW, Linux has readahead(2), posix_fadvise(2), madvise(2) to hint the kernel virtual memory subsystem (that is, to give hints to the page cache).
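As a sketch of what such a hint looks like on Linux, assuming the file will be read soon and sequentially (error handling omitted):

```cpp
#include <fcntl.h>
#include <unistd.h>

// Ask the kernel to start pulling 'path' into the page cache, and tell it
// the access pattern will be sequential so read-ahead is more aggressive.
void prefetch_hint(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);    // offset 0, len 0 = whole file
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    close(fd);
}
```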
If your common use case is accessing the disk itself (e.g. because the files are quite big, or because you have not accessed them recently before, so they are not in the page cache), then multi-threading won't help, because the bottleneck becomes the hardware disk.
Remember that a disk (even an SSD) is many thousands of times slower than RAM, and it performs I/O operations sequentially.
Also, you may spend some amount of CPU time parsing the files. If that takes a significant amount of CPU, it is worth running in several independent threads.
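A minimal sketch of that idea using std::async, one task per file; the `parse_file` body is a placeholder for your real CSV/JSON/model parser, and a real program would probably cap the number of concurrent tasks with a thread pool:

```cpp
#include <fstream>
#include <future>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for your CSV/JSON/model parser: here it just reads the file.
std::string parse_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();              // parse and return your real data here
}

std::vector<std::string> parse_all(const std::vector<std::string>& paths)
{
    // One task per file; each runs on its own thread with launch::async.
    std::vector<std::future<std::string>> jobs;
    for (const auto& p : paths)
        jobs.push_back(std::async(std::launch::async, parse_file, p));

    std::vector<std::string> results;
    for (auto& j : jobs)
        results.push_back(j.get());
    return results;
}
```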
In my experience you get more benefit when the processing of the data is heavy, because then you really parallelize the execution of your program. You also need to know how many cores your CPU has; it is not worth having more threads than CPU cores.
If your processing is "light", your threads will probably spend most of their time waiting for the disk to complete reads, with little, if any, gain in performance.
I have to read binary data into char arrays from large (2 GB) binary files in a C++ program. When reading the files for the first time from my SSD, reading takes about 6.4 seconds per file. But when running the same code again, or even after running a different dummy program that does almost the same thing beforehand, the next reads take only about 1.4 seconds per file. The Windows Task Manager even shows much less disk activity on the second, third, fourth… run. So my guess is that Windows file caching is sparing me from waiting for data from the SSD when filling the arrays another time.
Is there any clean option to read the files into the file cache before the customer runs the software? Any better option than just loading the files with fread in advance? And how can I make sure the data remains in the file cache until I need it?
Or am I totally wrong with my File Cache assumption? Is there another (better) explanation for these different loading times?
Educated guess here:
You most likely are right with your file cache assumption.
Can you pre-load the files before the user runs the software?
Not directly. How would your program know that it is going to be run in the next few minutes?
So you probably need a helper mechanism or tricks.
The options I see here are:
Indexing mechanisms that provide faster, better-targeted access to your data. This is helpful if you only need small chunks of the data at a time.
Attempt to parallelize the loading of the data; even if it does not actually get faster, the user has the impression it does, because they can already start working with the data they have while the rest is fetched in the background.
Have a helper tool start up with the OS and pre-fetch everything, so the data is already in memory when required (a sketch follows after this answer). Caution: this has serious implications, since you reserve either a large chunk of RAM or even SSD cache (depending on the implementation) for your tool from the start. Only consider doing this if the alternative is the apocalypse…
You can also try to combine the first two options. The key to faster data availability is figuring out what to read in which order, instead of trying to load everything at once en bloc. Divide and conquer.
Without further details on the problem it is impossible to provide more specific solutions though.
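As a rough sketch of the pre-fetching option mentioned above: nothing more than reading each file once from a detached background thread so the bytes end up in the OS page cache (the paths are placeholders, and how long the data stays cached is entirely up to the OS):

```cpp
#include <fstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Read each file once and discard the bytes; the point is only to pull
// the data into the OS page cache before the user asks for it.
void warm_cache(std::vector<std::string> paths)
{
    std::thread([paths = std::move(paths)] {
        std::vector<char> sink(1 << 20);               // 1 MiB scratch buffer
        for (const auto& p : paths) {
            std::ifstream in(p, std::ios::binary);
            while (in.read(sink.data(), sink.size()) || in.gcount() > 0) {}
        }
    }).detach();
}
```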
I would like to search for a given string in multiple files in parallel using CUDA. I have planned to use pfac library to search for the given string. The problem with this is how to access multiple files in parallel.
Example: We have a folder containing 1000s of files which has to be searched.
The problem here is how I should access multiple files in the given folder. The files in the folder should be discovered dynamically, and each thread should be assigned a file in which to search for the given string.
Is it possible?
Edit:
In this post: very fast text file processing (C++), the author uses the Boost library to read a 3 GB text file in 16 seconds, while in my case I have to read thousands of smaller files.
Thank you
Doing your task in CUDA will not help much over doing the same thing on the CPU.
Assuming that your files are stored on a standard magnetic HDD, a typical single-threaded CPU program would spend:
About 5 ms to seek to the sector where the file is stored and bring it under the read head.
About 10 ms to load a 1 MB file into RAM (assuming a 100 MB/s read speed).
Less than 0.1 ms to load 1 MB of data from RAM into the CPU cache and process it with a linear search algorithm.
That is 15.1 ms for a single file. If you have 1000 files, it will take 15.1 s to do the work.
Now, if I give you a super-powerful GPU with infinite memory bandwidth, no latency, and infinite processor speed, you will be able to perform step (3) in no time. However, the HDD reads will still consume exactly the same time. A GPU cannot parallelise the work of another, independent device.
As a result, instead of spending 15.1s, you will now do it in 15.0s.
The infinite GPU would give you a 0.6% speedup. A real GPU would not even come close!
In the more general case: if you are considering using CUDA, ask yourself: is the actual computation the bottleneck of the problem?
If yes - continue searching for possible solutions in the CUDA world.
If no - CUDA cannot help you.
If you deal with thousands of tiny files and you need to perform reads often, consider techniques that can "attack" your bottleneck. Some options include:
RAM buffering
Putting your hard drives in a RAID configuration
Getting an SSD
There may be more options; I am not an expert in that area.
Yes, it's probably possible to get a speed-up with CUDA if you can reduce the impact of read latency/bandwidth. One way would be to perform multiple searches concurrently, i.e. if you can search for [needle1], …, [needle1000] in your large haystack, then each thread could search haystack pieces and store the hits. Some analysis of the throughput required per comparison is needed to determine whether your search is likely to be improved by employing CUDA. This may be useful: http://dl.acm.org/citation.cfm?id=1855600
I had posted about Boost multithreading before. This time I am just curious and disappointed, because I thought multiple threads were supposed to be faster than a single one.
Two threads do file I/O, reading and parsing the CSV data. When I used multiple threads, it took about 40 seconds on average on a Pentium D machine (a Dell OptiPlex 745 desktop).
With a single thread, it took about 8-10 seconds on average on the same PC.
I had tried using completely different parameter names in the two threads, but the result is no different. If anyone has used C++ Boost multithreading for reading and parsing big data files before, I would love to hear your opinions. Thanks.
Andrew
Two threads do file I/O, reading and parsing the CSV data.
If they're reading the same file with the same file handle, then they might be spending most of their time blocked waiting for the other thread to get done. If they're using different file handles to read the same file, they're forcing the disk to keep seeking back and forth, which isn't as efficient an operation as a straight sequential read.
Threading doesn't speed up big file reading and parsing. What it does is let you do something else entirely while the file is being read and parsed.
You've created an I/O bottleneck, which threading does not help with. Threading is intended for reducing CPU bottlenecks when the algorithm can be broken into independent threads of execution; algorithms that have a lot of dependency on previous output (file parsing is one case) generally don't thread well.
If you can split up the parsing problem and have each thread parse a different part of the file, you might get a small improvement, but probably not, since the seeking will waste your time. If you can have one thread doing bulk reading and some preprocessing, then handing off chunks to a thread pool for the real heavy processing (is there any?), then you might see a noticeable improvement over single threading.
This is all general and a bit stream-of-consciousness, but it's hard to do much more with what you're giving. I hope it helps.
Without seeing your code it's hard to say exactly what's going on, but in general, multiple threads don't necessarily get you better performance, and in fact can very often lead to obvious performance degradation.
In your situation, if you are having both threads read and parse, then they could be contending for I/O, and possibly the locks surrounding any shared read/write memory areas, both of which would introduce a slow-down where the single-threaded version would have no issue.
To do this properly, you would probably want a single thread reading from the file and another thread parsing the data as it comes in via a producer/consumer queue. This minimizes lock contention (since it can be implemented with waiters only) and ensures you are actually taking advantage of the parallelism available in your problem.
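A bare-bones sketch of that layout, using a mutex/condition-variable queue between one reader thread and one parser thread; the file name and chunk size are placeholders, and a real parser would have to handle records that straddle chunk boundaries:

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

int main()
{
    std::queue<std::string> chunks;   // raw chunks handed from reader to parser
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread reader([&] {
        std::ifstream in("data.csv", std::ios::binary);    // placeholder name
        std::string buf(1 << 20, '\0');                     // 1 MiB chunks
        while (in.read(&buf[0], buf.size()) || in.gcount() > 0) {
            std::string chunk = buf.substr(0, in.gcount());
            { std::lock_guard<std::mutex> lk(m); chunks.push(std::move(chunk)); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });

    std::thread parser([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !chunks.empty() || done; });
            if (chunks.empty() && done) break;
            std::string chunk = std::move(chunks.front());
            chunks.pop();
            lk.unlock();
            // parse 'chunk' here (CSV rows, etc.)
        }
    });

    reader.join();
    parser.join();
}
```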
That being said, a single-threaded version might still be faster; it's often the case.
I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started by writing the results straight to files, one at a time, which was the slowest option. I figured it would get a lot faster if I built up a certain amount of the files in a vector and then wrote them all at once, then went back to processing while the hard disk is busy writing all the stuff I poured into it (at least that seems to be what happens).
My question is: can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints? To me it seems to be a hard-disk-buffer thing; I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):
Buffer size time (minutes)
------------------------------
no Buffer ~ 8:30
1 MB ~ 6:15
10 MB ~ 5:45
50 MB ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how to optimize writing performance in general, for example whether larger hard-disk blocks are helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): a WD 3.5" 1TB/7200/16MB drive over USB 2, HFS+ journalled, OS is Mac OS X 10.5. I'll soon give it a try on ext3/Linux and on an internal rather than an external disk.
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to issue the open, write, and close system calls to the OS as quickly as possible (a sketch follows below). Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
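To illustrate the first step, here is a hedged sketch of a small writer pool: processing threads enqueue (filename, contents) pairs, and a configurable number of worker threads perform the open/write/close calls. The class name and queue layout are my own invention for illustration, not anything from the original poster's code:

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// A pool of writer threads serving a queue of (filename, contents) jobs,
// so open/write/close never block the processing thread.
class WriterPool {
public:
    explicit WriterPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    void enqueue(std::string name, std::string data) {
        { std::lock_guard<std::mutex> lk(m_);
          jobs_.emplace(std::move(name), std::move(data)); }
        cv_.notify_one();
    }
    ~WriterPool() {                       // drain the queue, then stop
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
private:
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !jobs_.empty() || done_; });
            if (jobs_.empty() && done_) return;
            auto job = std::move(jobs_.front());
            jobs_.pop();
            lk.unlock();
            std::ofstream(job.first, std::ios::binary) << job.second;
        }
    }
    std::queue<std::pair<std::string, std::string>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::vector<std::thread> workers_;
};
```

The number passed to the constructor is the run-time parameter mentioned above, so the same binary can be re-tuned on different hardware.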
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should also look at optimizing your read access. OSes (at least Windows) are already really good at helping write access via buffering "under the hood", but if you are reading serially there isn't too much they can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
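One simple way to overlap reading with processing, sketched with std::async (the chunk size is arbitrary and the processing step is left as a placeholder):

```cpp
#include <cstddef>
#include <fstream>
#include <future>
#include <string>

// Read the next chunk in the background while the current one is processed.
void process_stream(std::ifstream& in, std::size_t chunk = 1 << 20)
{
    auto read_chunk = [&in, chunk]() {
        std::string buf(chunk, '\0');
        in.read(&buf[0], buf.size());
        buf.resize(static_cast<std::size_t>(in.gcount()));
        return buf;
    };

    std::string current = read_chunk();
    while (!current.empty()) {
        auto next = std::async(std::launch::async, read_chunk);
        // ... process 'current' here ...
        current = next.get();          // wait for the background read
    }
}
```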
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing 100,000s of files is not going to be efficient with the normal API.
Test this by first writing sequentially to a single file, not 100,000 files. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you write one contiguous block that you only later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and the write helper(s) consists of one two-std::vector double buffer per helper (one buffer owned by the write side and one by the read side). The read thread fills the buffer until a specified limit and then blocks. The write thread times the write speed with gettimeofday or similar and adjusts the limit: if writing went faster than last time, increase the limit by X%; if it went slower, decrease it by X%. X can be small.
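A rough sketch of that feedback loop, using std::chrono instead of gettimeofday; the starting limit and the 5% step are arbitrary choices:

```cpp
#include <chrono>
#include <cstddef>

// Grow the read-side buffer limit while write throughput keeps improving,
// shrink it as soon as throughput drops.
struct AdaptiveLimit {
    std::size_t limit = 8 << 20;      // start at 8 MiB (arbitrary)
    double last_mb_per_s = 0.0;

    void update(std::size_t bytes_written,
                std::chrono::duration<double> elapsed) {
        const double step = 0.05;     // X = 5%, also arbitrary
        const double mb_per_s = (bytes_written / 1e6) / elapsed.count();
        limit = static_cast<std::size_t>(
            limit * (mb_per_s >= last_mb_per_s ? 1.0 + step : 1.0 - step));
        last_mb_per_s = mb_per_s;
    }
};
```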