How to fill thrust vector in a parallel way? [duplicate] - c++

I would like to search for a given string in multiple files in parallel using CUDA. I have planned to use pfac library to search for the given string. The problem with this is how to access multiple files in parallel.
Example: We have a folder containing 1000s of files which has to be searched.
The problem here is how should i access multiple files in the given folder.The files in the folder should be dynamically obtained and each thread should be assigned a file to search the given string.
Is it possible????
Edit:
In this post: very fast text file processing (C++) .He is using the boost library to read a 3 GB text file in 16 seconds.While in my case I have to read 1000s of smaller files
Thank you

Doing your task in CUDA will not help much over doing the same thing in CPU.
Assuming that your files are stored on a standard, magnetic HDD, the typical single-threaded CPU program would consume:
About 5ms to find the sector where the file is stored and put it under the reading head.
About 10ms to load 1MB file (assuming 100MB/s read speed) into RAM memory
Less than 0.1ms to load 1MB data from RAM to CPU cache and process it using a linear search algorithm.
That is 15.1ms for a single file. If you have 1000 files, it will take 15.1s to do the work.
Now, if I give you super-powerful GPU with infinite memory bandwith, no latency, and infinite processor speed, you will be able to perform the task (3) with no time. However, HDD reads will still consume exactly the same time. GPU cannot parallelise the work of another, independent device.
As a result, instead of spending 15.1s, you will now do it in 15.0s.
The infinite GPU would give you a 0.6% speedup. A real GPU would be not even close to that!
In more general case: If you consider using CUDA, ask yourself: is the actual computation the bottleneck of the problem?
If yes - continue searching for possible solutions in the CUDA world.
If no - CUDA cannot help you.
If you deal with thousants of tiny files and you need to perform reads often, consider techniques that can "attack" your bottleneck. Some may include:
RAM buffering
Putting your hard drives in a RAID configuration
Getting an SSD
there may be more options, I am not an expert in that area.

Yes, it's probably possible to get a speed-up using CUDA if you can reduce the impact of read latency/bandwidth. One way would be by performing multiple searches concurrently. I.e. If you can search for [needle1], .. [needle1000] in your large haystack then each thread could search haystack-pieces and store the hits. Some analysis of the throughput required per-comparisons is required to determine whether your search is likely to be improved by employing CUDA. This may be useful http://dl.acm.org/citation.cfm?id=1855600

Related

Is it possible: Using multiple threads to stream output to different files /

I am working on a C++ project to track particles through some known force fields. The code generates huge amount of data in forms of particle's positions and momentum. I am already using openmp directives in the particle tracking routines.However, the overall performance is finally decided by the time taken to write output files. I understand that using threads to write data to a output file is not recommended (and i have tried that). I am curious if there is any way to use multiple threads to write data to multiple files (say I have 4 threads and each thread writes to 4 files at the same time). Can you please suggest me how to proceed? Any tips on how to effectively stream data to files?
Thanks in advance
Here is an interesting article on the topic of concurrent I/O:
http://www.drdobbs.com/parallel/multithreaded-file-io/220300055
There are a few basic roadblocks that limit the gains for this approach. Number one is hardware capability. HDDs and SSDs have limited read and write speeds, and trying to read/write multiple files simultaneously may not yield a substantial increase in speed. In fact on a hard disk trying to do multiple things at once can actually hurt performance in many situations, which can be seen in the benchmarks at the link I provided. It seems that with 2-4 threads a noticeable gain can be achieved in reading, but the specs on writing were disappointing. Multi threading will definitely help speed up serialization of data though.

Loading 128 files in parallel in C++

I have been working on a project and the new requirement is to load 128 files into the physical memory in parallel. All these 128 files reside in the same directory/folder. Is there an algorithm or solution that I can use to solve this problem? I need to code in C++.
The fastest way to load 128 files is sequentially. Parallelism doesn't work, since the disk heads can't exist in multiple places at a time. And even with random-access storage such as an SSD, or the disk's DRAM cache, they still have to cross the bus sequentially.
After you read them they certainly can exist in memory in parallel.
I suggest a for loop for checking file size, allocating memory, and reading each file. The loop will iterate 128 times. As you get each file, you can start data processing in parallel with subsequent reads.
Parallel computation speeds things up because you have a multi-core processor. Overlapped network requests speed things up because there's a long round-trip latency. Parallel disk I/O speeds things up only if you have multiple disks, with the data appropriately split among them. Yours isn't. (And if you use a RAID stripe set, the disk controller will issue parallel reads with no extra work by your application)
If your managers insist "it simply must be read in parallel, there's a requirement", start talking about an array of 128 disks, with a fancy overlay system to make files on 128 disks appear as if they are in the same directory.
The requirement should get more reasonable after that.
Whilst I completely agree with Ben Voigt's answer, if you really fancy still doing this (if nothing else, to prove to your management that it's not worth doing), then the solution is:
Create a list of the 128 files you want to load.
For each item in the list, create a thread, give the thread the name of the file and a location to store the data [that is big enough for the content of the file, or a dynamic storage such as a std::vector].
In each thread, open the given file, read the content into the given storage. Close the file, finish the thread.
Wait for all threads to finish.
I can pretty much guarantee that unless you have some really exotic hardware, even at 4 files in parallel, unless the files are tiny, this solution is slower than a sequential process.

Performance issues with hard disk reading

I have a C++ program which reads files from the hard disk and does some processing on the data in the files. I am using standard Win32 APIs to read the files. My problem is that this program is blazingly fast some times and then suddenly slows down to 1/6th of the previous speed. If I read the same files again and again over multiple runs, then normally the first run will be the slowest one. Then it maintains the speed until I read some other set of files. So my obvious guess was to profile the disk access time. I used perfmon utility and measured the IO Read Bytes/sec for my program. And as expected there was a huge difference (~ 5 times) in the number of bytes read. My questions are:
(1). Does OS (Windows in my case) cache the recently read files somewhere so that the subsequent loads are faster?
(2). If I can guarantee that all the files I read reside in the same directory then is there any way I can place them in the hard disk so that my disk access time is faster?
Is there anything I can do for this?
1) Windows does cache recently read files in memory. The book Windows Internals includes an excellent description of how this works. Modern versions of Windows also use a technology called SuperFetch which will try to preemptively fetch disk contents into memory based on usage history and ReadyBoost which can cache to a flash drive, which allows faster random access. All of these will increase the speed with which data is accessed from disk after the initial run.
2) Directory really doesn't affect layout on disk. Defragmenting your drive will group file data together. Windows Vista on up will automatically defragment your disk. Ideally, you want to do large sequential reads and minimize your writes. Small random accesses and interleaving writes with reads significantly hurts performance. You can use the Windows Performance Toolkit to profile your disk access.
Your numbered questions seem to be answered already. If you're still wondering what you can do to improve hard drive read speed, here are some tips:
Read with the OS functions (e.g., ReadFile) rather than wrapper libraries (like iostreams or stdio) if possible. Many wrappers introduce more levels of buffering.
Read sequentially, and let Windows know you're going to read sequentially with the FILE_FLAG_SEQUENTIAL_SCAN flag.
If you're only going to read (and not write), be sure to open the file just for reading.
Read in chunks, not bytes or characters.
Ideally the chunks should be multiples of the disk's cluster size.
Read from the disc at cluster-aligned offsets.
Read to memory at page-boundaries. (If you're allocating a big chunk, it's probably page aligned.)
Advanced: If you can start your computation after reading just the beginning the file, then you can used overlapped I/O to try to parallelize the computation and the subsequent reads as much as possible.
Yes, Windows (and most modern OS's) keep recently read file data in otherwise unused RAM so that if that file data is requested again in the near future it will already be available in RAM and disk access can be avoided.
As far as making disk access faster, you could try defragmenting your drive, but I wouldn't expect it to help too much. Drive access is just slow compared to RAM access, which is why RAM caching provides such a nice speedup.
As a diagnostic test, can you accurately measure the time it takes to load the very first time?
Then take that to determine the transfer rate. Then you can take that transfer rate and compare that to what you get when running HD Tune. For what it's worth, I ran this myself and got 44.2 MB/s minimum, 87 MB/s average, 110 MB/s max read speeds with my Western Digital RE3 drive (one of the faster 7200 RPM SATA drives available).
The point of all this is to see if your own application is doing the best it can. In other words, aside from caching you can't really read the files any faster than what your hard drive is capable of. So if you're reaching that limit then there's nothing more to do.
Also, make sure that you are not running out of memory during your tests. Run perfmon and monitor Memory > Available Bytes and PhysicalDisk > Disk Read Bytes/sec for the physical drive you are reading. Monitoring process' I/O is a good idea too. Keep in mind that the latter combines all I/O (network included).
You should expect 50 MB/s for sequential reads from a single average SATA drive. A couple of good striped serial SCSI drive will give you about 220 MB/s. If you are seeing available memory going to near zero, that would be your problem. If it stays flat after you did the first round of reading than it has something to do with your app.
A Microsoft utility called contig can be used to defragment a single file on disk or to create a new unfragmented file.
For the crazy answer, you could try formatting the drive such that you place your info on the fastest portion, and see if that helps any.
Tom's Hardware had a review on how that might be done.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away in files, one at a time, which was the slowest option. I figured it gets a lot faster if I build a vector of a certain amount of the files and then write them all at once, then go back to processing while the hard disk is occupied in writing all that stuff that i poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints ? To me it seems to be a hard disk buffer thing, I have 16MB buffer on that hard disk and get these values (all for ~100000 files):
Buffer size time (minutes)
------------------------------
no Buffer ~ 8:30
1 MB ~ 6:15
10 MB ~ 5:45
50 MB ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center) WD 3,5 1TB/7200/16MB/USB2, HFS+ journalled, OS is MacOS 10.5. I'll soon give it a try on Ext3/Linux and internal disk rather than external).
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. OS's (at least windows) is already really good at helping write access via buffering "under the hood", but if your reading in serial there isn't too much it can do to help. If use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased perf.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. 100.000s of files to write is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100.000. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the buffer by X%. If it went slower, adjust by –X%. X can be small.

Many small files or one big file? (Or, Overhead of opening and closing file handles) (C++)

I have created an application that does the following:
Make some calculations, write calculated data to a file - repeat for 500,000 times (over all, write 500,000 files one after the other) - repeat 2 more times (over all, 1.5 mil files were written).
Read data from a file, make some intense calculations with the data from the file - repeat for 1,500,000 iterations (iterate over all the files written in step 1.)
Repeat step 2 for 200 iterations.
Each file is ~212k, so over all i have ~300Gb of data. It looks like the entire process takes ~40 days on a Core 2 Duo CPU with 2.8 Ghz.
My problem is (as you can probably guess) is the time it takes to complete the entire process. All the calculations are serial (each calculation is dependent on the one before), so i can't parallel this process to different CPUs or PCs. I'm trying to think how to make the process more efficient and I'm pretty sure the most of the overhead goes to file system access (duh...). Every time i access a file i open a handle to it and then close it once i finish reading the data.
One of my ideas to improve the run time was to use one big file of 300Gb (or several big files of 50Gb each), and then I would only use one open file handle and simply seek to each relevant data and read it, but I'm not what is the overhead of opening and closing file handles. can someone shed some light on this?
Another idea i had was to try and group the files to bigger ~100Mb files and then i would read 100Mb each time instead of many 212k reads, but this is much more complicated to implement than the idea above.
Anyway, if anyone can give me some advice on this or have any idea how to improve the run time i would appreciate it!
Thanks.
Profiler update:
I ran a profiler on the process, it looks like the calculations take 62% of runtime and the file read takes 34%. Meaning that even if i miraculously cut file i/o costs by a factor of 34, I'm still left with 24 days, which is quite an improvement, but still a long time :)
Opening a file handle isn't probable to be the bottleneck; actual disk IO is. If you can parallelize disk access (by e.g. using multiple disks, faster disks, a RAM disk, ...) you may benefit way more. Also, be sure to have IO not block the application: read from disk, and process while waiting for IO. E.g. with a reader and a processor thread.
Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk? Maybe with another view on the process' dependencies you can rework the data flow and get rid of a lot of IO.
Oh yes, and measure it :)
Each file is ~212k, so over all i have
~300Gb of data. It looks like the
entire process takes ~40 days ...a ll the
calculations are serial (each
calculation is dependent on the one
before), so i can't parallel this
process to different CPUs or PCs. ... pretty
sure the most of the overhead goes to
file system access ... Every
time i access a file i open a handle
to it and then close it once i finish
reading the data.
Writing data 300GB of data serially might take 40 minutes, only a tiny fraction of 40 days. Disk write performance shouldn't be an issue here.
Your idea of opening the file only once is spot-on. Probably closing the file after every operation is causing your processing to block until the disk has completely written out all the data, negating the benefits of disk caching.
My bet is the fastest implementation of this application will use a memory-mapped file, all modern operating systems have this capability. It can end up being the simplest code, too. You'll need a 64-bit processor and operating system, you should not need 300GB of RAM. Map the whole file into address space at one time and just read and write your data with pointers.
From your brief explaination it sounds like xtofl suggestion of threads is the correct way to go. I would recommend you profile your application first though to ensure that the time is divided between IO an cpu.
Then I would consider three threads joined by two queues.
Thread 1 reads files and loads them into ram, then places data/pointers in the queue. If the queue goes over a certain size the thread sleeps, if it goes below a certain size if starts again.
Thread 2 reads the data off the queue and does the calculations then writes the data to the second queue
Thread 3 reads the second queue and writes the data to disk
You could consider merging thread 1 and 3, this might reduce contention on the disk as your app would only do one disk op at a time.
Also how does the operating system handle all the files? Are they all in one directory? What is performance like when you browse the directory (gui filemanager/dir/ls)? If this performance is bad you might be working outside your file systems comfort zone. Although you could only change this on unix, some file systems are optimised for different types of file usage, eg large files, lots of small files etc. You could also consider splitting the files across different directories.
Before making any changes it might be useful to run a profiler trace to figure out where most of the time is spent to make sure you actually optimize the real problem.
What about using SQLite? I think you can get away with a single table.
Using memory mapped files should be investigated as it will reduce the number of system calls.