I have been working on a project and the new requirement is to load 128 files into the physical memory in parallel. All these 128 files reside in the same directory/folder. Is there an algorithm or solution that I can use to solve this problem? I need to code in C++.
The fastest way to load 128 files is sequentially. Parallelism doesn't work, since the disk heads can't exist in multiple places at a time. And even with random-access storage such as an SSD, or the disk's DRAM cache, the data still has to cross the bus sequentially.
After you read them they certainly can exist in memory in parallel.
I suggest a for loop for checking file size, allocating memory, and reading each file. The loop will iterate 128 times. As you get each file, you can start data processing in parallel with subsequent reads.
Parallel computation speeds things up because you have a multi-core processor. Overlapped network requests speed things up because there's a long round-trip latency. Parallel disk I/O speeds things up only if you have multiple disks, with the data appropriately split among them. Yours isn't. (And if you use a RAID stripe set, the disk controller will issue parallel reads with no extra work by your application.)
If your managers insist "it simply must be read in parallel, there's a requirement", start talking about an array of 128 disks, with a fancy overlay system to make files on 128 disks appear as if they are in the same directory.
The requirement should get more reasonable after that.
Whilst I completely agree with Ben Voigt's answer, if you really fancy still doing this (if nothing else, to prove to your management that it's not worth doing), then the solution is:
Create a list of the 128 files you want to load.
For each item in the list, create a thread, and give the thread the name of the file and a location to store the data [either something big enough for the content of the file, or dynamic storage such as a std::vector].
In each thread, open the given file, read the content into the given storage. Close the file, finish the thread.
Wait for all threads to finish.
I can pretty much guarantee that unless you have some really exotic hardware, and unless the files are tiny, this solution is slower than a sequential process even at 4 files in parallel.
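If you do want to try it anyway, here is a minimal sketch of those steps, assuming C++11 threads and that the 128 file names are already known (the names below are placeholders):

    #include <cstddef>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<std::string> names;                            // the 128 file names
        for (int i = 0; i < 128; ++i)
            names.push_back("file" + std::to_string(i) + ".bin");  // placeholder names

        std::vector<std::vector<char>> contents(names.size());
        std::vector<std::thread> threads;

        for (std::size_t i = 0; i < names.size(); ++i) {
            threads.emplace_back([&, i] {
                // Each thread opens its own file and fills its own slot,
                // so no synchronization between threads is needed.
                std::ifstream in(names[i], std::ios::binary);
                contents[i].assign(std::istreambuf_iterator<char>(in),
                                   std::istreambuf_iterator<char>());
            });
        }
        for (auto& t : threads) t.join();   // wait for all threads to finish
    }

Benchmark this against a plain sequential loop on your actual hardware before committing to it.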
Related
I have a big task that needs to read 500 files (50 GB in total).
For every file, I read it out and do some calculation on the data from the file. Just calculation, nothing else. I can ensure the tasks are independent; they only share a singleton object used for reading (I don't think that's the problem).
Currently, I use mmap to get a pointer to the start of each file's content and loop over the data to calculate.
In a single thread, the task takes 30 s;
when I run it in a thread pool (6 threads), it takes 35 s.
My machine has 16 GB of memory and a 2.2 GHz CPU with 8 hardware threads.
I have tried a lot of settings and carefully ensured the independence of the tasks.
I am not so good at hardware: is there a hard limit on IO that caps my speed? Can anyone suggest something I could read up on?
Sorry, the code is too complex; I can't make a valid demo here.
If you want to load the whole file, you can try the MAP_POPULATE flag on mmap to read it ahead, or use madvise.
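A rough sketch of what that could look like, assuming Linux (MAP_POPULATE is Linux-specific) and with error handling kept to a minimum:

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    const char* mapWholeFile(const char* path, std::size_t* lengthOut) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

        // MAP_POPULATE asks the kernel to read the file in up front instead of
        // faulting pages in one by one as the calculation touches them.
        void* p = mmap(nullptr, st.st_size, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);
        close(fd);                       // the mapping stays valid after close
        if (p == MAP_FAILED) return nullptr;

        // Alternatively (or additionally), hint the expected access pattern:
        madvise(p, st.st_size, MADV_SEQUENTIAL);

        *lengthOut = static_cast<std::size_t>(st.st_size);
        return static_cast<const char*>(p);   // caller munmap()s when done
    }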
The most important hardware detail is not mentioned here: whether you read from an SSD or an HDD. I assume you use an SSD; otherwise the thread pool code would be much, much slower.
I don't understand why you use mmap here. There are only three valid reasons to mmap a file: the data structure on disk is complex and you want to poke around in it (which is slow, as it makes read-ahead much less efficient); you need shared memory between processes; or you work on huge files and need the OS to swap data back out to the file when your system comes under memory pressure (databases do it for this single reason alone).
I have an application that loads files and processes data. Let's assume I have like 10...20 files to process.
some requirements, to make the question clearer:
files are small, maybe a few MB max
there might be a dozen files, maybe a hundred
one example might be parsing CSV or JSON data, or loading 3D game models
One idea is to use some thread pool and process files in parallel. Is this efficient? Can my operating system handle file access from multiple threads?
I found this question:
Accessing a single file with multiple threads
But in my application one thread would access its "own" file, so there wouldn't be any collisions.
In my application, I'm using C++/STL, but I'd like to know the general opinion about filesystems on Linux and Windows.
You need to benchmark. (It could probably be worth using several threads in your case; however, the loading should be so quick, even done sequentially, that your average user won't notice.)
In many cases, when you deal with medium-sized files (e.g. less than a dozen megabytes each, or perhaps even half a gigabyte each) which have been accessed recently, these files practically sit in the page cache. So you won't access the disk itself, and your program practically works in RAM (and then multithreading should be effective).
BTW, Linux has readahead(2), posix_fadvise(2), madvise(2) to hint the kernel virtual memory subsystem (that is, to give hints to the page cache).
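For example, a small sketch of hinting the page cache with posix_fadvise(2) before a sequential read (Linux behaviour; error handling kept minimal):

    #include <fcntl.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    std::vector<char> readWithHint(const std::string& path) {
        int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0) return {};

        // Tell the kernel we will read the whole file sequentially, so it can
        // read ahead aggressively and start fetching pages now.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

        std::vector<char> data;
        char buf[1 << 16];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            data.insert(data.end(), buf, buf + n);
        close(fd);
        return data;
    }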
If your common use case is accessing the disk itself (e.g. because the files are quite big, or because you have not accessed them recently before, so they are not in the page cache), then multi-threading won't help, because the bottleneck becomes the hardware disk.
Remember that a disk (even an SSD) is many thousands of times slower than RAM, and it does IO operations sequentially.
Also, you may spend some amount of CPU time parsing the files. If that takes a significant amount of CPU, it is worth running in several independent threads.
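If the parsing does turn out to be CPU-heavy, one simple way to overlap it with the reads is std::async with one task per file. A sketch under that assumption (parseOneFile is a placeholder for whatever CSV/JSON/model parsing you actually do):

    #include <fstream>
    #include <future>
    #include <iterator>
    #include <string>
    #include <vector>

    struct Parsed { /* whatever your parser produces */ };

    // Placeholder: load one file and parse it; the parsing is the CPU-heavy
    // part that benefits from running on several cores.
    Parsed parseOneFile(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::string raw((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
        // ... parse `raw` here ...
        return Parsed{};
    }

    std::vector<Parsed> parseAll(const std::vector<std::string>& paths) {
        std::vector<std::future<Parsed>> futures;
        for (const auto& p : paths)
            futures.push_back(std::async(std::launch::async, parseOneFile, p));

        std::vector<Parsed> results;
        for (auto& f : futures)
            results.push_back(f.get());
        return results;
    }

For a dozen to a hundred small files, one task per file is fine; a real thread pool only becomes interesting at much larger counts.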
In my experience you get more performance if the processing of the data is heavy; in that case you really parallelize the execution of your program. You also need to know how many cores your CPU has: it is not worth having more threads than CPU cores.
If your processing is "light", your threads will probably always be waiting for the disk to finish reading, with little, if any, gain in performance.
I have some large files (from several gigabytes to hundreds of gigabytes) that I'm searching and trying to find every occurrence of a given string.
I've been looking into making this operate in parallel and have some questions.
How should I be doing this? I can't copy the entire file into memory since it's too big. Will multiple FILE* pointers work?
How many threads can I put on the file before the disk bandwidth becomes a limiting factor, rather than the CPU? How can I work around this?
Currently, what I was thinking is I would use 4 threads, task each with a FILE* starting at 0%, 25%, 50%, and 75% of the way through the file, have each save its results to a file or memory, and then collect the results as a final step. Though with this approach, depending on bandwidth, I could easily add more threads and possibly get a bigger speedup.
What do you think?
EDIT: When I said memory bandwidth, I actually meant disk I/O. Sorry about that.
With this new revised version of the question, the answer is "almost immediately". Hard disks aren't very good at reading from two places on the disk at the same time. :) If you had multiple hard drives and split your file across them, you could probably take advantage of some threading. To be fair, though, I would say that the disk speed is already the limiting factor. I highly doubt that your disk can read data faster than the processor can handle it.
I doubt memory bandwidth will be as big a problem as disk IO limitations. With most hardware, you're going to be very restricted in how each thread can read from the disk.
If you want to maximize throughput, you may need to do something like have one thread whose job is to handle disk IO (most hardware can only stream one chunk from disk at a time, so that will be a limiting factor). It can then push chunks of memory off to individual threads in some type of thread pool to process.
My guess is that your processing will be fast - probably much faster than the disk IO - but if it's slow, having multiple processing threads could speed up your entire operation.
Multiple FILE* pointers will work - but may actually be slower than just having a single one, since they'll end up time slicing to read the file, and you'll be jumping around on your disk more.
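A sketch of that single-reader arrangement, under my own assumptions about chunk size, worker count, and the carry-over needed so matches spanning a chunk boundary are still found exactly once:

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <future>
    #include <string>
    #include <vector>

    // Count (possibly overlapping) occurrences of `needle` in one chunk.
    static std::size_t countOccurrences(const std::string& haystack,
                                        const std::string& needle) {
        std::size_t n = 0;
        for (std::size_t pos = haystack.find(needle); pos != std::string::npos;
             pos = haystack.find(needle, pos + 1))
            ++n;
        return n;
    }

    std::size_t countInFile(const std::string& path, const std::string& needle,
                            std::size_t chunkSize = 8 * 1024 * 1024,
                            std::size_t workers = 4) {
        if (needle.empty()) return 0;
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buf(chunkSize);
        std::vector<std::future<std::size_t>> inFlight;
        std::string carry;                       // tail of the previous chunk
        std::size_t total = 0;

        while (in) {
            in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
            std::streamsize got = in.gcount();
            if (got <= 0) break;

            std::string chunk = carry + std::string(buf.data(),
                                                    static_cast<std::size_t>(got));
            // Keep the last needle.size()-1 bytes so a match straddling the
            // boundary is found in the next chunk (and only there).
            std::size_t keep = std::min(chunk.size(), needle.size() - 1);
            carry.assign(chunk, chunk.size() - keep, keep);

            // The reader stays the only thread touching the disk;
            // the searching is farmed out to worker tasks.
            inFlight.push_back(std::async(std::launch::async, countOccurrences,
                                          std::move(chunk), needle));
            if (inFlight.size() >= workers) {    // crude back-pressure
                total += inFlight.front().get();
                inFlight.erase(inFlight.begin());
            }
        }
        for (auto& f : inFlight) total += f.get();
        return total;
    }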
If you are using an SSD drive, you may overcome this problem by searching through the file in parallel with multiple file pointers.
I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away into files, one at a time, which was the slowest option. I figured it gets a lot faster if I build up a vector holding a certain number of files and then write them all at once, then go back to processing while the hard disk is busy writing out all the stuff I poured into it (that at least seems to be what happens).
My question is: can I somehow estimate a convergence value for the amount of data that I should write, given the hardware constraints? To me it seems to be a hard disk buffer thing. I have a 16 MB buffer on that hard disk and got these values (all for ~100,000 files):
Buffer size     Time (minutes)
------------------------------
no buffer       ~8:30
1 MB            ~6:15
10 MB           ~5:45
50 MB           ~7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): a 3.5" WD, 1 TB / 7200 rpm / 16 MB cache / USB 2.0, HFS+ journalled; the OS is Mac OS 10.5. I'll soon give it a try on ext3/Linux and on an internal disk rather than an external one.
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to issue the open, write, and close system calls to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. The OS (at least Windows) is already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't much it can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing 100,000s of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000 separate ones. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the buffer by X%. If it went slower, adjust by –X%. X can be small.
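A sketch of that double-buffer scheme; the class and member names, the use of std::chrono rather than gettimeofday, and the 10% adjustment step are my own choices, not part of the answer. The read/processing thread calls push(), the write helper runs writerLoop():

    #include <algorithm>
    #include <chrono>
    #include <condition_variable>
    #include <cstddef>
    #include <fstream>
    #include <mutex>
    #include <string>
    #include <utility>
    #include <vector>

    struct Output { std::string path, data; };

    class WriteHelper {
    public:
        void push(Output o) {
            std::unique_lock<std::mutex> lk(m_);
            // Block the producer once the buffer being filled reaches the limit.
            space_.wait(lk, [&] { return fill_.size() < limit_; });
            fill_.push_back(std::move(o));
            ready_.notify_one();
        }

        void finish() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            ready_.notify_one();
        }

        void writerLoop() {
            for (;;) {
                std::vector<Output> batch;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    ready_.wait(lk, [&] { return fill_.size() >= limit_ || done_; });
                    if (fill_.empty() && done_) return;
                    batch.swap(fill_);                 // swap the double buffer
                    space_.notify_all();
                }
                auto t0 = std::chrono::steady_clock::now();
                for (const auto& o : batch) {
                    std::ofstream out(o.path, std::ios::binary);
                    out.write(o.data.data(), static_cast<std::streamsize>(o.data.size()));
                }
                double secs = std::chrono::duration<double>(
                                  std::chrono::steady_clock::now() - t0).count();
                double rate = batch.size() / std::max(secs, 1e-9);

                std::lock_guard<std::mutex> lk(m_);
                // Went faster than last time? Raise the limit a little; slower? Lower it.
                limit_ = rate >= lastRate_ ? limit_ + limit_ / 10 + 1
                                           : std::max<std::size_t>(1, limit_ - limit_ / 10);
                lastRate_ = rate;
            }
        }

    private:
        std::mutex m_;
        std::condition_variable ready_, space_;
        std::vector<Output> fill_;     // buffer currently being filled by the reader
        std::size_t limit_ = 64;       // files per batch; adapted at run time
        double lastRate_ = 0.0;
        bool done_ = false;
    };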
I have created an application that does the following:
Make some calculations and write the calculated data to a file; repeat 500,000 times (overall, 500,000 files are written one after the other); repeat 2 more times (overall, 1.5 million files are written).
Read data from a file and make some intense calculations with the data from the file; repeat for 1,500,000 iterations (iterating over all the files written in step 1).
Repeat step 2 for 200 iterations.
Each file is ~212 KB, so overall I have ~300 GB of data. It looks like the entire process takes ~40 days on a 2.8 GHz Core 2 Duo CPU.
My problem is (as you can probably guess) the time it takes to complete the entire process. All the calculations are serial (each calculation is dependent on the one before), so I can't parallelize this process across different CPUs or PCs. I'm trying to think how to make the process more efficient, and I'm pretty sure most of the overhead goes to file system access (duh...). Every time I access a file I open a handle to it and then close it once I finish reading the data.
One of my ideas to improve the run time was to use one big file of 300 GB (or several big files of 50 GB each); then I would only use one open file handle and simply seek to each relevant piece of data and read it, but I'm not sure what the overhead of opening and closing file handles is. Can someone shed some light on this?
Another idea I had was to try to group the files into bigger ~100 MB files and then read 100 MB at a time instead of doing many 212 KB reads, but this is much more complicated to implement than the idea above.
Anyway, if anyone can give me some advice on this or has any idea how to improve the run time, I would appreciate it!
Thanks.
Profiler update:
I ran a profiler on the process; it looks like the calculations take 62% of the runtime and the file reads take 34%. That means even if I miraculously cut the file I/O cost to practically nothing, I'm still left with about 24 days, which is quite an improvement, but still a long time :)
Opening a file handle isn't likely to be the bottleneck; actual disk IO is. If you can parallelize disk access (e.g. by using multiple disks, faster disks, a RAM disk, ...) you may benefit way more. Also, make sure IO doesn't block the application: read from disk and process while waiting for the IO, e.g. with a reader thread and a processor thread.
Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk? Maybe with another view on the process' dependencies you can rework the data flow and get rid of a lot of IO.
Oh yes, and measure it :)
Each file is ~212 KB, so overall I have ~300 GB of data. It looks like the entire process takes ~40 days ... all the calculations are serial (each calculation is dependent on the one before), so I can't parallelize this process across different CPUs or PCs. ... pretty sure most of the overhead goes to file system access ... Every time I access a file I open a handle to it and then close it once I finish reading the data.
Writing 300 GB of data serially might take 40 minutes, only a tiny fraction of 40 days. Disk write performance shouldn't be an issue here.
Your idea of opening the file only once is spot-on. Probably closing the file after every operation is causing your processing to block until the disk has completely written out all the data, negating the benefits of disk caching.
My bet is the fastest implementation of this application will use a memory-mapped file; all modern operating systems have this capability. It can end up being the simplest code, too. You'll need a 64-bit processor and operating system, but you should not need 300 GB of RAM. Map the whole file into address space at one time and just read and write your data with pointers.
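A sketch of that idea on a POSIX system (Windows would use CreateFileMapping/MapViewOfFile instead); the file name, the 212 KB record size, and the flat record layout are placeholders, and error handling is omitted for brevity:

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    constexpr std::size_t kRecordSize = 212 * 1024;   // ~212 KB per record (placeholder)

    int main() {
        int fd = open("all_data.bin", O_RDWR);        // the single big file
        struct stat st;
        fstat(fd, &st);

        // Map the whole file once; this needs a 64-bit address space,
        // not 300 GB of RAM: the OS pages data in and out as needed.
        char* base = static_cast<char*>(
            mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

        std::size_t recordCount = st.st_size / kRecordSize;
        for (std::size_t i = 0; i < recordCount; ++i) {
            char* record = base + i * kRecordSize;    // i-th record: no open/seek/read calls
            // ... read and update the record through the pointer ...
            (void)record;
        }

        munmap(base, st.st_size);
        close(fd);
    }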
From your brief explanation it sounds like xtofl's suggestion of threads is the correct way to go. I would recommend you profile your application first, though, to see how the time is divided between IO and CPU.
Then I would consider three threads joined by two queues.
Thread 1 reads files and loads them into RAM, then places the data/pointers in the queue. If the queue goes over a certain size the thread sleeps; if it goes below a certain size it starts again.
Thread 2 reads the data off the queue, does the calculations, then writes the data to the second queue.
Thread 3 reads the second queue and writes the data to disk.
You could consider merging threads 1 and 3; this might reduce contention on the disk, as your app would only do one disk op at a time.
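A sketch of that three-thread, two-queue arrangement (C++17 for std::optional; the bounded-queue capacity and the placeholder read/calculate/write stages are my assumptions):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

        void push(T item) {
            std::unique_lock<std::mutex> lk(m_);
            notFull_.wait(lk, [&] { return q_.size() < cap_; });   // back-pressure
            q_.push(std::move(item));
            notEmpty_.notify_one();
        }

        std::optional<T> pop() {                 // empty optional == producer finished
            std::unique_lock<std::mutex> lk(m_);
            notEmpty_.wait(lk, [&] { return !q_.empty() || closed_; });
            if (q_.empty()) return std::nullopt;
            T item = std::move(q_.front());
            q_.pop();
            notFull_.notify_one();
            return item;
        }

        void close() {
            std::lock_guard<std::mutex> lk(m_);
            closed_ = true;
            notEmpty_.notify_all();
        }

    private:
        std::mutex m_;
        std::condition_variable notFull_, notEmpty_;
        std::queue<T> q_;
        std::size_t cap_;
        bool closed_ = false;
    };

    struct Record { std::string path; std::string payload; };

    int main() {
        // Placeholder stages; a real program would read, calculate, and write here.
        auto readWholeFile  = [](const std::string& p) { return "data from " + p; };
        auto calculate      = [](const std::string& s) { return s + " (processed)"; };
        auto writeWholeFile = [](const std::string&, const std::string&) { /* write to disk */ };
        std::vector<std::string> inputs = {"a.dat", "b.dat", "c.dat"};

        BoundedQueue<Record> loaded(64);    // thread 1 -> thread 2
        BoundedQueue<Record> results(64);   // thread 2 -> thread 3

        std::thread reader([&] {            // thread 1: reads files into RAM
            for (const auto& path : inputs)
                loaded.push({path, readWholeFile(path)});
            loaded.close();
        });
        std::thread worker([&] {            // thread 2: does the calculations
            while (auto rec = loaded.pop()) {
                rec->payload = calculate(rec->payload);
                results.push(std::move(*rec));
            }
            results.close();
        });
        std::thread writer([&] {            // thread 3: writes results to disk
            while (auto rec = results.pop())
                writeWholeFile(rec->path + ".out", rec->payload);
        });

        reader.join();
        worker.join();
        writer.join();
    }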
Also, how does the operating system handle all the files? Are they all in one directory? What is performance like when you browse the directory (GUI file manager / dir / ls)? If this performance is bad, you might be working outside your file system's comfort zone. Although you can only change this on Unix, some file systems are optimized for different types of file usage, e.g. large files, lots of small files, etc. You could also consider splitting the files across different directories.
Before making any changes it might be useful to run a profiler trace to figure out where most of the time is spent to make sure you actually optimize the real problem.
What about using SQLite? I think you can get away with a single table.
Using memory mapped files should be investigated as it will reduce the number of system calls.