Loading a batch of images - thread allocation - C++

So, I have a lot of images to load from disk, and I was wondering how many threads I should allocate to the task to obtain maximum performance.
I am not specifying an OS because my project is cross-platform.
I will work mainly with PNG, i.e. decompression is not slow, but there is some decompression involved.
Also, if I end up creating one thread for each image, is the thread overhead big enough to slow my process down considerably?

Sometimes a producer-consumer architecture is good enough.
Other times what you describe could also work, provided you don't create more threads than the available CPUs can handle (i.e. more threads than #CPUs * 2 usually, though not always, leads to thrashing).
You need to perform some tests to see which model works best for you. Think about where these images come from: are they on disk, and are they in consecutive locations on disk or not? Does it make sense to spawn multiple threads, wait for disk I/O to load a small chunk of one image, then context-switch to another thread and do another seek to get a small chunk of another file, and so on?
I suggest trying a single-threaded application first.
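If you do go multi-threaded, a minimal sketch of capping the worker count at the hardware concurrency (the function name and fallback value below are just illustrative):

    #include <algorithm>
    #include <thread>

    // Pick a worker count bounded by the hardware concurrency.
    // hardware_concurrency() may return 0 if it cannot be determined,
    // so fall back to a small fixed value in that case.
    unsigned pick_worker_count(unsigned pending_images)
    {
        unsigned hw = std::thread::hardware_concurrency();
        if (hw == 0)
            hw = 2;
        // Never spawn more workers than there are images to load.
        return std::min(hw, pending_images);
    }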

One thread per disk seems like a reasonable start. You could make it a runtime tuning parameter to see what works best, especially if there are, or might be, non-local network disks (i.e. high latency), or, as others have suggested, there is any decompression or video processing to be done.
One thread per image is not a good idea, again, as posted by others. You will need some producer-consumer queues to feed the thread(s) with objects that contain an image buffer plus a file spec, and also to return the same objects after the load is done; continually creating/terminating/destroying threads is wasteful, difficult, and prone to disaster.
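Something along these lines is a rough sketch of that layout; LoadJob, JobQueue, and the commented-out decode_png() call are placeholders, not a tuned implementation:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <vector>

    struct LoadJob {
        std::string path;                  // file spec
        std::vector<unsigned char> pixels; // reusable image buffer, filled by a worker
    };

    class JobQueue {
    public:
        void push(LoadJob job) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(job)); }
            cv_.notify_one();
        }
        // Returns false once the queue has been closed and drained.
        bool pop(LoadJob& out) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return closed_ || !q_.empty(); });
            if (q_.empty()) return false;
            out = std::move(q_.front());
            q_.pop();
            return true;
        }
        void close() {
            { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
            cv_.notify_all();
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<LoadJob> q_;
        bool closed_ = false;
    };

    void worker(JobQueue& in, JobQueue& done) {
        LoadJob job;
        while (in.pop(job)) {
            // decode_png(job.path, job.pixels);  // hypothetical decoder call
            done.push(std::move(job));
        }
    }

You would spawn a handful of threads running worker(), push one LoadJob per file into the input queue, call close() once everything is queued, and drain the done queue on the main thread, reusing the buffers instead of creating and destroying threads per image.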

Related

Why can't multi-threading improve an mmap task?

I have a big task that needs to read 500 files (50 GB in total).
For every file, I read it and do some calculation on the data from the file; just calculation, nothing else. I can ensure the tasks are independent; they only share a singleton object used for reading (I don't think that is the problem).
Currently, I use mmap to get a pointer to the start of each file's contents and loop over it to calculate.
Running the task in a single thread costs 30 s; running it in a thread pool with 6 threads costs 35 s.
My machine has 16 GB of memory and a 2.2 GHz CPU with 8 hardware threads.
I have tried a lot of settings and carefully ensured the independence of the tasks.
I am not very knowledgeable about hardware. Is there a hard limit on I/O that caps my speed? Can anyone point me to something I can read?
Sorry, the code is too complex to reduce to a valid demo here.
You can try the MAP_POPULATE flag on mmap to read ahead if you want to load the whole file, or use madvise.
The most important hardware detail here is not mentioned: whether you read from an SSD or an HDD. I assume you use an SSD, otherwise the thread pool code would be much, much slower.
I don't understand why you use mmap here. There are only three valid reasons to mmap a file: the on-disk data structure is complex and you want to poke around in it (which is slow, as it makes read-ahead much less efficient); you need shared memory between processes; or you work on huge files and need the OS to swap data back out to the file when the system comes under memory pressure (databases do it for exactly this reason).
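For illustration, a minimal Linux-only sketch of the MAP_POPULATE / madvise suggestion above (error handling abbreviated, names are mine):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // MAP_POPULATE asks the kernel to fault the whole mapping in up front;
    // MADV_SEQUENTIAL hints that we will scan it once from start to end.
    const char* map_whole_file(const char* path, size_t& len_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
        len_out = static_cast<size_t>(st.st_size);

        void* p = mmap(nullptr, len_out, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);
        close(fd);  // the mapping stays valid after closing the descriptor
        if (p == MAP_FAILED) return nullptr;

        madvise(p, len_out, MADV_SEQUENTIAL);  // read-ahead hint for a linear scan
        return static_cast<const char*>(p);
    }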

What does "Disk Profiling" mean (related to hard disks)? [duplicate]

Currently I am working on an MFC application which reads from and writes to the disk. Sometimes this application runs amazingly fast and sometimes it is damn slow. I am guessing that it is because of the disk access involved, hence I want to profile it. These are some questions in this regard:
(1) Currently I am using the AQTime profiler to profile the application. Has anybody tried profiling disk access with it, or is there another tool available that I can use?
(2) What are the most important disk parameters I should be looking at?
(3) If I have multiple threads reading and writing data from disk, does it affect performance? I.e., am I better off with single-threaded access to the disk?
You can use the Windows Performance Toolkit for this. You can enable trace providers for disk I/O events and see the I/O time and disk service time for each. It does have a bit of a learning curve, though. This will also let you determine which file I/Os actually result in real access to the disk and aren't handled by the cache manager.
The most important parameters are disk service time and queue length. Disk service time is how long the disk actually took to service the request. Queue length indicates whether your disk request is backed up behind other requests.
For many threads with reads and writes: many disks perform poorly when reads compete with background writes. If you have several threads doing lots of disk I/O to random locations on the disk, you may wind up starving certain requests.
To help you with (2):
Try to batch up your writes to disk to avoid many small calls to write. When you're done flushing your buffer, call commit. Commit (a.k.a. fsync) is an expensive operation, and becomes even more so when there are lots of small writes.
On Windows file handles you can experiment with FILE_FLAG_WRITE_THROUGH to increase write speeds. Supposedly commit doesn't have to be called on handles using this flag.
If data you are writing to disk will also be read back, consider writing to an in-memory structure first and having another thread read from that structure and write it to disk. This will help avoid reading data from disk that you have just written.
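For example, a minimal sketch of the batching idea, using POSIX write/fsync for illustration (on Windows the same pattern applies with WriteFile and FlushFileBuffers):

    #include <fcntl.h>
    #include <string>
    #include <unistd.h>

    bool write_batched(const char* path, const std::string& payload)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return false;

        // One large write instead of many small ones.
        ssize_t n = write(fd, payload.data(), payload.size());
        bool ok = (n == static_cast<ssize_t>(payload.size()));

        // Commit once at the end; calling fsync after every small write
        // would be far more expensive, as noted above.
        if (ok && fsync(fd) != 0) ok = false;

        close(fd);
        return ok;
    }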
Hopefully this helps....
What I would do, if you can't pause all threads at the same time and examine their state, is focus on one of them and pause it while it's being "damn slow". This is a little-known but effective technique.
Since it is being extremely slow compared to what it could be, whatever it is waiting for it is waiting for probably 99% of the time, so when you pause it you will see it. That's true whether it's one big wait, or a zillion little ones. Look at the whole call stack. The culprit may be somewhere in the middle of the stack.
If you're not sure, pause it two or three times. The culprit will be on all stack samples.

When doing a parallel search, when will memory bandwidth become the limiting factor?

I have some large files (from several gigabytes to hundreds of gigabytes) that I'm searching and trying to find every occurrence of a given string.
I've been looking into making this operate in parallel and have some questions.
How should I be doing this? I can't copy the entire file into memory since it's too big. Will multiple FILE* pointers work?
How many threads can I put on the file before the disk bandwidth becomes a limiting factor, rather than the CPU? How can I work around this?
Currently, what I was thinking is that I would use 4 threads, task each with a FILE* starting at 0%, 25%, 50%, and 75% of the way through the file, have each save its results to a file or to memory, and then collect the results as a final step. With this approach, depending on bandwidth, I could easily add more threads and possibly get a bigger speedup.
What do you think?
EDIT: When I said memory bandwidth, I actually meant disk I/O. Sorry about that.
With this new revised version of the question, the answer is "almost immediately". Hard disks aren't very good at reading from two places on the disk at the same time. :) If you had multiple hard drives and split your file across them, you could probably take advantage of some threading. To be fair, though, I would say that the disk speed is already the limiting factor. I highly doubt that your disk can read data faster than the processor can handle it.
I doubt memory bandwidth will be as big of a problem as disk I/O limitations. With most hardware, you're going to be very restricted in how each thread can read from disk.
If you want to maximize throughput, you may need to do something like have one thread whose job is to handle disk I/O (most hardware can only stream one chunk from disk at a time, so that will be a limiting factor). It can then push chunks of memory off to individual threads in some type of thread pool to process.
My guess is that your processing will be fast - probably much faster than the disk IO - but if it's slow, having multiple processing threads could speed up your entire operation.
Multiple FILE* pointers will work - but may actually be slower than just having a single one, since they'll end up time slicing to read the file, and you'll be jumping around on your disk more.
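A rough sketch of that single-reader idea, assuming a plain substring search; matches that straddle a chunk boundary are ignored here for brevity (a real version would overlap chunks by needle.size() - 1):

    #include <cstddef>
    #include <cstdio>
    #include <future>
    #include <string>
    #include <vector>

    std::size_t count_in_chunk(const std::string& chunk, const std::string& needle)
    {
        std::size_t count = 0;
        for (std::size_t pos = chunk.find(needle); pos != std::string::npos;
             pos = chunk.find(needle, pos + 1))
            ++count;
        return count;
    }

    std::size_t count_occurrences(const char* path, const std::string& needle,
                                  std::size_t workers = 4)
    {
        const std::size_t chunk_size = 4 * 1024 * 1024;  // 4 MB per chunk
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return 0;

        std::size_t total = 0;
        bool done = false;
        while (!done) {
            // The single reader fills up to `workers` chunks per round; each
            // chunk is searched by an async task while the next is being read.
            std::vector<std::future<std::size_t>> batch;
            for (std::size_t i = 0; i < workers; ++i) {
                std::string chunk(chunk_size, '\0');
                std::size_t n = std::fread(&chunk[0], 1, chunk_size, f);
                if (n == 0) { done = true; break; }
                chunk.resize(n);
                batch.push_back(std::async(std::launch::async, count_in_chunk,
                                           std::move(chunk), needle));
            }
            for (auto& fut : batch) total += fut.get();
        }
        std::fclose(f);
        return total;
    }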
If you are using an SSD, you may be able to overcome this problem by searching the file in parallel with multiple file pointers.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from one large file on a hard disk (the processing is fast and doesn't have much overhead) and then have to write the results back as hundreds of thousands of files.
I started by writing the results straight away, one file at a time, which was the slowest option. I figured it gets a lot faster if I build up a certain number of files in a vector and then write them all at once, then go back to processing while the hard disk is occupied writing all the stuff I poured into it (that at least seems to be what happens).
My question is: can I somehow estimate a convergence value for the amount of data I should write, based on the hardware constraints? To me it seems to be a hard disk buffer thing; I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):
Buffer size     Time (minutes)
------------------------------
no buffer       ~ 8:30
1 MB            ~ 6:15
10 MB           ~ 5:45
50 MB           ~ 7:00
Or is this just a coincidence?
I would also be interested in experience / rules of thumb about how write performance should be optimized in general; for example, are larger hard disk blocks helpful, etc.?
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): WD 3.5" 1TB/7200/16MB/USB2, HFS+ journalled, OS is Mac OS 10.5. I'll soon give it a try on ext3/Linux and on an internal rather than an external disk.
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
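A sketch of that measure-and-pick approach; run_workload is your own write job parameterised by batch size, and candidates must not be empty:

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <vector>

    std::size_t find_best_batch_size(const std::function<void(std::size_t)>& run_workload,
                                     const std::vector<std::size_t>& candidates)
    {
        std::size_t best = candidates.front();
        auto best_time = std::chrono::steady_clock::duration::max();
        for (std::size_t batch : candidates) {
            auto start = std::chrono::steady_clock::now();
            run_workload(batch);  // e.g. write all files with this buffer size
            auto elapsed = std::chrono::steady_clock::now() - start;
            if (elapsed < best_time) { best_time = elapsed; best = batch; }
        }
        return best;  // persist this (e.g. in a config file) and reuse it next run
    }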
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. The OS (at least Windows) is already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't much it can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing 100,000s of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000 files. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
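A rough sketch of the single-file-plus-index alternative; PackedWriter and IndexEntry are illustrative names, and the index itself would still need to be persisted somewhere:

    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>

    struct IndexEntry { std::uint64_t offset; std::uint64_t length; };

    class PackedWriter {
    public:
        explicit PackedWriter(const std::string& data_path)
            : out_(data_path, std::ios::binary), offset_(0) {}

        // Append one logical "file" to the data file and record where it went.
        void add(const std::string& name, const std::string& bytes) {
            out_.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
            index_[name] = { offset_, bytes.size() };
            offset_ += bytes.size();
        }

        const std::map<std::string, IndexEntry>& index() const { return index_; }

    private:
        std::ofstream out_;
        std::uint64_t offset_;
        std::map<std::string, IndexEntry> index_;
    };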
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double buffer per helper (one buffer owned by the write thread and one by the read thread). The read thread fills the buffer until a specified limit, then blocks. The write thread times the write speed with gettimeofday or whatever and adjusts the limit: if writing went faster than last time, increase the buffer by X%; if it went slower, adjust by -X%. X can be small.
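A very reduced sketch of that tuning step; the swap and the synchronisation between the two threads are omitted, and the 5% step is just an example value for X:

    #include <chrono>
    #include <cstddef>
    #include <vector>

    struct AdaptiveDoubleBuffer {
        std::vector<char> filling;   // filled by the read thread up to `limit`
        std::vector<char> writing;   // drained to disk by the write helper thread
        std::size_t limit = 1 << 20; // initial fill limit: 1 MB
        double last_speed = 0.0;     // bytes per second achieved by the last flush

        // Called by the write helper after the two vectors have been swapped.
        template <class WriteFn>
        void flush_and_tune(WriteFn write_to_disk) {
            auto start = std::chrono::steady_clock::now();
            write_to_disk(writing.data(), writing.size());
            std::chrono::duration<double> secs =
                std::chrono::steady_clock::now() - start;

            double speed = secs.count() > 0.0 ? writing.size() / secs.count()
                                              : last_speed;
            // Grow the fill limit by 5% if writing got faster than last time,
            // shrink it by 5% otherwise ("X can be small").
            if (speed > last_speed)
                limit += limit / 20;
            else if (limit > limit / 20)
                limit -= limit / 20;
            last_speed = speed;
            writing.clear();
        }
    };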
