Thread IDs with PPL and Parallel Memory Allocation

Thread IDs with PPL and Parallel Memory Allocation - c++

I have a question about the Microsoft PPL library, and parallel programming in general. I am using FFTW to perform a large set (100,000) of 64 x 64 x 64 FFTs and inverse FFTs. In my current implementation, I use a parallel for loop and allocate the storage arrays within the loop. I have noticed that my CPU usage only tops out at about 60-70% in these cases. (Note this is still better utilization than the built in threaded FFTs provided by FFTW which I have tested). Since I am using fftw_malloc, is it possible that excessive locking is occurring which is preventing full usage?
In light of this, is it advisable to preallocate the storage arrays for each thread before the main processing loop, so no locks are required within the loop itself? And if so, how is this possible with the MSFT PPL library? I have been using OpenMP before, in that case it is simple enough to get a thread ID using supplied functions. I have not however seen a similar function in the PPL documentation.

I am just answering this because nobody has posted anything yet.
Mutex(e)s can wreak havoc on performance if heavy locking is required. In addition if a lot of memory (re)-allocation is needed, that can also decrease performance and limit it to your memory bandwidth. Like you said a preallocation which later threads operate on can be usefull. However this requires that you have a fixed threadcount and that you spread your workload balanced on all threads.
Concerning the PPL thread_id functions, I can only speak about Intel-TBB, which however should be pretty similiar to PPL. TBB - and I suppose also PPL - is not speaking of threads directly, instead they are talking about tasks, the aim of TBB was to abstract these underlaying details away from the user, thus it does not provide a thread_id function.

Using PPL I have had good performance with an application that does a lot of allocations by using a Concurrency::combinable to hold a structure containing memory allocated per thread.
In fact you don't have to pre-allocate you can check the value of your combinable variable with ->local() and allocate it if it is null. Next time this thread is called it will already be allocated.
Of course you have to free the memory when all task are done which can be done using:
with something like:
combine_each([](MyPtr* p){ delete p; });

Related

OpenMP and memory limitation

I have a c++ application that uses openmp parallel construct.
The method inside the for loop uses a lot of memory. It allocates memory at start and release them at the end.
If the system has enough memory, it works well, but if there is not enough memory, the operation fails.
The target system may have enough memory so only 2 thread can be run in parallel or maybe 3 thread can be run in parallel.
Is there any way to configure openmp so it does know how many thread it should use based on available memory?
If OpenMP can no do this, is there any way that I can do this by myself?

OpenMP is very dumb when it comes to monitoring memory usage and you would have to implement it by yourself. A good strategy would be to obtain the amount of usable memory and then to divide it by the memory requirement per thread in order to get the upper limit of the number of threads that can process data concurrently. Once you know that number, you could force the parallel region to run with that many threads using the num_threads clause:
int max_threads = mem_size / mem_per_thread;
#pragma omp parallel for num_threads(max_threads)
for (...)
{
}
Now the hard question is how to obtain the amount of usable memory, especially given that virtually all modern operating systems implement virtual memory. One solution would be to leave that to the end user, e.g. provide a parameter in your program's configuration that the user can set to a specific value that he deems reasonable. Another strategy could be to set the value to a given % of the physical memory size.

Multithreaded mex code slower than single threaded

I am writing mex code in MATLAB to do and operation (because the operation uses a library in c++). The mex code has a section where there is a function that is repeatedly called in a loop with a different argument value, and each function call is independent (i.e., computation of 1 call does not depend on previous calls). So, to speed this up I wrote multithreaded code that creates multiple threads - the exact number of threads is equal to the number of loop iterations, in my example this value is 10. Each thread computes the function in the loop for a separate value of the argument, the threads return and join, some more computation is done and a result is returned.
All this in theory should give me good speedup, but I see that the multithreaded code is a lot slower than the normal single threaded one!! I have access to very powerful 24 core machines, so this is totally baffling, because I'd expected each thread to be scheduled on a separate core.
Any ideas to what is leading to this? Any common problems/errors in code that lead to this?
Any help will be greatly appreciated.
EDIT:
To answer many doubts raised in solutions proposed by people here, I want to share some information about my code:
1. Each function call takes a few minutes, so synchronization and spawning of threads should not be an overhead here (though if there are any mitigating circumstances in this case, any info about that would be really helpful!)
Each thread does access common data structures, arrays, matrices but the values in these are not overwritten at all. All writes to variables are done to variables, pointers, arrays, etc that are local to the thread. So, I am guessing there shouldn't be many cache misses here?
Also there are no mutex sections in my code, since no thread write to any common memory location. All writes are to memory locations local to the thread.
I'm still trying to figure out the reason why my multithreaded implementation is not working :( So, any pointers/info will be really helpful!
Thanks!!

Given how general your question is, the general answer is that there are probably two effects in play:
There is large overhead involved starting and stopping threads (and synchronizing them), and the computation scaling is not enough to overcome the overhead. The total times per function call will shed some light on this issue.
Threads can compete with each other and slow down the aggregate performance. A common mechanism is "cache thrashing". Since multiple cores share the same memory controller and parts of the cache hiearchy, one thread can fill the cache with the information it needs, only to have some of that data evicted by the needs of a different thread, causing more trips to main memory. Since main memory access is so expensive, the end result is a slowdown.
I would test the job with varying numbers of threads. It may turn out, for instance, that using two threads is advantageous, but four or more is not. For more detailed answers, add more details to the question, such as type of computation, size of dataset, etc.

You didn't describe what your code does, so this is just guesswork.
Multithreading is not a miracle cure. There are a lot of ways that multithreading what was a single threaded chunk of code can be slower than the original. There's a good deal of overhead involved in spawning, synchronizing, joining, and destroying threads.
Suppose the task at hand was to add ten pairs of numbers. If you make this multithreaded by spawning a thread for each addition and then joining and destroying when the calculation is finished, your multithreaded version will be much, much slower than the original. Threading is not intended for very short duration calculations. The costs of spawning, joining, and destroying are going to overwhelm any speedup you gain by performing those simple tasks in parallel.
Another way to make things slower is to establish barriers the prevent parallel operations. A mutex, for example, to protect against multiple writers simultaneously accessing the same object. That protected code needs to be small. Make the entire bodies of your thread operate under the guise of a mutex and you have the equivalent of a single threaded application that has a whole bunch of threading overhead added in.
Those barriers that preclude parallel execution might be present even if you didn't put them in place. Some of those barriers are in the C standard library. POSIX mandates that most library functions be thread safe. The standard only lists the functions that don't have to be thread safe. If you use library functions in those computations, you might be better of staying single threaded because your code essentially is single threaded.

I do not think your problems are mex specific at all - this sounds like usual performance problems while programing multi-threaded code for SMPs.
To add a little to the already mentioned potential problems:
False cache line sharing: you might think that your threads work independently, while in fact they access different data within the same cache line. Trivial example:
/* global variable accessible by all threads */
int thread_data[nthreads];
/* inside thread function */
thread_data[thrid] = some_value;
inefficient memory bandwidth utilization. On NUMA systems you want the CPUs to access their own data banks. If you do not correctly distribute the data, the CPUs ask for memory from other CPUs. That implies communication, which you do not suspect is there.
thread affinity. Somewhat connected to the point above. You want your threads to be bound to their own CPUs for the entire duration of the computations. Otherwise they might be migrated by the OS, which causes overhead, and they might be moved further away from the memory bank they will access.

Can i allocate memory faster by using multiple threads?

If i make a loop that reserves 1kb integer arrays, int[1024], and i want it to allocate 10000 arrays, can i make it faster by running the memory allocations from multiple threads?
I want them to be in the heap.
Let's assume that i have a multi-core processor for the job.
I already did try this, but it decreased the performance. I'm just wondering, did I just make bad code or is there something that i didn't know about memory allocation?
Does the answer depend on the OS? please tell me how it works on different platforms if so.
Edit:
The integer array allocation loop was just a simplified example. Don't bother telling me how I can improve that.

It depends on many things, but primarily:
the OS
the implementation of malloc you are using
The OS is responsible for allocating the "virtual memory" that your process has access to and builds a translation table that maps the virtual memory back to actual memory addresses.
Now, the default implementation of malloc is generally conservative, and will simply have a giant lock around all this. This means that requests are processed serially, and the only thing that allocating from multiple threads instead of one does is slowing down the whole thing.
There are more clever allocation schemes, generally based upon pools, and they can be found in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high-performance in multi-threaded applications.
There is no silver bullet though, and at one point the OS must perform the virtual <=> real translation which requires some form of locking.
Your best bet is to allocate by arenas:
Allocate big chunks (arenas) at once
Split them up in arrays of the appropriate size
There is no need to parallelize the arena allocation, and you'll be better off asking for the biggest arenas you can (do bear in mind that allocation requests for a too large amount may fail), then you can parallelize the split.
tcmalloc and jemalloc may help a bit, however they are not designed for big allocations (which is unusual) and I do not know if it is possible to configure the size of the arenas they request.

The answer depends on the memory allocations routine, which are a combination of a C++ library layer operator new, probably wrapped around libC malloc(), which in turn occasionally calls an OS function such as sbreak(). The implementation and performance characteristics of all of these is unspecified, and may vary from compiler version to version, with compiler flags, different OS versions, different OSes etc.. If profiling shows it's slower, then that's the bottom line. You can try varying the number of threads, but what's probably happening is that the threads are all trying to obtain the same lock in order to modify the heap... the overheads involved with saying "ok, thread X gets the go ahead next" and "thread X here, I'm done" are simply wasting time. Another C++ environment might end up using atomic operations to avoid locking, which might or might not prove faster... no general rule.
If you want to complete faster, consider allocating one array of 10000*1024 ints, then using different parts of it (e.g. [0]..[1023], [1024]..[2047]...).

I think that perhaps you need to adjust your expectation from multi-threading.
The main advantage of multi-threading is that you can do tasks asynchronously, i.e. in parallel. In your case, when your main thread needs more memory it does not matter whether it is allocated by another thread - you still need to stop and wait for allocation to be accomplished, so there is no parallelism here. In addition, there is an overhead of a thread signaling when it is done and the other waiting for completion, which just can degrade the performance. Also, if you start a thread each time you need allocation this is a huge overhead. If not, you need some mechanism to pass the allocation request and response between threads, a kind of task queue which again is an overhead without gain.
Another approach could be that the allocating thread runs ahead and pre-allocates the memory that you will need. This can give you a real gain, but if you are doing pre-allocation, you might as well do it in the main thread which will be simpler. E.g. allocate 10M in one shot (or 10 times 1M, or as much contiguous memory as you can have) and have an array of 10,000 pointers pointing to it at 1024 offsets, representing your arrays. If you don't need to deallocate them independently of one another this seems to be much simpler and could be even more efficient than using multi-threading.

As for glibc it has arena's (see here), which has lock per arena.
You may also consider tcmalloc by google (stands for Thread-Caching malloc), which shows 30% boost performance for threaded application. We use it in our project. In debug mode it even can discover some incorrect usage of memory (e.g. new/free mismatch)

As far as I know all os have implicit mutex lock inside the dynamic allocating system call (malloc...). If you think a moment about that, if you do not lock this action you could run into terrible problems.
You could use the multithreading api threading building blocks http://threadingbuildingblocks.org/
which has a multithreading friendly scalable allocator.
But I think a better idea should be to allocate the whole memory once(should work quite fast) and split it up on your own. I think the tbb allocator does something similar.
Do something like
new int[1024*10000] and than assign the parts of 1024ints to your pointer array or what ever you use.
Do you understand?

Because the heap is shared per-process the heap will be locked for each allocation, so it can only be accessed serially by each thread. This could explain the decrease of performance when you do alloc from multiple threads like you are doing.

If the arrays belong together and will only be freed as a whole, you can just allocate an array of 10000*1024 ints, and then make your individual arrays point into it. Just remember that you cannot delete the small arrays, only the whole.
int *all_arrays = new int[1024 * 10000];
int *small_array123 = all_arrays + 1024 * 123;
Like this, you have small arrays when you replace the 123 with a number between 0 and 9999.

The answer depends on the operating system and runtime used, but in most cases, you cannot.
Generally, you will have two versions of the runtime: a multi-threaded version and a single-threaded version.
The single-threaded version is not thread-safe. Allocations made by two threads at the same time can blow your application up.
The multi-threaded version is thread-safe. However, as far as allocations go on most common implementations, this just means that calls to malloc are wrapped in a mutex. Only one thread can ever be in the malloc function at any given time, so attempting to speed up allocations with multiple threads will just result in a lock convoy.
It may be possible that there are operating systems that can safely handle parallel allocations within the same process, using minimal locking, which would allow you to decrease time spent allocating. Unfortunately, I don't know of any.

Can multithreading speed up memory allocation?

I'm working with an 8 core processor, and am using Boost threads to run a large program.
Logically, the program can be split into groups, where each group is run by a thread.
Inside each group, some classes invoke the 'new' operator a total of 10000 times.
Rational Quantify shows that the 'new' memory allocation is taking up the maximum processing time when the program runs, and is slowing down the entire program.
One way I can speed up the system could be to use threads inside each 'group', so that the 10000 memory allocations can happen in parallel.
I'm unclear of how the memory allocation will be managed here. Will the OS scheduler really be able to allocate memory in parallel?

Standard CRT
While with older of Visual Studio the default CRT allocator was blocking, this is no longer true at least for Visual Studio 2010 and newer, which calls corresponding OS functions directly. The Windows heap manager was blocking until Widows XP, in XP the optional Low Fragmentation Heap is not blocking, while the default one is, and newer OSes (Vista/Win7) use LFH by default. The performance of recent (Windows 7) allocators is very good, comparable to scalable replacements listed below (you still might prefer them if targeting older platforms or when you need some other features they provide). There exist several multiple "scalable allocators", with different licenses and different drawbacks. I think on Linux the default runtime library already uses a scalable allocator (some variant of PTMalloc).
Scalable replacements
I know about:
HOARD (GNU + commercial licenses)
MicroQuill SmartHeap for SMP (commercial license)
Google Perf Tools TCMalloc (BSD license)
NedMalloc (BSD license)
JemAlloc (BSD license)
PTMalloc (GNU, no Windows port yet?)
Intel Thread Building Blocks (GNU, commercial)
You might want to check Scalable memory allocator experiences for my experiences with trying to use some of them in a Windows project.
In practice most of them work by having a per thread cache and per thread pre-allocated regions for allocations, which means that small allocations most often happen inside of a context of thread only, OS services are called only infrequently.

Dynamic allocation of memory uses the heap of the application/module/process (but not thread). The heap can only handle one allocation request at a time. If you try to allocate memory in "parallel" threads, they will be handled in due order by the heap. You will not get a behaviour like: one thread is waiting to get its memory while another can ask for some, while a third one is getting some. The threads will have to line-up in queue to get their chunk of memory.
What you would need is a pool of heaps. Use whichever heap is not busy at the moment to allocate the memory. But then, you have to watch out throughout the life of this variable such that it does not get de-allocated on another heap (that would cause a crash).
I know that Win32 API has functions such as GetProcessHeap(), CreateHeap(), HeapAlloc() and HeapFree(), that allow you to create a new heap and allocate/deallocate memory from a specific heap HANDLE. I don't know of an equivalence in other operating systems (I have looked for them, but to no avail).
You should, of course, try to avoid doing frequent dynamic allocations. But if you can't, you might consider (for portability) to create your own "heap" class (doesn't have to be a heap per se, just a very efficient allocator) that can manage a large chunk of memory and surely a smart pointer class that would hold a reference to the heap from which it came. This would enable you to use multiple heaps (make sure they are thread-safe).

There are 2 scalable drop-in replacements for malloc that I know of:
Google's tcmalloc
Facebook's jemalloc (link to a performance study comparing to tcmalloc)
I don't have any experience with Hoard (which performed poorly in the study), but Emery Berger lurks on this site and was astonished by the results. He said he would have a look and I surmise there might have been some specifics to either the test or implementation that "trapped" Hoard as the general feedback is usually good.
One word of caution with jemalloc, it can waste a bit of space when you rapidly create then discard threads (as it creates a new pool for each thread you allocate from). If your threads are stable, there should not be any issue with this.

I believe the short answer to your question is : yes, probably. And as already pointed out by several people here there are ways to achieve this.
Aside from your question and the answers already posted here, it would be good to start with your expectations on improvements, because that will pretty much tell which path to take. Maybe you need to be 100x faster. Also, do you see yourself doing speed improvements in the near future as well or is there a level which will be good enough? Not knowing your application or problem domain it's difficult to also advice you specifically. Are you for instance in a problem domain where speed continuously have to be improved?
One good thing to start off with when doing performance improvements is to question if you need to do things the way you currently do it? In this case, can you pre-allocate objects? Is there a maximum number of X objects in the system? Could you re-use objects? All of this is better, because you don't necessarily need to do allocations on the critical path. E.g. if you can re-use objects, a custom allocator with pre-allocated objects would work well. Also, what OS are you on?
If you don't have concrete expectations or a certain level of performance, just start experimenting with any of the advices here and you'll find out more.
Good luck!

Roll your own non-multi-threaded new memory allocator a distinct copy of which each thread has.
(you can override new and delete)
So it's allocating in large chunks that it works through and doesn't need any locking as each is owned by a single thread.
limit your threads to the number of cores you have available.

new is pretty much blocking, it has to find the next free bit of memory which is tricky to do if you have lots of threads all asking for that at once.
Memory allocation is slow - if you are doing it more than a few times, especially on lots of threads then you need a redesign. Can you pre-allocate enough space at the start, can you just allocate a big chunk with 'new' and then partition it out yourself?

You need to check your compiler documentation whether it makes the allocator thread safe or not. If it does not, then you will need to overload your new operator and make it thread safe.
Else it will result in either a segfault or UB.

On some platforms like Windows, access to the global heap is serialized by the OS. Having a thread-separate heap could substantially improve allocation times.
Of course, in this case, it might be worth questioning whether or not you genuinely need heap allocation as opposed to some other form of dynamic allocation.

You may want to take a look at The Hoard Memory Allocator: "is a drop-in replacement for malloc() that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors."

The best what you can try to reach ~8 memory allocation in parallel (since you have 8 physical cores), not 10000 as you wrote
standard malloc uses mutex and standard STL allocator does the same. Therefore it will not speed up automatically when you introduce threading.
Nevertheless, you can use another malloc library (google for e.g. "ptmalloc") which does not use global locking. if you allocate using STL (e.g. allocate strings, vectors) you have to write your own allocator.
Rather interesting article: http://developers.sun.com/solaris/articles/multiproc/multiproc.html

Multithreaded image processing in C++

I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur). This is done on a per-pixel basis.
Such image mapulations are very stressful on the CPU. I would like to use multithreading to speed things up. How would I do this? I was thinking of creating one thread per row of pixels.
I have several requirements:
Executable size must be minimized. In other words, I can't use massive libraries. What's the most light-weight, portable threading library for C/C++?
Executable size must be minimized. I was thinking of having a function forEachRow(fp* ) which runs a thread for each row, or even a forEachPixel(fp* ) where fp operates on a single pixel in its own thread. Which is best?
Should I use normal functions or functors or functionoids or some lambda functions or ... something else?
Some operations use optimizations which require information from the previous pixel processed. This makes forEachRow favorable. Would using forEachPixel be better even considering this?
Would I need to lock my read-only and write-only arrays?
The input is only read from, but many operations require input from more than one pixel in the array.
The ouput is only written once per pixel.
Speed is also important (of course), but optimize executable size takes precedence.
Thanks.
More information on this topic for the curious: C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks

Don't embark on threading lightly! The race conditions can be a major pain in the arse to figure out. Especially if you don't have a lot of experience with threads! (You've been warned: Here be dragons! Big hairy non-deterministic impossible-to-reliably-reproduce dragons!)
Do you know what deadlock is? How about Livelock?
That said...
As ckarmann and others have already suggested: Use a work-queue model. One thread per CPU core. Break the work up into N chunks. Make the chunks reasonably large, like many rows. As each thread becomes free, it snags the next work chunk off the queue.
In the simplest IDEAL version, you have N cores, N threads, and N subparts of the problem with each thread knowing from the start exactly what it's going to do.
But that doesn't usually happen in practice due to the overhead of starting/stopping threads. You really want the threads to already be spawned and waiting for action. (E.g. Through a semaphore.)
The work-queue model itself is quite powerful. It lets you parallelize things like quick-sort, which normally doesn't parallelize across N threads/cores gracefully.
More threads than cores? You're just wasting overhead. Each thread has overhead. Even at #threads=#cores, you will never achieve a perfect Nx speedup factor.
One thread per row would be very inefficient! One thread per pixel? I don't even want to think about it. (That per-pixel approach makes a lot more sense when playing with vectorized processor units like they had on the old Crays. But not with threads!)
Libraries? What's your platform? Under Unix/Linux/g++ I'd suggest pthreads & semaphores. (Pthreads is also available under windows with a microsoft compatibility layer. But, uhgg. I don't really trust it! Cygwin might be a better choice there.)
Under Unix/Linux, man:
* pthread_create, pthread_detach.
* pthread_mutexattr_init, pthread_mutexattr_settype, pthread_mutex_init,
* pthread_mutexattr_destroy, pthread_mutex_destroy, pthread_mutex_lock,
* pthread_mutex_trylock, pthread_mutex_unlock, pthread_mutex_timedlock.
* sem_init, sem_destroy, sem_post, sem_wait, sem_trywait, sem_timedwait.
Some folks like pthreads' condition variables. But I always preferred POSIX 1003.1b semaphores. They handle the situation where you want to signal another thread BEFORE it starts waiting somewhat better. Or where another thread is signaled multiple times.
Oh, and do yourself a favor: Wrap your thread/mutex/semaphore pthread calls into a couple of C++ classes. That will simplify matters a lot!
Would I need to lock my read-only and write-only arrays?
It depends on your precise hardware & software. Usually read-only arrays can be freely shared between threads. But there are cases where that is not so.
Writing is much the same. Usually, as long as only one thread is writing to each particular memory spot, you are ok. But there are cases where that is not so!
Writing is more troublesome than reading as you can get into these weird fencepost situations. Memory is often written as words not bytes. When one thread writes part of the word, and another writes a different part, depending on the exact timing of which thread does what when (e.g. nondeterministic), you can get some very unpredictable results!
I'd play it safe: Give each thread its own copy of the read and write areas. After they are done, copy the data back. All under mutex, of course.
Unless you are talking about gigabytes of data, memory blits are very fast. That couple of microseconds of performance time just isn't worth the debugging nightmare.
If you were to share one common data area between threads using mutexes, the collision/waiting mutex inefficiencies would pile up and devastate your efficiency!
Look, clean data boundaries are the essence of good multi-threaded code. When your boundaries aren't clear, that's when you get into trouble.
Similarly, it's essential to keep everything on the boundary mutexed! And to keep the mutexed areas short!
Try to avoid locking more than one mutex at the same time. If you do lock more than one mutex, always lock them in the same order!
Where possible use ERROR-CHECKING or RECURSIVE mutexes. FAST mutexes are just asking for trouble, with very little actual (measured) speed gain.
If you get into a deadlock situation, run it in gdb, hit ctrl-c, visit each thread and backtrace. You can find the problem quite quickly that way. (Livelock is much harder!)
One final suggestion: Build it single-threaded, then start optimizing. On a single-core system, you may find yourself gaining more speed from things like foo[i++]=bar ==> *(foo++)=bar than from threading.
Addendum: What I said about keeping mutexed areas short up above? Consider two threads: (Given a global shared mutex object of a Mutex class.)
/*ThreadA:*/ while(1){ mutex.lock(); printf("a\n"); usleep(100000); mutex.unlock(); }
/*ThreadB:*/ while(1){ mutex.lock(); printf("b\n"); usleep(100000); mutex.unlock(); }
What will happen?
Under my version of Linux, one thread will run continuously and the other will starve. Very very rarely they will change places when a context swap occurs between mutex.unlock() and mutex.lock().
Addendum: In your case, this is unlikely to be an issue. But with other problems one may not know in advance how long a particular work-chunk will take to complete. Breaking a problem down into 100 parts (instead of 4 parts) and using a work-queue to split it up across 4 cores smooths out such discrepancies.
If one work-chunk takes 5 times longer to complete than another, well, it all evens out in the end. Though with too many chunks, the overhead of acquiring new work-chunks creates noticeable delays. It's a problem-specific balancing act.

If your compiler supports OpenMP (I know VC++ 8.0 and 9.0 do, as does gcc), it can make things like this much easier to do.
You don't just want to make a lot of threads - there's a point of diminishing returns where adding new threads slows things down as you start getting more and more context switches. At some point, using too many threads can actually make the parallel version slower than just using a linear algorithm. The optimal number of threads is a function of the number of cpus/cores available, and the percentage of time each thread spends blocked on things like I/O. Take a look at this article by Herb Sutter for some discussion on parallel performance gains.
OpenMP lets you easily adapt the number of threads created to the number of CPUs available. Using it (especially in data-processing cases) often involves simply putting in a few #pragma omps in existing code, and letting the compiler handle creating threads and synchronization.
In general - as long as data isn't changing, you won't have to lock read-only data. If you can be sure that each pixel slot will only be written once and you can guarantee that all the writing has been completed before you start reading from the result, you won't have to lock that either.
For OpenMP, there's no need to do anything special as far as functors / function objects. Write it whichever way makes the most sense to you. Here's an image-processing example from Intel (converts rgb to grayscale):
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{
pGrayScaleBitmap[i] = (unsigned BYTE)
(pRGBBitmap[i].red * 0.299 +
pRGBBitmap[i].green * 0.587 +
pRGBBitmap[i].blue * 0.114);
}
This automatically splits up into as many threads as you have CPUs, and assigns a section of the array to each thread.

I would recommend boost::thread and boost::gil (generic image libray). Because there are quite much templates involved, I'm not sure whether the code-size will still be acceptable for you. But it's part of boost, so it is probably worth a look.

As a bit of a left-field idea...
What systems are you running this on? Have you thought of using the GPU in your PCs?
Nvidia have the CUDA APIs for this sort of thing

I don't think you want to have one thread per row. There can be a lot of rows, and you will spend lot of memory/CPU resources just launching/destroying the threads and for the CPU to switch from one to the other. Moreover, if you have P processors with C core, you probably won't have a lot of gain with more than C*P threads.
I would advise you to use a defined number of client threads, for example N threads, and use the main thread of your application to distribute the rows to each thread, or they can simply get instruction from a "job queue". When a thread has finished with a row, it can check in this queue for another row to do.
As for libraries, you can use boost::thread, which is quite portable and not too heavyweight.

Can I ask which platform you're writing this for? I'm guessing that because executable size is an issue you're not targetting on a desktop machine. In which case does the platform have multiple cores or hyperthreaded? If not then adding threads to your application could have the opposite effect and slow it down...

To optimize simple image transformations, you are far better off using SIMD vector math than trying to multi-thread your program.

Your compiler doesn't support OpenMP. Another option is to use a library approach, both Intel's Threading Building Blocks and Microsoft Concurrency Runtime are available (VS 2010).
There is also a set of interfaces called the Parallel Pattern Library which are supported by both libraries and in these have a templated parallel_for library call.
so instead of:
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{ ...}
you would write:
parallel_for(0,numPixels,1,ToGrayScale());
where ToGrayScale is a functor or pointer to function. (Note if your compiler supports lambda expressions which it likely doesn't you can inline the functor as a lambda expression).
parallel_for(0,numPixels,1,[&](int i)
{
pGrayScaleBitmap[i] = (unsigned BYTE)
(pRGBBitmap[i].red * 0.299 +
pRGBBitmap[i].green * 0.587 +
pRGBBitmap[i].blue * 0.114);
});
-Rick

Check the Creating an Image-Processing Network walkthrough on MSDN, which explains how to use Parallel Patterns Library to compose a concurrent image processing pipeline.
I'd also suggest Boost.GIL, which generates highly efficient code. For simple multi-threaded example, check gil_threaded by Victor Bogado. The An image processing network using Dataflow.Signals and Boost.GIL explains an interestnig dataflow model too.

One thread per pixel row is insane, best have around n-1 to 2n threads (for n cpu's), and make each one loop fetching one jobunit (may be one row, or other kind of partition)
on unix-like, use pthreads it's simple and lightweight.

Maybe write your own tiny library which implements a few standard threading functions using #ifdef's for every platform? There really isn't much to it, and that would reduce the executable size way more than any library you could use.
Update: And for work distribution - split your image into pieces and give each thread a piece. So that when it's done with the piece, it's done. This way you avoid implementing job queues that will further increase your executable's size.

I think regardless of the threading model you choose (boost, pthread, native threads, etc). I think you should consider a thread pool as opposed to a thread per row. Threads in a thread pool are very cheap to "start" since they are already created as far as the OS is concerned, it's just a matter of giving it something to do.
Basically, you could have say 4 threads in your pool. Then in a serial fashion, for each pixel, tell the next thread in the thread pool to process the pixel. This way you are effectively processing no more than 4 pixels at a time. You could make the size of the pool based either on user preference or on the number of CPUs the system reports.
This is by far the simplest way IMHO to add threading to a SIMD task.

I think map/reduce framework will be the ideal thing to use in this situation. You can use Hadoop streaming to use your existing C++ application.
Just implement the map and reduce jobs.
As you said, you can use row-level maniputations as a map task and combine the row level manipulations to the final image in the reduce task.
Hope this is useful.

It is very possible, that bottleneck is not CPU but memory bandwidth, so multi-threading WON'T help a lot. Try to minimize memory access and work on limited memory blocks, so that more data can be cached. I had a similar problem a while ago and I decided to optimize my code to use SSE instructions. Speed increase was almost 4x per single thread!

You also could use libraries like IPP or the Cassandra Vision C++ API that are mostly much more optimized than you own code.

There's another option of using assembly for optimization. Now, one exciting project for dynamic code generation is softwire (which dates back awhile - here is the original project's site). It has been developed by Nick Capens and grew into now commercially available swiftshader. But the spin-off of the original softwire is still available on gna.org.
This could serve as an introduction to his solution.
Personally, I don't believe you can gain significant performance by utilizing multiple threads for your problem.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js