Force Program / Thread to use 100% of processor(s) resources

Force Program / Thread to use 100% of processor(s) resources - c++

I do some c++ programming related to mapping software and mathematical modeling.
Some programs take anywhere from one to five hours to perform and output a result; however, they only consume 50% of my core duo. I tried the code on another dual processor based machine with the same result.
Is there a way to force a program to use all available processer resources and memory?
Note: I'm using ubuntu and g++

A thread can only run on one core at a time. If you want to use both cores, you need to find a way to do half the work in another thread.
Whether this is possible, and if so how to divide the work between threads, is completely dependent on the specific work you're doing.
To actually create a new thread, see the Boost.Thread docs, or the pthreads docs, or the Win32 API docs.
[Edit: other people have suggested using libraries to handle the threads for you. The reason I didn't mention these is because I have no experience of them, not because I don't think they're a good idea. They probably are, but it all depends on your algorithm and your platform. Threads are almost universal, but beware that multithreaded programming is often difficult: you create a lot of problems for yourself.]

The quickest method would be to read up about openMP and use it to parallelise your program.
Compile with the command g++ -fopenmp provided that your g++ version is >=4

You need to have as many threads running as there are CPU cores available in order to be able to potentially use all the processor time. (You can still be pre-empted by other tasks, though.)
There are many way to do this, and it depends completely on what you're processing. You may be able to use OpenMP or a library like TBB to do it almost transparently, however.

You're right that you'll need to use a threaded approach to use more than one core. Boost has a threading library, but that's not the whole problem: you also need to change your algorithm to work in a threaded environment.
There are some algorithms that simply cannot run in parallel -- for example, SHA-1 makes a number of "passes" over its data, but they cannot be threaded because each pass relies on the output of the run before it.
In order to parallelize your program, you'll need to be sure your algorithm can "divide and conquer" the problem into independent chunks, which it can then process in parallel before combining them into a full result.
Whatever you do, be very careful to verify the correctness of your answer. Save the single-threaded code, so you can compare its output to that of your multi-threaded code; threading is notoriously hard to do, and full of potential errors.
It may be more worth your time to avoid threading entirely, and try profiling your code instead: you may be able to get dramatic speed improvements by optimizing the most frequently-executed code, without getting near the challenges of threading.

To take full use of a multicore processor, you need to make the program multithreaded.

An alternative to multi-threading is to use more than one process. You would still need to divide & conquer your problem into mutiple independent chunks.

By 50%, do you mean just one core?
If the application isn't either multi-process or multi-threaded, there's no way it can use both cores at once.

Add a while(1) { } somewhere in main()?
Or to echo real advice, either launch multiple processes or rewrite the code to use threads. I'd recommend running multiple processes since that is easier, although if you need to speed up a single run it doesn't really help.

To get to 100% for each thread, you will need to:
(in each thread):
Eliminate all secondary storage I/O
(disk read/writes)
Eliminate all display I/O (screen
writes/prints)
Eliminate all locking mechanisms
(mutexs, semaphores)
Eliminate all Primary storage I/O
(operate strictly out of registers
and cache, not DRAM).
Good luck on your rewrite!

Related

Increasing C++ Program CPU Use

I have a program written in C++ that runs a number of for loops per second without using anything that would make it wait for any reason. It consistently uses 2-10% of the CPU. Is there any way to force it to use more of the CPU and do a greater number of calculations without making the program more complex? Additionally, I compile with C::B on a Windows computer. Essentially, I'm asking whether there is a way to make my program faster by increasing usage of CPU, and if so, how.

That depends on why it's only using 10% of the CPU. If it's because you're using a multi-CPU machine and your program is using only one CPU, then no, you will have to introduce concurrency into your code to use that additional horsepower.
If it's being limited by something else (e.g. copying data to and from the disk), then you don't need to focus on CPU, you need to focus on whatever the bottleneck is. Most likely, the limiter will be reading from the disk, which you can improve by using better caching mechanisms.

Assuming your application has the power (PROCESS_SET_INFORMATION access right), you can use SetPriorityClass to bump up your priortiy (to the usual detriment of all other processes, of course).
You can go ABOVE_NORMAL_PRIORITY_CLASS (try this one first), HIGH_PRIORITY_CLASS (be very careful with this one) or REALTIME_PRIORITY_CLASS (I would strongly suggest that you probably shouldn't give this one a shot).
If you try the higher priorities and it's still clocking pretty low, then that's probably because you're not CPU-bound (such as if you're writing data to an output file). If that's the case, you'll probably have to find a way to make yourself CPU bound.
Just keep in mind that doing so may not be necessary (or even desirable). If you're running at a higher priority than other threads and you're still not sucking up a lot of CPU, it's probably because Windows has (most likely, rightfully) decided you don't need it.

It's really not the program's right or responsibility to demand additional resources from the system. That's the OS' job, as resource scheduler.
If it is necessary to use more CPU time than the OS sees fit, you should request that from the OS using the platform-dependent API. In this case, that seems to be something along the lines of SetPriorityClass or SetThreadPriority.

Creating a thread & giving higher priority to the thread might be one way.

If you use C++, consider using Intel Threading Building Block. You can find some examples here.

Some profilers give very nice indications of where bottlenecks in your code are. For example - the CodeAnalyst (for AMD chips only) has the instructions per cycle ratio. I'm sure intel profilers are similar.
As Billy O'Neal says though, if your runnning on an 8-core, being stuck on 10 percent of cpu is about right. If this is your problem then Windows msvc++ has a parallel mode (the parallel patterns library) for the standard algorithms. This can give parallelisation for free if have written your loops the c++ way (its still your responsibility to make sure your loops are thread safe). I've not used the msvc version but the gnu::__parallel_for_each etc work a treat.

how to use quad core CPU in application

For using all the cores of a quad core processor what do I need to change in my code is it about adding support of multi threading or is it which is taken care by OS itself. I am having FreeBSD and language I am using is C++. I want to give complete CPU cycles to my application at least 90%.

You need some form of parallelism. Multi-threading or multi-processing would be fine.
Usually, multiple threads are easier to handle (since they can access shared data) than multiple processes. However, usually, multiple threads are harder to handle (since they access shared data) than multiple processes.
And, yes, I wrote this deliberately.
If you have a SIMD scenario, Ninefingers' suggestion to look at OpenMP is also very good. (If you don't know what SIMD means, see Ninefingers' helpful comment below.)

For multi-threaded applications in C++ may I suggest Boost.Thread which should help you access the full potential of your quad-core machine.
As for changing your code, you might want to consider making things as immutable as possible. State transitions between threads are much more difficult to debug. There a plethora of things that could potentially happen in unexpected ways. See this SO thread.

Another option not mentioned here, threading aside, is the use of OpenMP available via the -fopenmp and the libgomp library, both of which I have installed on my FreeBSD 8 system.
These give you #pragma directives to parallelise certain loops, while statements etc i.e. the bits you can parallelise. It takes care of threading and cpu association for you. Note it is a general solution and therefore might not be the optimum way to parallelise, but it will allow you to parallelise certain routines.
Take a look at this: https://computing.llnl.gov/tutorials/openMP/
As for using threads/processes themselves, certain routines and ways of working lend themselves to it. Can you break tasks out into such a way? Does it make sense to fork() your process or create a thread? If so, do so, but if not, don't try to force your application to be multi-threaded just because. An example I usually give is the greatest common divisor algorithm - it relies on the step before all the time in the traditional implementation therefore is difficult to make parallel.
Also note it is well known that for certain algorithms, parallelisation is actually slower for small values of whatever you are doing in parallel, because although the jobs complete more quickly, the associated time cost of forking and joining (be that threads or processes) actually pushes the time above that of a serial implementation.

I think your only option is to run several threads. If your application is single-threaded, then it will only run on one of the cores (at a time), but if you have more threads, they can run simultaneously.

You need to add support to your application for parallelism through the use of Threading.
Once you have support for parallelism, it's up to the OS to assign your threads to CPU cores.

The first thing I think you should look at is whether your application and its algorithms are suited to be executed in parellel (or possibly as a set of serial tasks that can be processed independently). If this is not the case, it will be difficult to multithread it or break it up into parallel processes, and you may need to look into modifying the way it works.
Once you have established that you will be able to benefit from parallel processing you have the option to either use several processes or threads. The choice depends a lot on the nature of your application and how independent the parallel processes can be. It is easier to coordinate and share data between threads since they are in the same process, but also quite a bit more challenging to develop and debug.
Boost.Thread is a good library if you decide to go down the multi-threaded route.

I want to give complete CPU cycles to my application at least 90%.
Why? Your chip's not hot enough?
Seriously, it takes world experts dozens if not hundreds of hours to parallelize and load-balance an application so that it uses 90% of all four cores. Your CPU is already paid for and it costs the same whether you use it or not. (Actually, it costs slightly less to run, electrically speaking, if you don't use it.) How much is your time worth? How many hours are you willing to invest in order to make more effective use of a resource that may have cost you $300 and is probably sitting idle most of the time anyway?
It's possible to get speedups through parallelism, but it's expensive in human time. You need a good reason to justify it. (Learning how is a good enough reason.)
All the good books I know on parallel programming are for languages other than C++, and for good reason. If you want interesting stuff on parallelism check out Implicit Parallel Programmin in pH or Concurrent Programming in ML or the Fortress Project.

What are the "things to know" when diving into multi-threaded programming in C++

I'm currently working on a wireless networking application in C++ and it's coming to a point where I'm going to want to multi-thread pieces of software under one process, rather than have them all in separate processes. Theoretically, I understand multi-threading, but I've yet to dive in practically.
What should every programmer know when writing multi-threaded code in C++?

I would focus on design the thing as much as partitioned as possible so you have the minimal amount of shared things across threads. If you make sure you don't have statics and other resources shared among threads (other than those that you would be sharing if you designed this with processes instead of threads) you would be fine.
Therefore, while yes, you have to have in mind concepts like locks, semaphores, etc, the best way to tackle this is to try to avoid them.

I am no expert at all in this subject. Just some rule of thumb:
Design for simplicity, bugs really are hard to find in concurrent code even in the simplest examples.
C++ offers you a very elegant paradigm to manage resources(mutex, semaphore,...): RAII. I observed that it is much easier to work with boost::thread than to work with POSIX threads.
Build your code as thread-safe. If you don't do so, your program could behave strangely

I am exactly in this situation: I wrote a library with a global lock (many threads, but only one running at a time in the library) and am refactoring it to support concurrency.
I have read books on the subject but what I learned stands in a few points:
think parallel: imagine a crowd passing through the code. What happens when a method is called while already in action ?
think shared: imagine many people trying to read and alter shared resources at the same time.
design: avoid the problems that points 1 and 2 can raise.
never think you can ignore edge cases, they will bite you hard.
Since you cannot proof-test a concurrent design (because thread execution interleaving is not reproducible), you have to ensure that your design is robust by carefully analyzing the code paths and documenting how the code is supposed to be used.
Once you understand how and where you should bottleneck your code, you can read the documentation on the tools used for this job:
Mutex (exclusive access to a resource)
Scoped Locks (good pattern to lock/unlock a Mutex)
Semaphores (passing information between threads)
ReadWrite Mutex (many readers, exclusive access on write)
Signals (how to 'kill' a thread or send it an interrupt signal, how to catch these)
Parallel design patterns: boss/worker, producer/consumer, etc (see schmidt)
platform specific tools: openMP, C blocks, etc
Good luck ! Concurrency is fun, just take your time...

You should read about locks, mutexes, semaphores and condition variables.
One word of advice, if your app has any form of UI make sure you always change it from the UI thread. Most UI toolkits/frameworks will crash (or behave unexpectedly) if you access them from a background thread. Usually they provide some form of dispatching method to execute some function in the UI thread.

Never assume that external APIs are threadsafe. If it is not explicitly stated in their docs, do not call them concurrently from multiple threads. Instead, limit your use of them to a single thread or use a mutex to prevent concurrent calls (this is rather similar to the aforementioned GUI libraries).
Next point is language-related. Remember, C++ has (currently) no well-defined approach to threading. The compiler/optimizer does not know if code might be called concurrently. The volatile keyword is useful to prevent certain optimizations (i.e. caching of memory fields in CPU registers) in multi-threaded contexts, but it is no synchronization mechanism.
I'd recommend boost for synchronization primitives. Don't mess with platform APIs. They make your code difficult to port because they have similar functionality on all major platforms, but slightly different detail behaviour. Boost solves these problems by exposing only common functionality to the user.
Furthermore, if there's even the smallest chance that a data structure could be written to by two threads at the same time, use a synchronization primitive to protect it. Even if you think it will only happen once in a million years.

One thing I've found very useful is to make the application configurable with regard to the actual number of threads it uses for various tasks. For example, if you have multiple threads accessing a database, make the number of those threads be configurable via a command line parameter. This is extremely handy when debugging - you can exclude threading issues by setting the number to 1, or force them by setting it to a high number. It's also very handy when working out what the optimal number of threads is.

Make sure you test your code in a single-cpu system and a multi-cpu system.
Based on the comments:-
Single socket, single core
Single socket, two cores
Single socket, more than two cores
Two sockets, single core each
Two sockets, combination of single, dual and multi core cpus
Mulitple sockets, combination of single, dual and multi core cpus
The limiting factor here is going to be cost. Ideally, concentrate on the types of system your code is going to run on.

In addition to the other things mentioned, you should learn about asynchronous message queues. They can elegantly solve the problems of data sharing and event handling. This approach works well when you have concurrent state machines that need to communicate with each other.
I'm not aware of any message passing frameworks tailored to work only at the thread level. I've only seen home-brewed solutions. Please comment if you know of any existing ones.
EDIT:
One could use the lock-free queues from Intel's TBB, either as-is, or as the basis for a more general message-passing queue.

Since you are a beginner, start simple. First make it work correctly, then worry about optimizations. I've seen people try to optimize by increasing the concurrency of a particular section of code (often using dubious tricks), without ever looking to see if there was any contention in the first place.
Second, you want to be able to work at as high a level as you can. Don't work at the level of locks and mutexs if you can using an existing master-worker queue. Intel's TBB looks promising, being slightly higher level than pure threads.
Third, multi-threaded programming is hard. Reduce the areas of your code where you have to think about it as much as possible. If you can write a class such that objects of that class are only ever operated on in a single thread, and there is no static data, it greatly reduces the things that you have to worry about in the class.

A few of the answers have touched on this, but I wanted to emphasize one point:
If you can, make sure that as much of your data as possible is only accessible from one thread at a time. Message queues are a very useful construct to use for this.
I haven't had to write much heavily-threaded code in C++, but in general, the producer-consumer pattern can be very helpful in utilizing multiple threads efficiently, while avoiding the race conditions associated with concurrent access.
If you can use someone else's already-debugged code to handle thread interaction, you're in good shape. As a beginner, there is a temptation to do things in an ad-hoc fashion - to use a "volatile" variable to synchronize between two pieces of code, for example. Avoid that as much as possible. It's very difficult to write code that's bulletproof in the presence of contending threads, so find some code you can trust, and minimize your use of the low-level primitives as much as you can.

My top tips for threading newbies:
If you possibly can, use a task-based parallelism library, Intel's TBB being the most obvious one. This insulates you from the grungy, tricky details and is more efficient than anything you'll cobble together yourself. The main downside is this model doesn't support all uses of multithreading; it's great for exploiting multicores for compute power, less good if you wanted threads for waiting on blocking I/O.
Know how to abort threads (or in the case of TBB, how to make tasks complete early when you decide you didn't want the results after all). Newbies seem to be drawn to thread kill functions like moths to a flame. Don't do it... Herb Sutter has a great short article on this.

Make sure to explicitly know what objects are shared and how they are shared.
As much as possible make your functions purely functional. That is they have inputs and outputs and no side effects. This makes it much simpler to reason about your code. With a simpler program it isn't such a big deal but as the complexity rises it will become essential. Side effects are what lead to thread-safety issues.
Plays devil's advocate with your code. Look at some code and think how could I break this with some well timed thread interleaving. At some point this case will happen.
First learn thread-safety. Once you get that nailed down then you move onto the hard part: Concurrent performance. This is where moving away from global locks is essential. Figuring out ways to minimize and remove locks while still maintaining the thread-safety is hard.

Keep things dead simple as much as possible. It's better to have a simpler design (maintenance, less bugs) than a more complex solution that might have slightly better CPU utilization.
Avoid sharing state between threads as much as possible, this reduces the number of places that must use synchronization.
Avoid false-sharing at all costs (google this term).
Use a thread pool so you're not frequently creating/destroying threads (that's expensive and slow).
Consider using OpenMP, Intel and Microsoft (possibly others) support this extension to C++.
If you are doing number crunching, consider using Intel IPP, which internally uses optimized SIMD functions (this isn't really multi-threading, but is parallelism of a related sorts).
Have tons of fun.

Stay away from MFC and it's multithreading + messaging library.
In fact if you see MFC and threads coming toward you - run for the hills (*)
(*) Unless of course if MFC is coming FROM the hills - in which case run AWAY from the hills.

The biggest "mindset" difference between single-threaded and multi-threaded programming in my opinion is in testing/verification. In single-threaded programming, people will often bash out some half-thought-out code, run it, and if it seems to work, they'll call it good, and often get away with it using it in a production environment.
In multithreaded programming, on the other hand, the program's behavior is non-deterministic, because the exact combination of timing of which threads are running for which periods of time (relative to each other) will be different every time the program runs. So just running a multithreaded program a few times (or even a few million times) and saying "it didn't crash for me, ship it!" is entirely inadequate.
Instead, when doing a multithreaded program, you always should be trying to prove (at least to your own satisfaction) that not only does the program work, but that there is no way it could possibly not work. This is much harder, because instead of verifying a single code-path, you are effectively trying to verify a near-infinite number of possible code-paths.
The only realistic way to do that without having your brain explode is to keep things as bone-headedly simple as you can possibly make them. If you can avoid using multithreading totally, do that. If you must do multithreading, share as little data between threads as possible, and use proper multithreading primitives (e.g. mutexes, thread-safe message queues, wait conditions) and don't try to get away with half-measures (e.g. trying to synchronize access to a shared piece of data using only boolean flags will never work reliably, so don't try it)
What you want to avoid is the multithreading hell scenario: the multithreaded program that runs happily for weeks on end on your test machine, but crashes randomly, about once a year, at the customer's site. That kind of race-condition bug can be nearly impossible to reproduce, and the only way to avoid it is to design your code extremely carefully to guarantee it can't happen.
Threads are strong juju. Use them sparingly.

You should have an understanding of basic systems programing, in particular:
Synchronous vs Asynchronous I/O (blocking vs. non-blocking)
Synchronization mechanisms, such as lock and mutex constructs
Thread management on your target platform

I found viewing the introductory lectures on OS and systems programming here by John Kubiatowicz at Berkeley useful.

Part of my graduate study area relates to parallelism.
I read this book and found it a good summary of approaches at the design level.
At the basic technical level, you have 2 basic options: threads or message passing. Threaded applications are the easiest to get off the ground, since pthreads, windows threads or boost threads are ready to go. However, it brings with it the complexity of shared memory.
Message-passing usability seems mostly limited at this point to the MPI API. It sets up an environment where you can run jobs and partition your program between processors. It's more for supercomputer/cluster environments where there's no intrinsic shared memory. You can achieve similar results with sockets and so forth.
At another level, you can use language type pragmas: the popular one today is OpenMP. I've not used it, but it appears to build threads in via preprocessing or a link-time library.
The classic problem is synchronization here; all the problems in multiprogramming come from the non-deterministic nature of multiprograms, which can not be avoided.
See the Lamport timing methods for a further discussion of synchronizations and timing.
Multithreading is not something that only Ph.D.`s and gurus can do, but you will have to be pretty decent to do it without making insane bugs.

I'm in the same boat as you, I am just starting multi threading for the first time as part of a project and I've been looking around the net for resources. I found this blog to be very informative. Part 1 is pthreads, but I linked starting on the boost section.

I have written a multithreaded server application and a multithreaded shellsort. They were both written in C and use NT's threading functions "raw" that is without any function library in-between to muddle things. They were two quite different experiences with different conclusions to be drawn. High performance and high reliability were the main priorities although coding practices had a higher priority if one of the first two was judged to be threatened in the long term.
The server application had both a server and a client part and used iocps to manage requests and responses. When using iocps it is important never to use more threads than you have cores. Also I found that requests to the server part needed a higher priority so as not to lose any requests unnecessarily. Once they were "safe" I could use lower priority threads to create the server responses. I judged that the client part could have an even lower priority. I asked the questions "what data can't I lose?" and "what data can I allow to fail because I can always retry?" I also needed to be able to interface to the application's settings through a window and it had to be responsive. The trick was that the UI had normal priority, the incoming requests one less and so on. My reasoning behind this was that since I will use the UI so seldom it can have the highest priority so that when I use it it will respond immediately. Threading here turned out to mean that all separate parts of the program in the normal case would/could be running simultaneously but when the system was under higher load, processing power would be shifted to the vital parts due to the prioritization scheme.
I've always liked shellsort so please spare me from pointers about quicksort this or that or blablabla. Or about how shellsort is ill-suited for multithreading. Having said that, the problem I had had to do with sorting a semi-largelist of units in memory (for my tests I used a reverse-sorted list of one million units of forty bytes each. Using a single-threaded shellsort I could sort them at a rate of roughly one unit every two us (microseconds). My first attempt to multithread was with two threads (though I soon realized that I wanted to be able to specify the number of threads) and it ran at about one unit every 3.5 seconds, that is to say SLOWER. Using a profiler helped a lot and one bottleneck turned out to be the statistics logging (i e compares and swaps) where the threads would bump into each other. Dividing up the data between the threads in an efficient way turned out to be the biggest challenge and there is definitley more I can do there such as dividing the vector containing the indeces to the units in cache-line size adapted chunks and perhaps also comparing all indeces in two cache lines before moving to the next line (at least I think there is something I can do there - the algorithms get pretty complicated). In the end, I achieved a rate of one unit every microsecond with three simultaneous threads (four threads about the same, I only had four cores available).
As to the original question my advice to you would be
If you have the time, learn the threading mechanism at the lowest possible level.
If performance is important learn the related mechanisms that the OS provides. Multi-threading by itself is seldom enough to achieve an application's full potential.
Use profiling to understand the quirks of multiple threads working on the same memory.
Sloppy architectural work will kill any app, regardless of how many cores and systems you have executing it and regardless of the brilliance of your programmers.
Sloppy programming will kill any app, regardless of the brilliance of the architectural foundation.
Understand that using libraries lets you reach the development goal faster but at the price of less understanding and (usually) lower performance .

Before giving any advice on do's and dont's about multi-thread programming in C++, I would like to ask the question Is there any particular reason you want to start writing the application in C++?
There are other programming paradigms where you utilize the multi-cores without getting into multi-threaded programming. One such paradigm is functional programming. Write each piece of your code as functions without any side effects. Then it is easy to run it in multiple thread without worrying about synchronization.
I am using Erlang for my development purpose. It has increased by productivity by at least 50%. Code running may not be as fast as the code written in C++. But I have noticed that for most of the back-end offline data processing, speed is not as important as distribution of work and utilizing the hardware as much as possible. Erlang provides a simple concurrency model where you can execute a single function in multiple-threads without worrying about the synchronization issue. Writing multi-threaded code is easy, but debugging that is time consuming. I have done multi-threaded programming in C++, but I am currently happy with Erlang concurrency model. It is worth looking into.

Make sure you know what volatile means and it's uses(which may not be obvious at first).
Also, when designing multithreaded code, it helps to imagine that an infinite amount of processors is executing every single line of code in your application at once. (er, every single line of code that is possible according to your logic in your code.) And that everything that isn't marked volatile the compiler does a special optimization on it so that only the thread that changed it can read/set it's true value and all the other threads get garbage.

Multithreaded image processing in C++

I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur). This is done on a per-pixel basis.
Such image mapulations are very stressful on the CPU. I would like to use multithreading to speed things up. How would I do this? I was thinking of creating one thread per row of pixels.
I have several requirements:
Executable size must be minimized. In other words, I can't use massive libraries. What's the most light-weight, portable threading library for C/C++?
Executable size must be minimized. I was thinking of having a function forEachRow(fp* ) which runs a thread for each row, or even a forEachPixel(fp* ) where fp operates on a single pixel in its own thread. Which is best?
Should I use normal functions or functors or functionoids or some lambda functions or ... something else?
Some operations use optimizations which require information from the previous pixel processed. This makes forEachRow favorable. Would using forEachPixel be better even considering this?
Would I need to lock my read-only and write-only arrays?
The input is only read from, but many operations require input from more than one pixel in the array.
The ouput is only written once per pixel.
Speed is also important (of course), but optimize executable size takes precedence.
Thanks.
More information on this topic for the curious: C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks

Don't embark on threading lightly! The race conditions can be a major pain in the arse to figure out. Especially if you don't have a lot of experience with threads! (You've been warned: Here be dragons! Big hairy non-deterministic impossible-to-reliably-reproduce dragons!)
Do you know what deadlock is? How about Livelock?
That said...
As ckarmann and others have already suggested: Use a work-queue model. One thread per CPU core. Break the work up into N chunks. Make the chunks reasonably large, like many rows. As each thread becomes free, it snags the next work chunk off the queue.
In the simplest IDEAL version, you have N cores, N threads, and N subparts of the problem with each thread knowing from the start exactly what it's going to do.
But that doesn't usually happen in practice due to the overhead of starting/stopping threads. You really want the threads to already be spawned and waiting for action. (E.g. Through a semaphore.)
The work-queue model itself is quite powerful. It lets you parallelize things like quick-sort, which normally doesn't parallelize across N threads/cores gracefully.
More threads than cores? You're just wasting overhead. Each thread has overhead. Even at #threads=#cores, you will never achieve a perfect Nx speedup factor.
One thread per row would be very inefficient! One thread per pixel? I don't even want to think about it. (That per-pixel approach makes a lot more sense when playing with vectorized processor units like they had on the old Crays. But not with threads!)
Libraries? What's your platform? Under Unix/Linux/g++ I'd suggest pthreads & semaphores. (Pthreads is also available under windows with a microsoft compatibility layer. But, uhgg. I don't really trust it! Cygwin might be a better choice there.)
Under Unix/Linux, man:
* pthread_create, pthread_detach.
* pthread_mutexattr_init, pthread_mutexattr_settype, pthread_mutex_init,
* pthread_mutexattr_destroy, pthread_mutex_destroy, pthread_mutex_lock,
* pthread_mutex_trylock, pthread_mutex_unlock, pthread_mutex_timedlock.
* sem_init, sem_destroy, sem_post, sem_wait, sem_trywait, sem_timedwait.
Some folks like pthreads' condition variables. But I always preferred POSIX 1003.1b semaphores. They handle the situation where you want to signal another thread BEFORE it starts waiting somewhat better. Or where another thread is signaled multiple times.
Oh, and do yourself a favor: Wrap your thread/mutex/semaphore pthread calls into a couple of C++ classes. That will simplify matters a lot!
Would I need to lock my read-only and write-only arrays?
It depends on your precise hardware & software. Usually read-only arrays can be freely shared between threads. But there are cases where that is not so.
Writing is much the same. Usually, as long as only one thread is writing to each particular memory spot, you are ok. But there are cases where that is not so!
Writing is more troublesome than reading as you can get into these weird fencepost situations. Memory is often written as words not bytes. When one thread writes part of the word, and another writes a different part, depending on the exact timing of which thread does what when (e.g. nondeterministic), you can get some very unpredictable results!
I'd play it safe: Give each thread its own copy of the read and write areas. After they are done, copy the data back. All under mutex, of course.
Unless you are talking about gigabytes of data, memory blits are very fast. That couple of microseconds of performance time just isn't worth the debugging nightmare.
If you were to share one common data area between threads using mutexes, the collision/waiting mutex inefficiencies would pile up and devastate your efficiency!
Look, clean data boundaries are the essence of good multi-threaded code. When your boundaries aren't clear, that's when you get into trouble.
Similarly, it's essential to keep everything on the boundary mutexed! And to keep the mutexed areas short!
Try to avoid locking more than one mutex at the same time. If you do lock more than one mutex, always lock them in the same order!
Where possible use ERROR-CHECKING or RECURSIVE mutexes. FAST mutexes are just asking for trouble, with very little actual (measured) speed gain.
If you get into a deadlock situation, run it in gdb, hit ctrl-c, visit each thread and backtrace. You can find the problem quite quickly that way. (Livelock is much harder!)
One final suggestion: Build it single-threaded, then start optimizing. On a single-core system, you may find yourself gaining more speed from things like foo[i++]=bar ==> *(foo++)=bar than from threading.
Addendum: What I said about keeping mutexed areas short up above? Consider two threads: (Given a global shared mutex object of a Mutex class.)
/*ThreadA:*/ while(1){ mutex.lock(); printf("a\n"); usleep(100000); mutex.unlock(); }
/*ThreadB:*/ while(1){ mutex.lock(); printf("b\n"); usleep(100000); mutex.unlock(); }
What will happen?
Under my version of Linux, one thread will run continuously and the other will starve. Very very rarely they will change places when a context swap occurs between mutex.unlock() and mutex.lock().
Addendum: In your case, this is unlikely to be an issue. But with other problems one may not know in advance how long a particular work-chunk will take to complete. Breaking a problem down into 100 parts (instead of 4 parts) and using a work-queue to split it up across 4 cores smooths out such discrepancies.
If one work-chunk takes 5 times longer to complete than another, well, it all evens out in the end. Though with too many chunks, the overhead of acquiring new work-chunks creates noticeable delays. It's a problem-specific balancing act.

If your compiler supports OpenMP (I know VC++ 8.0 and 9.0 do, as does gcc), it can make things like this much easier to do.
You don't just want to make a lot of threads - there's a point of diminishing returns where adding new threads slows things down as you start getting more and more context switches. At some point, using too many threads can actually make the parallel version slower than just using a linear algorithm. The optimal number of threads is a function of the number of cpus/cores available, and the percentage of time each thread spends blocked on things like I/O. Take a look at this article by Herb Sutter for some discussion on parallel performance gains.
OpenMP lets you easily adapt the number of threads created to the number of CPUs available. Using it (especially in data-processing cases) often involves simply putting in a few #pragma omps in existing code, and letting the compiler handle creating threads and synchronization.
In general - as long as data isn't changing, you won't have to lock read-only data. If you can be sure that each pixel slot will only be written once and you can guarantee that all the writing has been completed before you start reading from the result, you won't have to lock that either.
For OpenMP, there's no need to do anything special as far as functors / function objects. Write it whichever way makes the most sense to you. Here's an image-processing example from Intel (converts rgb to grayscale):
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{
pGrayScaleBitmap[i] = (unsigned BYTE)
(pRGBBitmap[i].red * 0.299 +
pRGBBitmap[i].green * 0.587 +
pRGBBitmap[i].blue * 0.114);
}
This automatically splits up into as many threads as you have CPUs, and assigns a section of the array to each thread.

I would recommend boost::thread and boost::gil (generic image libray). Because there are quite much templates involved, I'm not sure whether the code-size will still be acceptable for you. But it's part of boost, so it is probably worth a look.

As a bit of a left-field idea...
What systems are you running this on? Have you thought of using the GPU in your PCs?
Nvidia have the CUDA APIs for this sort of thing

I don't think you want to have one thread per row. There can be a lot of rows, and you will spend lot of memory/CPU resources just launching/destroying the threads and for the CPU to switch from one to the other. Moreover, if you have P processors with C core, you probably won't have a lot of gain with more than C*P threads.
I would advise you to use a defined number of client threads, for example N threads, and use the main thread of your application to distribute the rows to each thread, or they can simply get instruction from a "job queue". When a thread has finished with a row, it can check in this queue for another row to do.
As for libraries, you can use boost::thread, which is quite portable and not too heavyweight.

Can I ask which platform you're writing this for? I'm guessing that because executable size is an issue you're not targetting on a desktop machine. In which case does the platform have multiple cores or hyperthreaded? If not then adding threads to your application could have the opposite effect and slow it down...

To optimize simple image transformations, you are far better off using SIMD vector math than trying to multi-thread your program.

Your compiler doesn't support OpenMP. Another option is to use a library approach, both Intel's Threading Building Blocks and Microsoft Concurrency Runtime are available (VS 2010).
There is also a set of interfaces called the Parallel Pattern Library which are supported by both libraries and in these have a templated parallel_for library call.
so instead of:
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{ ...}
you would write:
parallel_for(0,numPixels,1,ToGrayScale());
where ToGrayScale is a functor or pointer to function. (Note if your compiler supports lambda expressions which it likely doesn't you can inline the functor as a lambda expression).
parallel_for(0,numPixels,1,[&](int i)
{
pGrayScaleBitmap[i] = (unsigned BYTE)
(pRGBBitmap[i].red * 0.299 +
pRGBBitmap[i].green * 0.587 +
pRGBBitmap[i].blue * 0.114);
});
-Rick

Check the Creating an Image-Processing Network walkthrough on MSDN, which explains how to use Parallel Patterns Library to compose a concurrent image processing pipeline.
I'd also suggest Boost.GIL, which generates highly efficient code. For simple multi-threaded example, check gil_threaded by Victor Bogado. The An image processing network using Dataflow.Signals and Boost.GIL explains an interestnig dataflow model too.

One thread per pixel row is insane, best have around n-1 to 2n threads (for n cpu's), and make each one loop fetching one jobunit (may be one row, or other kind of partition)
on unix-like, use pthreads it's simple and lightweight.

Maybe write your own tiny library which implements a few standard threading functions using #ifdef's for every platform? There really isn't much to it, and that would reduce the executable size way more than any library you could use.
Update: And for work distribution - split your image into pieces and give each thread a piece. So that when it's done with the piece, it's done. This way you avoid implementing job queues that will further increase your executable's size.

I think regardless of the threading model you choose (boost, pthread, native threads, etc). I think you should consider a thread pool as opposed to a thread per row. Threads in a thread pool are very cheap to "start" since they are already created as far as the OS is concerned, it's just a matter of giving it something to do.
Basically, you could have say 4 threads in your pool. Then in a serial fashion, for each pixel, tell the next thread in the thread pool to process the pixel. This way you are effectively processing no more than 4 pixels at a time. You could make the size of the pool based either on user preference or on the number of CPUs the system reports.
This is by far the simplest way IMHO to add threading to a SIMD task.

I think map/reduce framework will be the ideal thing to use in this situation. You can use Hadoop streaming to use your existing C++ application.
Just implement the map and reduce jobs.
As you said, you can use row-level maniputations as a map task and combine the row level manipulations to the final image in the reduce task.
Hope this is useful.

It is very possible, that bottleneck is not CPU but memory bandwidth, so multi-threading WON'T help a lot. Try to minimize memory access and work on limited memory blocks, so that more data can be cached. I had a similar problem a while ago and I decided to optimize my code to use SSE instructions. Speed increase was almost 4x per single thread!

You also could use libraries like IPP or the Cassandra Vision C++ API that are mostly much more optimized than you own code.

There's another option of using assembly for optimization. Now, one exciting project for dynamic code generation is softwire (which dates back awhile - here is the original project's site). It has been developed by Nick Capens and grew into now commercially available swiftshader. But the spin-off of the original softwire is still available on gna.org.
This could serve as an introduction to his solution.
Personally, I don't believe you can gain significant performance by utilizing multiple threads for your problem.

Explicit code parallelism in c++

Out of order execution in CPUs means that a CPU can reorder instructions to gain better performance and it means the CPU is having to do some very nifty bookkeeping and such. There are other processor approaches too, such as hyper-threading.
Some fancy compilers understand the (un)interrelatedness of instructions to a limited extent, and will automatically interleave instruction flows (probably over a longer window than the CPU sees) to better utilise the processor. Deliberate compile-time interleaving of floating and integer instructions is another example of this.
Now I have highly-parallel task. And I typically have an ageing single-core x86 processor without hyper-threading.
Is there a straight-forward way to get my the body of my 'for' loop for this highly-parallel task to be interleaved so that two (or more) iterations are being done together? (This is slightly different from 'loop unwinding' as I understand it.)
My task is a 'virtual machine' running through a set of instructions, which I'll really simplify for illustration as:
void run(int num) {
for(int n=0; n<num; n++) {
vm_t data(n);
for(int i=0; i<data.len(); i++) {
data.insn(i).parse();
data.insn(i).eval();
}
}
}
So the execution trail might look like this:
data(1) insn(0) parse
data(1) insn(0) eval
data(1) insn(1) parse
...
data(2) insn(1) eval
data(2) insn(2) parse
data(2) insn(2) eval
Now, what I'd like is to be able to do two (or more) iterations explicitly in parallel:
data(1) insn(0) parse
data(2) insn(0) parse \ processor can do OOO as these two flow in
data(1) insn(0) eval /
data(2) insn(0) eval \ OOO opportunity here too
data(1) insn(1) parse /
data(2) insn(1) parse
I know, from profiling, (e.g. using Callgrind with --simulate-cache=yes), that parsing is about random memory accesses (cache missing) and eval is about doing ops in registers and then writing results back. Each step is several thousand instructions long. So if I can intermingle the two steps for two iterations at once, the processor will hopefully have something to do whilst the cache misses of the parse step are occurring...
Is there some c++ template madness to get this kind of explicit parallelism generated?
Of course I can do the interleaving - and even staggering - myself in code, but it makes for much less readable code. And if I really want unreadable, I can go so far as assembler! But surely there is some pattern for this kind of thing?

Given optimizing compilers and pipelined processors, I would suggest you just write clear, readable code.

Your best plan may be to look into OpenMP. It basically allows you to insert "pragmas" into your code which tell the compiler how it can split between processors.

Hyperthreading is a much higher-level system than instruction reordering. It makes the processor look like two processors to the operating system, so you'd need to use an actual threading library to take advantage of that. The same thing naturally applies to multicore processors.
If you don't want to use low-level threading libraries and instead want to use a task-based parallel system (and it sounds like that's what you're after) I'd suggest looking at OpenMP or Intel's Threading Building Blocks.
TBB is a library, so it can be used with any modern C++ compiler. OpenMP is a set of compiler extensions, so you need a compiler that supports it. GCC/G++ will from verion 4.2 and newer. Recent versions of the Intel and Microsoft compilers also support it. I don't know about any others, though.
EDIT: One other note. Using a system like TBB or OpenMP will scale the processing as much as possible - that is, if you have 100 objects to work on, they'll get split about 50/50 in a two-core system, 25/25/25/25 in a four-core system, etc.

Modern processors like the Core 2 have an enormous instruction reorder buffer on the order of nearly 100 instructions; even if the compiler is rather dumb the CPU can still make up for it.
The main issue would be if the code used a lot of registers, in which case the register pressure could force the code to be executed in sequence even if theoretically it could be done in parallel.

There is no support for parallel execution in the current C++ standard. This will change for the next version of the standard, due out next year or so.
However, I don't see what you are trying to accomplish. Are you referring to one single-core processor, or multiple processors or cores? If you have only one core, you should do whatever gets the fewest cache misses, which means whatever approach uses the smallest memory working set. This would probably be either doing all the parsing followed by all the evaluation, or doing the parsing and evaluation alternately.
If you have two cores, and want to use them efficiently, you're going to have to either use a particularly smart compiler or language extensions. Is there one particular operating system you're developing for, or should this be for multiple systems?

It sounds like you ran into the same problem chip designers face: Executing a single instruction takes a lot of effort, but it involves a bunch of different steps that can be strung together in an execution pipeline. (It is easier to execute things in parallel when you can build them out of separate blocks of hardware.)
The most obvious way is to split each task into different threads. You might want to create a single thread to execute each instruction to completion, or create one thread for each of your two execution steps and pass data between them. In either case, you'll have to be very careful with how you share data between threads and make sure to handle the case where one instruction affects the result of the following instruction. Even though you only have one core and only one thread can be running at any given time, your operating system should be able to schedule compute-intense threads while other threads are waiting for their cache misses.
(A few hours of your time would probably pay for a single very fast computer, but if you're trying to deploy it widely on cheap hardware it might make sense to consider the problem the way you're looking at it. Regardless, it's an interesting problem to consider.)

Take a look at cilk. It's an extension to ANSI C that has some nice constructs for writing parallelized code in C. However, since it's an extension of C, it has very limited compiler support, and can be tricky to work with.

This answer was written assuming the questions does not contain the part "And I typically have an ageing single-core x86 processor without hyper-threading.". I hope it might help other people who want to parallelize highly-parallel tasks, but target dual/multicore CPUs.
As already posted in another answer, OpenMP is a portable way how to do this. However my experience is OpenMP overhead is quite high and it is very easy to beat it by
rolling a DIY (Do It Youself) implementation. Hopefully OpenMP will improve over time, but as it is now, I would not recommend using it for anything else than prototyping.
Given the nature of your task, What you want to do is most likely a data based parallelism, which in my experience is quite easy - the programming style can be very similar to a single-core code, because you know what other threads are doing, which makes maintaining thread safety a lot easier - an approach which worked for me: avoid dependencies and call only thread safe functions from the loop.
To create a DYI OpenMP parallel loop you need to:
as a preparation create a serial for loop template and change your code to use functors to implement the loop bodies. This can be tedious, as you need to pass all references across the functor object
create a virtual JobItem interface for the functor, and inherit your functors from this interface
create a thread function which is able process individual JobItems objects
create a thread pool of the thread using this thread function
experiment with various synchronizations primitives to see which works best for you. While semaphore is very easy to use, its overhead is quite significant and if your loop body is very short, you do not want to pay this overhead for each loop iteration. What worked great for me was a combination of manual reset event + atomic (interlocked) counter as a much faster alternative.
experiment with various JobItem scheduling strategies. If you have long enough loop, it is better if each thread picks up multiple successive JobItems at a time. This reduces the synchronization overhead and at the same time it makes the threads more cache friendly. You may also want to do this in some dynamic way, reducing the length of the scheduled sequence as you are exhausting your tasks, or letting individual threads to steal items from other thread schedules.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js