General tips to improve multithreading (in C++)

I built a C++ program without thinking that I would later need to multithread it. I have now multithreaded the 3 main for loops with OpenMP. Here are the performance comparisons (as measured with time from bash):
Single thread
real 5m50.008s
user 5m49.072s
sys 0m0.877s
Multi thread (24 threads)
real 1m22.572s
user 28m28.206s
sys 0m4.170s
The use of 24 cores has reduced the real time by a factor of 4.24. Of course, I did not expect the code to be 24 times faster. I did not really know what to expect, actually.
- Is there a rule of thumb that would allow one to predict how much faster a given code will run with n threads in comparison to a single thread?
- Are there general tips for improving the performance of multithreaded processes?

I'm sure you know of the obvious like the cost of barriers. But it's hard to draw a line between what is trivial and what could be helpful to someone. Here are a few lessons learned from use, if I think of more I'll add them:
Keep variables thread-private for as long as possible. Consider this even for reductions, so that each thread contributes only a small number of collective results.
Prefer parallel runs of long sections of code and long parallel sections (#pragma omp parallel ... #pragma omp for), instead of parallelizing loops separately (#pragma omp parallel for).
Don't parallelize short loops. In a 2-dimensional iteration it often suffices to parallelize the outer loop. If you do parallelize the whole thing using collapse, be aware that OpenMP will linearize it by introducing a fused loop variable, and recovering the separate indices from it incurs overhead.
Use thread private heaps. Avoid sharing pools and collections if possible, even though different members of the collection would be accessed independently by different threads.
Profile your code and see how much time is spent on busy waiting and where that may be occurring.
Learn the consequences of using different schedule strategies. Try what's better, don't assume.
If you use critical sections, name them. All unnamed CSs have to wait for each other.
If your code uses random numbers, make it reproducible: define thread-local RNGs, seed everything in a controllable manner, impose order on reductions. Benchmark deterministically, not statistically.
Browse similar questions on Stack Overflow, e.g., the wonderful answers here.
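Several of the tips above (one long parallel region, thread-private accumulators, a single collective update per thread, a named critical section) can be combined in a short sketch. The workload, the Totals struct and all names are made up for illustration; the pragmas are standard OpenMP, and a compiler that is not in OpenMP mode simply ignores them and runs the loop serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical workload: sum of squares plus a count of values above a
// threshold, computed over one shared read-only input.
struct Totals {
    double sum_sq = 0.0;
    long   above  = 0;
};

Totals summarize(const std::vector<double>& x, double threshold) {
    Totals total;
    const long n = static_cast<long>(x.size());

    // One parallel region containing the work-sharing loop, instead of a
    // separate "parallel for" per loop.
    #pragma omp parallel
    {
        Totals local;  // thread-private accumulator, nothing shared in the hot loop

        #pragma omp for schedule(static) nowait
        for (long i = 0; i < n; ++i) {
            local.sum_sq += x[i] * x[i];
            if (x[i] > threshold) ++local.above;
        }

        // A single collective update per thread. The critical section is
        // named so it does not serialize against unrelated critical sections.
        #pragma omp critical(merge_totals)
        {
            total.sum_sq += local.sum_sq;
            total.above  += local.above;
        }
    }
    return total;
}

// Usage: these inputs are exactly representable, so the result does not
// depend on the order in which the threads merge their partial sums.
void summarize_example() {
    const Totals t = summarize({1.0, 2.0, 3.0, 4.0}, 2.5);
    assert(t.sum_sq == 30.0 && t.above == 2);
}
```

For general floating-point data the merge order changes the rounding, which is one reason the reproducibility tip asks you to impose an order on reductions.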

Related

Do I need to disable OpenMP on a 1 core machine explicitly?

I parallelized some C++ code with OpenMP.
But what if my program will work on a 1 core machine?
Do I need to disable threading at runtime:
Check cores
If cores > 1 use OpenMP
Else ignore OpenMP directives
If yes, does OpenMP have a special directive for it?

No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
Now, that context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (i=0; i<N; i++)
    data[i] = expensive_function(i);
then running on one core will likely only use one thread, or you can explicitly set the number of threads to one using the OMP_NUM_THREADS environment variable. If OpenMP is to use only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to leave this choice to run time (for example through OMP_NUM_THREADS) rather than hard-coding it.
However, there's a downside. Compiling with OpenMP can inhibit certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip count limits may not be unrolled or vectorized as effectively, because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth also compiling without OpenMP enabled and using that binary for single-core runs. You can use the same approach to test whether the difference in optimizations matters: run the OpenMP-enabled version with OMP_NUM_THREADS=1 and compare the timing to that of the serial binary.
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
nThreadMax = omp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
    // Parallel code path
} else {
    // Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test if it does matter, with overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
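To keep a single source tree that builds both with and without OpenMP, one common arrangement is to guard the runtime-library calls with the standard _OPENMP macro, which an OpenMP-enabled compiler defines. The sketch below assumes that arrangement; max_threads, compute and the body of expensive_function are illustrative names, not part of any API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>  // only needed when the compiler is in OpenMP mode
#endif

// Number of threads a parallel region would use; 1 in a serial build.
int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Hypothetical stand-in for the expensive per-element function.
double expensive_function(int i) { return 0.5 * i * i; }

std::vector<double> compute(int n) {
    assert(n >= 0);
    std::vector<double> data(static_cast<std::size_t>(n));
    // Ignored by a compiler that is not in OpenMP mode, so the same source
    // yields both the threaded and the serial binary.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] = expensive_function(i);
    return data;
}
```

Building this once with and once without the compiler's OpenMP switch gives the two binaries needed for the timing comparison described in the summary.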

I believe there is a function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work on. Then there are many ways to disable OpenMP. From your code you can run:
omp_set_num_threads(1)
Just remember that even on a single core you can sometimes get a boost with OpenMP. It depends on the specifics of your case.

Intel TBB disable nested parallelism

Consider the following scenario: I am writing a function, within which there is a computationally intensive loop. I parallelized it with TBB's parallel_for. Now, the problem is that this function may be used on its own, where it benefits from the parallelization, or it may be used within another loop. In the latter case, the outer loop can also be parallelized, and often it is better to parallelize only the outer loop.
Normally in TBB, parallelizing both the outer and the inner loop is not a problem since, unlike OpenMP, nested parallelization in TBB does not result in additional threads being created; TBB merely creates more tasks. However, sometimes the overhead of creating more tasks in the inner loop is still undesirable (I observed a 40% slowdown in one extreme situation).
So is there a way to make TBB not create any tasks when parallel_for etc. is invoked while another parallel_for algorithm is executing? Something similar to the effect of OMP_NESTED=FALSE for OpenMP.

Simple answer: No
Simple advice: Don't use simple_partitioner
There is no way to affect the parallel_for or other algorithms from outside or on the outer level except restricting their concurrency via task_scheduler_init or task_arena. Though, they are not well-suited for nested parallelism in any case.
Anyway, there should not be such a big impact on performance if auto_partitioner is used (especially on the nested level) and you follow the TBB recommendations on the amount of work that is efficient to parallelize.
Though I admit that in extreme cases it can be a problem. We (TBB developers) thought about optimizing the auto partitioning parameters of parallel_for depending on the context in which it is executed. But the issue is that knowing whether we are on the nested level or not is not enough to reliably define the parameters. E.g. consider a parallel_for launched from a single task: formally it is nested, but there is no parallelism on the outer level. Some parts of the task scheduler would need to be significantly reworked to provide information about the number of busy workers at any given time in order to enable this idea.

Parallelization efficiency of openMP

I have C++ code containing many for-loops parallelized with OpenMP on an 8-thread computer.
But execution with a single thread is faster than with 8 parallel threads. I was told that if the load of the for-loops increases, parallelization will become efficient.
By load I mean, for example, the maximum number of iterations of a loop. The thing is, I don't have a chance to compare the single-thread and 8-thread code on a huge amount of data.
Should I use parallel code anyway? Is it true that parallelization efficiency will increase with the load of the for-loops?

The canonical use case for OpenMP is the distribution among a team of threads of the iterations of a high iteration count loop with the condition that the loop iterations have no direct or indirect dependencies.
You can spot what I mean by direct dependencies by considering the question "Does the order of loop iteration execution affect the results?". If, for example, iteration N+1 uses the results of iteration N, you have such a dependency: running the loop iterations in reverse order will change the output of the routine.
By indirect dependencies I mean mainly data races, in which threads have to coordinate their access to shared data, in particular they have to ensure that writes to shared variables happen in the correct sequence.
In many cases you can redesign a loop-with-dependencies to remove those dependencies.
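As a minimal sketch of such a redesign (the function names are invented for illustration): the first loop below carries a value from one iteration to the next, so the iteration order matters; the second computes the same value directly from the loop index, which removes the dependency and makes the loop a candidate for a work-sharing construct.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct dependency: iteration i uses the value of x left behind by
// iteration i-1, so the iterations cannot be reordered.
std::vector<double> grid_serial(double x0, double step, int n) {
    assert(n >= 0);
    std::vector<double> g(static_cast<std::size_t>(n));
    double x = x0;
    for (int i = 0; i < n; ++i) {
        g[i] = x;
        x += step;
    }
    return g;
}

// Redesigned: each value is computed from the loop index alone, so the
// iterations are independent and may run in any order.
std::vector<double> grid_parallel(double x0, double step, int n) {
    assert(n >= 0);
    std::vector<double> g(static_cast<std::size_t>(n));
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        g[i] = x0 + i * step;
    return g;
}
```

For a step such as 0.5 both versions agree exactly; for steps that are not exactly representable the two can differ in the last bits, because repeated addition and multiplication round differently.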
IF you have a high iteration count loop which has no such dependencies THEN you have a candidate for good speed-up with OpenMP. Here are the buts:
There is some parallel overhead at the start and end of each such loop. If the loop count isn't high enough, this overhead may outweigh, partially or wholly, the speedup from running the iterations in parallel. The only way to determine whether this is affecting your code is to test and measure.
There can be dependencies between loop iterations more subtle than I have already outlined. Depending on your system architecture and the computations inside the loop you might (without realising it) program your threads to fight over access to cache or to I/O resources, or to any other resource. In the worst cases this can lead to increasing the number of threads leading to decreasing execution rate.
You have to make sure that each OpenMP thread is backed up by hardware, not by the pseudo-hardware that hyperthreading represents. One core per OpenMP thread, hyperthreading is snake oil in this domain.
I expect there are other buts to put in here, perhaps someone else will help out.
Now, turning to your questions:
Should I use parallel code anyway? Test and measure.
Is it true that parallelization efficiency will increase with load of for-loops? Approximately, but for your code on your hardware, test and measure.
Finally, you can't become a serious parallel computationalist without measuring run times under various combinations of circumstances and learning what the measurements you make are telling you. If you can't compare sequential and parallel execution for huge amounts of data, you'll have to measure them for modest amounts of data and understand the lessons you learn before making predictions about behaviour when dealing with huge amounts of data.
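A measuring harness does not need to be elaborate. The sketch below times a stand-in loop over several problem sizes with std::chrono::steady_clock; work() is a hypothetical placeholder for one of your own loops, and Sample and sweep are illustrative names.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical stand-in for a loop from the real program.
double work(long n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += std::sqrt(static_cast<double>(i));
    return sum;
}

struct Sample {
    long   n;        // problem size
    double seconds;  // wall-clock time of one run
};

// Time work(n) once for each requested size using a monotonic clock.
std::vector<Sample> sweep(const std::vector<long>& sizes) {
    std::vector<Sample> samples;
    volatile double sink = 0.0;  // keeps the compiler from discarding work()
    for (long n : sizes) {
        assert(n >= 0);
        const auto t0 = std::chrono::steady_clock::now();
        sink = work(n);
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back({n, std::chrono::duration<double>(t1 - t0).count()});
    }
    (void)sink;
    return samples;
}
```

Running the same sweep with OMP_NUM_THREADS=1 and with OMP_NUM_THREADS=8 shows at which problem size, if any, the parallel version starts to win on your hardware. Single runs are noisy, so repeat each size a few times before drawing conclusions.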

Multithreaded image processing in C++

I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur). This is done on a per-pixel basis.
Such image manipulations are very stressful on the CPU. I would like to use multithreading to speed things up. How would I do this? I was thinking of creating one thread per row of pixels.
I have several requirements:
Executable size must be minimized. In other words, I can't use massive libraries. What's the most light-weight, portable threading library for C/C++?
Executable size must be minimized. I was thinking of having a function forEachRow(fp* ) which runs a thread for each row, or even a forEachPixel(fp* ) where fp operates on a single pixel in its own thread. Which is best?
Should I use normal functions or functors or functionoids or some lambda functions or ... something else?
Some operations use optimizations which require information from the previous pixel processed. This makes forEachRow favorable. Would using forEachPixel be better even considering this?
Would I need to lock my read-only and write-only arrays?
The input is only read from, but many operations require input from more than one pixel in the array.
The output is only written once per pixel.
Speed is also important (of course), but optimizing executable size takes precedence.
Thanks.
More information on this topic for the curious: C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks

Don't embark on threading lightly! The race conditions can be a major pain in the arse to figure out. Especially if you don't have a lot of experience with threads! (You've been warned: Here be dragons! Big hairy non-deterministic impossible-to-reliably-reproduce dragons!)
Do you know what deadlock is? How about Livelock?
That said...
As ckarmann and others have already suggested: Use a work-queue model. One thread per CPU core. Break the work up into N chunks. Make the chunks reasonably large, like many rows. As each thread becomes free, it snags the next work chunk off the queue.
In the simplest IDEAL version, you have N cores, N threads, and N subparts of the problem with each thread knowing from the start exactly what it's going to do.
But that doesn't usually happen in practice due to the overhead of starting/stopping threads. You really want the threads to already be spawned and waiting for action. (E.g. Through a semaphore.)
The work-queue model itself is quite powerful. It lets you parallelize things like quick-sort, which normally doesn't parallelize across N threads/cores gracefully.
More threads than cores? You're just wasting overhead. Each thread has overhead. Even at #threads=#cores, you will never achieve a perfect Nx speedup factor.
One thread per row would be very inefficient! One thread per pixel? I don't even want to think about it. (That per-pixel approach makes a lot more sense when playing with vectorized processor units like they had on the old Crays. But not with threads!)
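The work-queue idea can be sketched in a few lines. This version uses C++11 std::thread and std::atomic instead of pthreads and semaphores (a substitution on my part, not what the answer itself uses), and the "queue" is simply an atomic counter from which each worker claims the next chunk of rows. The names for_each_row and rows_per_chunk are made up.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Apply fn(row) to every row in [0, rows), handing out chunks of
// rows_per_chunk rows to a fixed number of worker threads.
template <class RowFn>
void for_each_row(std::size_t rows, std::size_t rows_per_chunk,
                  unsigned workers, RowFn fn) {
    assert(rows_per_chunk > 0 && workers > 0);
    std::atomic<std::size_t> next{0};  // first row of the next unclaimed chunk

    auto worker = [&] {
        for (;;) {
            // Claim a chunk; fetch_add guarantees no two threads get the same one.
            const std::size_t begin = next.fetch_add(rows_per_chunk);
            if (begin >= rows) return;  // queue drained
            const std::size_t end = std::min(rows, begin + rows_per_chunk);
            for (std::size_t r = begin; r < end; ++r) fn(r);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 1; t < workers; ++t) pool.emplace_back(worker);
    worker();  // the calling thread takes part as well
    for (auto& th : pool) th.join();
}
```

Called as for_each_row(height, 16, 4, process_row), this uses four threads in total and sixteen-row chunks. Unlike the pre-spawned pool the answer recommends, this sketch starts and joins its threads on every call, which is only acceptable when each call does a lot of work.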
Libraries? What's your platform? Under Unix/Linux/g++ I'd suggest pthreads & semaphores. (Pthreads is also available under Windows with a Microsoft compatibility layer. But, ugh. I don't really trust it! Cygwin might be a better choice there.)
Under Unix/Linux, man:
* pthread_create, pthread_detach.
* pthread_mutexattr_init, pthread_mutexattr_settype, pthread_mutex_init,
* pthread_mutexattr_destroy, pthread_mutex_destroy, pthread_mutex_lock,
* pthread_mutex_trylock, pthread_mutex_unlock, pthread_mutex_timedlock.
* sem_init, sem_destroy, sem_post, sem_wait, sem_trywait, sem_timedwait.
Some folks like pthreads' condition variables. But I always preferred POSIX 1003.1b semaphores. They handle the situation where you want to signal another thread BEFORE it starts waiting somewhat better. Or where another thread is signaled multiple times.
Oh, and do yourself a favor: Wrap your thread/mutex/semaphore pthread calls into a couple of C++ classes. That will simplify matters a lot!
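As an illustration of such a wrapper, here is a minimal counting semaphore class. It is built on the C++11 standard library (std::mutex and std::condition_variable) rather than on the POSIX sem_* calls listed above, which is my substitution, but it shows the property mentioned earlier: a signal posted before anyone waits is remembered.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Minimal counting semaphore built on the C++11 standard library.
class Semaphore {
public:
    explicit Semaphore(unsigned initial = 0) : count_(initial) {}

    // Signal: the count is kept even if nobody is waiting yet.
    void post() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++count_;
        }
        cv_.notify_one();
    }

    // Block until a signal is available, then consume it.
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

    // Consume a signal if one is available; never blocks.
    bool try_wait() {
        std::lock_guard<std::mutex> lock(m_);
        if (count_ == 0) return false;
        --count_;
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned count_;
};

// Usage: the signal arrives before the wait and is not lost.
void semaphore_example() {
    Semaphore ready;
    ready.post();
    ready.wait();  // returns at once
    assert(!ready.try_wait());
}
```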
Would I need to lock my read-only and write-only arrays?
It depends on your precise hardware & software. Usually read-only arrays can be freely shared between threads. But there are cases where that is not so.
Writing is much the same. Usually, as long as only one thread is writing to each particular memory spot, you are ok. But there are cases where that is not so!
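Writing is more troublesome than reading as you can get into these weird fencepost situations. Memory is often written as words, not bytes. When one thread writes part of the word and another writes a different part, then depending on the exact timing of which thread does what when (i.e. nondeterministically), you can get some very unpredictable results! (Since C++11 the language does guarantee that two threads may write distinct objects without interfering; adjacent bit-fields are the exception, and neighbouring writes can still slow each other down through false sharing.)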
I'd play it safe: Give each thread its own copy of the read and write areas. After they are done, copy the data back. All under mutex, of course.
Unless you are talking about gigabytes of data, memory blits are very fast. That couple of microseconds of performance time just isn't worth the debugging nightmare.
If you were to share one common data area between threads using mutexes, the collision/waiting mutex inefficiencies would pile up and devastate your efficiency!
Look, clean data boundaries are the essence of good multi-threaded code. When your boundaries aren't clear, that's when you get into trouble.
Similarly, it's essential to keep everything on the boundary mutexed! And to keep the mutexed areas short!
Try to avoid locking more than one mutex at the same time. If you do lock more than one mutex, always lock them in the same order!
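If two mutexes really must be held together, the C++11 standard library offers std::lock, which acquires all of its arguments using a deadlock-avoidance algorithm, so the caller does not have to maintain a global locking order by hand. The sketch below uses it in place of manual ordering; Account and transfer are invented for illustration.

```cpp
#include <cassert>
#include <mutex>

// Illustrative shared object guarded by its own mutex.
struct Account {
    std::mutex m;
    long balance = 0;
};

void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);  // locking the same mutex twice would be an error
    // std::lock takes both mutexes without risk of deadlock, whichever
    // order two concurrent callers pass them in.
    std::lock(from.m, to.m);
    // adopt_lock: the guards take over the already-held locks and release
    // them on scope exit, keeping the locked region short.
    std::lock_guard<std::mutex> hold_from(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}
```

With this, transfer(a, b, x) in one thread and transfer(b, a, y) in another cannot deadlock against each other.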
Where possible use ERROR-CHECKING or RECURSIVE mutexes. FAST mutexes are just asking for trouble, with very little actual (measured) speed gain.
If you get into a deadlock situation, run it in gdb, hit ctrl-c, visit each thread and backtrace. You can find the problem quite quickly that way. (Livelock is much harder!)
One final suggestion: Build it single-threaded, then start optimizing. On a single-core system, you may find yourself gaining more speed from things like foo[i++]=bar ==> *(foo++)=bar than from threading.
Addendum: What I said about keeping mutexed areas short up above? Consider two threads: (Given a global shared mutex object of a Mutex class.)
/*ThreadA:*/ while(1){ mutex.lock(); printf("a\n"); usleep(100000); mutex.unlock(); }
/*ThreadB:*/ while(1){ mutex.lock(); printf("b\n"); usleep(100000); mutex.unlock(); }
What will happen?
Under my version of Linux, one thread will run continuously and the other will starve. Very, very rarely they will change places, when a context switch occurs between mutex.unlock() and mutex.lock().
Addendum: In your case, this is unlikely to be an issue. But with other problems one may not know in advance how long a particular work-chunk will take to complete. Breaking a problem down into 100 parts (instead of 4 parts) and using a work-queue to split it up across 4 cores smooths out such discrepancies.
If one work-chunk takes 5 times longer to complete than another, well, it all evens out in the end. Though with too many chunks, the overhead of acquiring new work-chunks creates noticeable delays. It's a problem-specific balancing act.

If your compiler supports OpenMP (I know VC++ 8.0 and 9.0 do, as does gcc), it can make things like this much easier to do.
You don't just want to make a lot of threads - there's a point of diminishing returns where adding new threads slows things down as you start getting more and more context switches. At some point, using too many threads can actually make the parallel version slower than just using a serial algorithm. The optimal number of threads is a function of the number of CPUs/cores available and the percentage of time each thread spends blocked on things like I/O. Take a look at this article by Herb Sutter for some discussion on parallel performance gains.
OpenMP lets you easily adapt the number of threads created to the number of CPUs available. Using it (especially in data-processing cases) often involves simply putting in a few #pragma omps in existing code, and letting the compiler handle creating threads and synchronization.
In general - as long as data isn't changing, you won't have to lock read-only data. If you can be sure that each pixel slot will only be written once and you can guarantee that all the writing has been completed before you start reading from the result, you won't have to lock that either.
For OpenMP, there's no need to do anything special as far as functors / function objects. Write it whichever way makes the most sense to you. Here's an image-processing example from Intel (converts rgb to grayscale):
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{
    pGrayScaleBitmap[i] = (unsigned BYTE)
        (pRGBBitmap[i].red * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue * 0.114);
}
This automatically splits up into as many threads as you have CPUs, and assigns a section of the array to each thread.
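The same pattern covers operations like the blur from the question, where each output pixel reads several input pixels. A minimal sketch, assuming an 8-bit grayscale image stored row by row (blur_rows and its layout are illustrative, not taken from the Intel example): the input is never modified and every output pixel is written exactly once, so no locking is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3-tap horizontal box blur with the edge pixels clamped.
std::vector<std::uint8_t> blur_rows(const std::vector<std::uint8_t>& in,
                                    int width, int height) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out(in.size());

    // Rows are independent: each thread reads shared input and writes only
    // the output pixels of its own rows.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        const std::size_t row = static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x) {
            const int xl = x > 0 ? x - 1 : x;          // clamp at left edge
            const int xr = x < width - 1 ? x + 1 : x;  // clamp at right edge
            const int sum = in[row + xl] + in[row + x] + in[row + xr];
            out[row + x] = static_cast<std::uint8_t>(sum / 3);
        }
    }
    return out;
}
```

Parallelizing over rows rather than pixels keeps the per-thread chunks large, in line with the advice in the other answers.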

I would recommend boost::thread and boost::gil (the Generic Image Library). Because quite a lot of templates are involved, I'm not sure whether the code size will still be acceptable for you. But it's part of Boost, so it is probably worth a look.

As a bit of a left-field idea...
What systems are you running this on? Have you thought of using the GPU in your PCs?
Nvidia have the CUDA APIs for this sort of thing

I don't think you want to have one thread per row. There can be a lot of rows, and you will spend a lot of memory/CPU resources just launching/destroying the threads and having the CPU switch from one to the other. Moreover, if you have P processors with C cores each, you probably won't gain much with more than C*P threads.
I would advise you to use a defined number of client threads, for example N threads, and use the main thread of your application to distribute the rows to each thread, or they can simply get instruction from a "job queue". When a thread has finished with a row, it can check in this queue for another row to do.
As for libraries, you can use boost::thread, which is quite portable and not too heavyweight.

Can I ask which platform you're writing this for? I'm guessing that because executable size is an issue you're not targeting a desktop machine. In that case, does the platform have multiple cores or hyperthreading? If not, then adding threads to your application could have the opposite effect and slow it down...

To optimize simple image transformations, you are far better off using SIMD vector math than trying to multi-thread your program.

If your compiler doesn't support OpenMP, another option is to use a library approach; both Intel's Threading Building Blocks and the Microsoft Concurrency Runtime are available (VS 2010).
There is also a set of interfaces called the Parallel Patterns Library, which is supported by both libraries and provides a templated parallel_for library call.
so instead of:
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{ ...}
you would write:
parallel_for(0,numPixels,1,ToGrayScale());
where ToGrayScale is a functor or pointer to function. (Note: if your compiler supports lambda expressions, which it likely doesn't, you can inline the functor as a lambda expression.)
parallel_for(0,numPixels,1,[&](int i)
{
    pGrayScaleBitmap[i] = (unsigned BYTE)
        (pRGBBitmap[i].red * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue * 0.114);
});
-Rick

Check the Creating an Image-Processing Network walkthrough on MSDN, which explains how to use Parallel Patterns Library to compose a concurrent image processing pipeline.
I'd also suggest Boost.GIL, which generates highly efficient code. For a simple multi-threaded example, check gil_threaded by Victor Bogado. The article "An image processing network using Dataflow.Signals and Boost.GIL" explains an interesting dataflow model too.

One thread per pixel row is insane. It is best to have around n-1 to 2n threads (for n CPUs), and make each one loop, fetching one job unit at a time (which may be one row, or some other kind of partition).
On Unix-like systems, use pthreads; it's simple and lightweight.

Maybe write your own tiny library which implements a few standard threading functions using #ifdef's for every platform? There really isn't much to it, and that would reduce the executable size way more than any library you could use.
Update: And for work distribution - split your image into pieces and give each thread a piece. So that when it's done with the piece, it's done. This way you avoid implementing job queues that will further increase your executable's size.
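The static split described in the update comes down to a little index arithmetic. A minimal sketch (split_rows is a made-up helper): it divides the rows into contiguous half-open ranges whose sizes differ by at most one, and each range can then be handed to one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split `rows` rows into `pieces` contiguous ranges [first, last) whose
// sizes differ by at most one row.
std::vector<std::pair<std::size_t, std::size_t>>
split_rows(std::size_t rows, std::size_t pieces) {
    assert(pieces > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t base  = rows / pieces;
    const std::size_t extra = rows % pieces;  // the first `extra` pieces get one more row
    std::size_t begin = 0;
    for (std::size_t p = 0; p < pieces; ++p) {
        const std::size_t len = base + (p < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

For 10 rows and 4 threads this yields the ranges [0,3), [3,6), [6,8) and [8,10). The trade-off against a job queue is that a slow piece cannot be rebalanced once the split is fixed.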

Regardless of the threading model you choose (boost, pthread, native threads, etc.), I think you should consider a thread pool as opposed to a thread per row. Threads in a thread pool are very cheap to "start" since they are already created as far as the OS is concerned; it's just a matter of giving one something to do.
Basically, you could have say 4 threads in your pool. Then in a serial fashion, for each pixel, tell the next thread in the thread pool to process the pixel. This way you are effectively processing no more than 4 pixels at a time. You could make the size of the pool based either on user preference or on the number of CPUs the system reports.
This is by far the simplest way IMHO to add threading to a SIMD task.

I think map/reduce framework will be the ideal thing to use in this situation. You can use Hadoop streaming to use your existing C++ application.
Just implement the map and reduce jobs.
As you said, you can use row-level manipulations as a map task and combine the row-level manipulations into the final image in the reduce task.
Hope this is useful.

It is very possible that the bottleneck is not the CPU but memory bandwidth, in which case multi-threading WON'T help a lot. Try to minimize memory accesses and work on limited memory blocks, so that more data can be cached. I had a similar problem a while ago and decided to optimize my code to use SSE instructions. The speed increase was almost 4x per single thread!

You could also use libraries like IPP or the Cassandra Vision C++ API, which are mostly much more optimized than your own code.

There's another option of using assembly for optimization. Now, one exciting project for dynamic code generation is softwire (which dates back awhile - here is the original project's site). It has been developed by Nick Capens and grew into now commercially available swiftshader. But the spin-off of the original softwire is still available on gna.org.
This could serve as an introduction to his solution.
Personally, I don't believe you can gain significant performance by utilizing multiple threads for your problem.

Explicit code parallelism in c++

Out of order execution in CPUs means that a CPU can reorder instructions to gain better performance and it means the CPU is having to do some very nifty bookkeeping and such. There are other processor approaches too, such as hyper-threading.
Some fancy compilers understand the (un)interrelatedness of instructions to a limited extent, and will automatically interleave instruction flows (probably over a longer window than the CPU sees) to better utilise the processor. Deliberate compile-time interleaving of floating and integer instructions is another example of this.
Now I have a highly-parallel task. And I typically have an ageing single-core x86 processor without hyper-threading.
Is there a straightforward way to get the body of my 'for' loop for this highly-parallel task interleaved, so that two (or more) iterations are done together? (This is slightly different from 'loop unwinding' as I understand it.)
My task is a 'virtual machine' running through a set of instructions, which I'll really simplify for illustration as:
void run(int num) {
    for(int n=0; n<num; n++) {
        vm_t data(n);
        for(int i=0; i<data.len(); i++) {
            data.insn(i).parse();
            data.insn(i).eval();
        }
    }
}
So the execution trail might look like this:
data(1) insn(0) parse
data(1) insn(0) eval
data(1) insn(1) parse
...
data(2) insn(1) eval
data(2) insn(2) parse
data(2) insn(2) eval
Now, what I'd like is to be able to do two (or more) iterations explicitly in parallel:
data(1) insn(0) parse
data(2) insn(0) parse \ processor can do OOO as these two flow in
data(1) insn(0) eval /
data(2) insn(0) eval \ OOO opportunity here too
data(1) insn(1) parse /
data(2) insn(1) parse
I know, from profiling, (e.g. using Callgrind with --simulate-cache=yes), that parsing is about random memory accesses (cache missing) and eval is about doing ops in registers and then writing results back. Each step is several thousand instructions long. So if I can intermingle the two steps for two iterations at once, the processor will hopefully have something to do whilst the cache misses of the parse step are occurring...
Is there some c++ template madness to get this kind of explicit parallelism generated?
Of course I can do the interleaving - and even staggering - myself in code, but it makes for much less readable code. And if I really want unreadable, I can go so far as assembler! But surely there is some pattern for this kind of thing?

Given optimizing compilers and pipelined processors, I would suggest you just write clear, readable code.

Your best plan may be to look into OpenMP. It basically allows you to insert "pragmas" into your code which tell the compiler how it can split between processors.

Hyperthreading is a much higher-level system than instruction reordering. It makes the processor look like two processors to the operating system, so you'd need to use an actual threading library to take advantage of that. The same thing naturally applies to multicore processors.
If you don't want to use low-level threading libraries and instead want to use a task-based parallel system (and it sounds like that's what you're after) I'd suggest looking at OpenMP or Intel's Threading Building Blocks.
TBB is a library, so it can be used with any modern C++ compiler. OpenMP is a set of compiler extensions, so you need a compiler that supports it. GCC/G++ does from version 4.2 onward. Recent versions of the Intel and Microsoft compilers also support it. I don't know about any others, though.
EDIT: One other note. Using a system like TBB or OpenMP will scale the processing as much as possible - that is, if you have 100 objects to work on, they'll get split about 50/50 in a two-core system, 25/25/25/25 in a four-core system, etc.

Modern processors like the Core 2 have an enormous instruction reorder buffer on the order of nearly 100 instructions; even if the compiler is rather dumb the CPU can still make up for it.
The main issue would be if the code used a lot of registers, in which case the register pressure could force the code to be executed in sequence even if theoretically it could be done in parallel.

There is no support for parallel execution in the current C++ standard. This will change for the next version of the standard, due out next year or so.
However, I don't see what you are trying to accomplish. Are you referring to one single-core processor, or multiple processors or cores? If you have only one core, you should do whatever gets the fewest cache misses, which means whatever approach uses the smallest memory working set. This would probably be either doing all the parsing followed by all the evaluation, or doing the parsing and evaluation alternately.
If you have two cores, and want to use them efficiently, you're going to have to either use a particularly smart compiler or language extensions. Is there one particular operating system you're developing for, or should this be for multiple systems?

It sounds like you ran into the same problem chip designers face: Executing a single instruction takes a lot of effort, but it involves a bunch of different steps that can be strung together in an execution pipeline. (It is easier to execute things in parallel when you can build them out of separate blocks of hardware.)
The most obvious way is to split each task into different threads. You might want to create a single thread to execute each instruction to completion, or create one thread for each of your two execution steps and pass data between them. In either case, you'll have to be very careful with how you share data between threads and make sure to handle the case where one instruction affects the result of the following instruction. Even though you only have one core and only one thread can be running at any given time, your operating system should be able to schedule compute-intense threads while other threads are waiting for their cache misses.
(A few hours of your time would probably pay for a single very fast computer, but if you're trying to deploy it widely on cheap hardware it might make sense to consider the problem the way you're looking at it. Regardless, it's an interesting problem to consider.)

Take a look at Cilk. It's an extension to ANSI C that has some nice constructs for writing parallelized code in C. However, since it's an extension of C, it has very limited compiler support and can be tricky to work with.

This answer was written assuming the question does not contain the part "And I typically have an ageing single-core x86 processor without hyper-threading.". I hope it might help other people who want to parallelize highly-parallel tasks, but target dual/multicore CPUs.
As already posted in another answer, OpenMP is a portable way to do this. However, my experience is that OpenMP overhead is quite high and it is very easy to beat it by rolling a DIY (Do It Yourself) implementation. Hopefully OpenMP will improve over time, but as it is now, I would not recommend using it for anything other than prototyping.
Given the nature of your task, what you want is most likely data-based parallelism, which in my experience is quite easy - the programming style can be very similar to single-core code, because you know what the other threads are doing, which makes maintaining thread safety a lot easier. An approach which worked for me: avoid dependencies and call only thread-safe functions from the loop.
To create a DIY OpenMP-style parallel loop you need to:
as a preparation create a serial for loop template and change your code to use functors to implement the loop bodies. This can be tedious, as you need to pass all references across the functor object
create a virtual JobItem interface for the functor, and inherit your functors from this interface
create a thread function which is able to process individual JobItem objects
create a pool of threads using this thread function
experiment with various synchronization primitives to see which works best for you. While a semaphore is very easy to use, its overhead is quite significant, and if your loop body is very short, you do not want to pay this overhead for each loop iteration. What worked great for me was a combination of a manual-reset event + an atomic (interlocked) counter as a much faster alternative.
experiment with various JobItem scheduling strategies. If you have a long enough loop, it is better if each thread picks up multiple successive JobItems at a time. This reduces the synchronization overhead and at the same time makes the threads more cache friendly. You may also want to do this in some dynamic way, reducing the length of the scheduled sequence as you exhaust your tasks, or letting individual threads steal items from other threads' schedules.
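A stripped-down sketch of the steps above, using C++11 std::thread and std::atomic as stand-ins for whatever native primitives you prefer: a JobItem interface, one example functor, a thread function that claims several successive items at a time through an atomic counter, and a driver that runs it on several threads. The names SquareJob, process_jobs and run_parallel are invented. For brevity the threads are created per call instead of being kept waiting in a pool on an event, so this shows the scheduling part only.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Interface implemented by every loop-body functor.
struct JobItem {
    virtual ~JobItem() = default;
    virtual void run() = 0;
};

// Example functor: everything the loop body needs is carried by the object.
struct SquareJob : JobItem {
    int* out;
    int  value;
    SquareJob(int* o, int v) : out(o), value(v) {}
    void run() override { *out = value * value; }
};

// Thread function: claims `batch` successive items per counter update.
void process_jobs(const std::vector<JobItem*>& jobs,
                  std::atomic<std::size_t>& next, std::size_t batch) {
    for (;;) {
        const std::size_t begin = next.fetch_add(batch);
        if (begin >= jobs.size()) return;  // nothing left to claim
        const std::size_t end = std::min(jobs.size(), begin + batch);
        for (std::size_t i = begin; i < end; ++i) jobs[i]->run();
    }
}

void run_parallel(const std::vector<JobItem*>& jobs,
                  unsigned threads, std::size_t batch) {
    assert(threads > 0 && batch > 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 1; t < threads; ++t)
        pool.emplace_back(process_jobs, std::cref(jobs), std::ref(next), batch);
    process_jobs(jobs, next, batch);  // the caller works too
    for (auto& th : pool) th.join();
}
```

A larger batch means fewer counter updates and better cache locality per thread, at the price of coarser load balancing near the end of the loop.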
