OpenCL clEnqueueTask Parallelism - C++

I am trying to write some code that does AES Decryption. I have the code working but I wanted to be able to add Cipher Block Chaining which requires that I do an XOR operation after the decryption.
To make the code easier to write and understand I wrote it using two kernels: one that does the decryption of a single block and one that does the XOR for the CBC part. I then submitted these to the queue via clEnqueueTask for each 16-byte block of data, with the dependency between the decryption and the XOR specified by an event.
This turns out to be very slow. It works in the sense that it executes them in the correct order, but it does not seem to parallelize the execution.
Does anyone know why or how to improve the performance without losing the granularity?

clEnqueueTask is typically used for single-threaded tasks.
If your kernels can execute in parallel, use one clEnqueueNDRangeKernel call instead of lots of clEnqueueTask calls with different parameters.
Something else that might prevent good parallel performance is lots of global memory access. If you are doing lots of reads/writes of global memory in your kernel compared to the amount of computation, that might be slowing you down depending on your hardware.
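For the AES case, that means mapping one work-item to each 16-byte block and launching the whole range at once. Below is a minimal host-side sketch, with hypothetical names (decryptKernel, cipherBuf, roundKeyBuf, numBlocks); inside the kernel, get_global_id(0) would be the block index.

#include <CL/cl.h>

// Hedged sketch: one NDRange launch covers every 16-byte block instead of
// one clEnqueueTask per block. All names below are placeholders.
cl_int enqueue_decrypt_all_blocks(cl_command_queue queue, cl_kernel decryptKernel,
                                  cl_mem cipherBuf, cl_mem roundKeyBuf,
                                  size_t numBlocks, cl_event* decryptDone)
{
    cl_int err = clSetKernelArg(decryptKernel, 0, sizeof(cl_mem), &cipherBuf);
    if (err != CL_SUCCESS) return err;
    err = clSetKernelArg(decryptKernel, 1, sizeof(cl_mem), &roundKeyBuf);
    if (err != CL_SUCCESS) return err;

    size_t globalSize = numBlocks;          // one work-item per 16-byte block
    return clEnqueueNDRangeKernel(queue, decryptKernel,
                                  1,        // 1-D index space
                                  NULL,     // no offset
                                  &globalSize,
                                  NULL,     // let the runtime pick the work-group size
                                  0, NULL, decryptDone);
}

Since CBC decryption XORs each decrypted block with the previous ciphertext block, which is already known before the kernel runs, the XOR pass can likewise be a single NDRange launch (or be folded into the decrypt kernel) rather than one task per block.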

The kernel executed via clEnqueueTask is essentially single-threaded, which means the global work-size is 1 and the task occupies a whole compute unit for this single thread. This can have a great impact on performance, because on a typical GPU you can execute 8-16 tasks/work-groups in parallel (CL_DEVICE_MAX_COMPUTE_UNITS), and within a work-group you can execute 256-1024 work-items (CL_DEVICE_MAX_WORK_GROUP_SIZE). So in your case you achieve at most 8-16x parallelism instead of the theoretical maximum of around 15000x, because you cannot utilize the whole hardware.

Related

OpenCL Kernel performance is very bad. Why is my code better without OpenCL?

I'm writing an ant simulation.
The kernel performance is very bad. In comparison to a standard C++ solution it has a big performance disadvantage.
I don't understand why. The operations in the kernel are mostly without control structures (like if/else).
Kernels:
https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Ant.cl
https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Pheromon.cl
I made a benchmark, and the OpenCL Kernel Performance is very bad.
(Left Axis: Execution time in ms, Bottom Axis: number of simulated Ants)
Can you give me advice?
You can find the whole code in the git repo, if you are interested (the OpenCL stuff is happening here: https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/clInitFunctions.cpp).
Thanks :)
You have a lot of if/else, can't you write it in a different way?
Going further down the if/else path will not get you anywhere.
You need to make sure the GPU only executes useful instructions, not millions of if/else branches.
It may be better to keep track of only the ants that are alive in the grid, store their coordinates, and move just those around.
You will obviously also need a map with the ant positions and statuses, so you will need a multi-kernel system.
In addition, you have a lot of wasted memory transfers, starting with using int variables to store single booleans. This can mean that 90% of the transferred data is useless, which can bottleneck the GPU.
Your OpenCL kernels have ifs. Current GPUs handle that badly. AFAIK an AMD GPU has n groups of 64 cores that share the same instruction pointer (they are executing the exact same part of the exact same statement). Ifs are implemented by stopping some of the cores, executing the true branch, then stopping the others and executing the false branch. Imagine this with nested ifs or loops.
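To make that advice concrete, here is a small hedged sketch (all names invented) of the usual rewrite: turn a data-dependent branch into an arithmetic blend so every lane in the wavefront executes the same instructions. In OpenCL C the built-in select() expresses the same idea.

// Hypothetical example: updating a pheromone value without a divergent branch.
// With a plain if/else, the GPU walks both branches for the whole wavefront,
// masking lanes off; the blend below keeps every lane doing useful work.
void update_pheromone(float* pheromone, const int* hasFood,
                      float strong, float weak, int n)
{
    for (int idx = 0; idx < n; ++idx) {
        // hasFood[idx] is 0 or 1, so this blends between weak and strong
        float amount = weak + (strong - weak) * static_cast<float>(hasFood[idx]);
        pheromone[idx] += amount;
    }
}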

Multithreaded mex code slower than single threaded

I am writing mex code in MATLAB to do an operation (because the operation uses a library in C++). The mex code has a section where a function is repeatedly called in a loop with a different argument value, and each function call is independent (i.e., the computation of one call does not depend on previous calls). So, to speed this up I wrote multithreaded code that creates multiple threads - the exact number of threads is equal to the number of loop iterations, which in my example is 10. Each thread computes the function in the loop for a separate value of the argument, the threads return and join, some more computation is done and a result is returned.
All this in theory should give me good speedup, but I see that the multithreaded code is a lot slower than the normal single threaded one!! I have access to very powerful 24 core machines, so this is totally baffling, because I'd expected each thread to be scheduled on a separate core.
Any ideas to what is leading to this? Any common problems/errors in code that lead to this?
Any help will be greatly appreciated.
EDIT:
To answer many doubts raised in solutions proposed by people here, I want to share some information about my code:
1. Each function call takes a few minutes, so synchronization and spawning of threads should not be an overhead here (though if there are any mitigating circumstances in this case, any info about that would be really helpful!)
2. Each thread does access common data structures, arrays and matrices, but the values in these are never overwritten. All writes are to variables, pointers, arrays, etc. that are local to the thread. So, I am guessing there shouldn't be many cache misses here?
3. There are also no mutex sections in my code, since no thread writes to any common memory location. All writes are to memory locations local to the thread.
I'm still trying to figure out the reason why my multithreaded implementation is not working :( So, any pointers/info will be really helpful!
Thanks!!
Given how general your question is, the general answer is that there are probably two effects in play:
There is large overhead involved starting and stopping threads (and synchronizing them), and the computation scaling is not enough to overcome the overhead. The total times per function call will shed some light on this issue.
Threads can compete with each other and slow down the aggregate performance. A common mechanism is "cache thrashing". Since multiple cores share the same memory controller and parts of the cache hierarchy, one thread can fill the cache with the information it needs, only to have some of that data evicted by the needs of a different thread, causing more trips to main memory. Since main memory access is so expensive, the end result is a slowdown.
I would test the job with varying numbers of threads. It may turn out, for instance, that using two threads is advantageous, but four or more is not. For more detailed answers, add more details to the question, such as type of computation, size of dataset, etc.
You didn't describe what your code does, so this is just guesswork.
Multithreading is not a miracle cure. There are a lot of ways that multithreading what was a single threaded chunk of code can be slower than the original. There's a good deal of overhead involved in spawning, synchronizing, joining, and destroying threads.
Suppose the task at hand was to add ten pairs of numbers. If you make this multithreaded by spawning a thread for each addition and then joining and destroying when the calculation is finished, your multithreaded version will be much, much slower than the original. Threading is not intended for very short duration calculations. The costs of spawning, joining, and destroying are going to overwhelm any speedup you gain by performing those simple tasks in parallel.
Another way to make things slower is to establish barriers that prevent parallel operation. A mutex, for example, to protect against multiple writers simultaneously accessing the same object. That protected code needs to be small. Make the entire body of each thread run under a single mutex and you have the equivalent of a single threaded application with a whole bunch of threading overhead added in.
Those barriers that preclude parallel execution might be present even if you didn't put them in place. Some of those barriers are in the C standard library. POSIX mandates that most library functions be thread safe; the standard only lists the functions that don't have to be thread safe. If you use such library functions in those computations, you might be better off staying single threaded because your code essentially is single threaded.
I do not think your problems are mex specific at all - this sounds like usual performance problems while programing multi-threaded code for SMPs.
To add a little to the already mentioned potential problems:
False cache line sharing: you might think that your threads work independently, while in fact they access different data within the same cache line (a padded layout that avoids this is sketched after this list). Trivial example:
/* global variable accessible by all threads */
int thread_data[nthreads];
/* inside thread function */
thread_data[thrid] = some_value;
Inefficient memory bandwidth utilization: on NUMA systems you want the CPUs to access their own memory banks. If you do not distribute the data correctly, the CPUs request memory from other CPUs. That implies communication you may not suspect is there.
Thread affinity: somewhat connected to the point above. You want your threads to be bound to their own CPUs for the entire duration of the computation. Otherwise they might be migrated by the OS, which causes overhead, and they might end up further away from the memory bank they access.
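As promised above, a sketch of the padded layout that avoids false sharing, assuming a 64-byte cache line (NTHREADS, thrid and some_value are placeholders carried over from the example):

// Each thread gets its own cache line, so a write by one thread no longer
// invalidates the line holding its neighbours' slots.
enum { NTHREADS = 8, CACHE_LINE = 64 };

struct PaddedSlot {
    int  value;
    char pad[CACHE_LINE - sizeof(int)];
};

PaddedSlot thread_data[NTHREADS];

/* inside thread function */
thread_data[thrid].value = some_value;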

The fastest way to write data while producing it

In my program I am simulating a N-body system for a large number of iterations. For each iteration I produce a set of 6N coordinates which I need to append to a file and then use for executing the next iteration. The code is written in C++ and currently makes use of ofstream's method write() to write the data in binary format at each iteration.
I am not an expert in this field, but I would like to improve this part of the program, since I am in the process of optimizing the whole code. I feel that the latency associated with writing the result of the computation at each cycle significantly slows down the performance of the software.
I'm confused because I have no experience in actual parallel programming and low level file I/O. I thought of some abstract techniques that I imagined I could implement, since I am programming for modern (possibly multi-core) machines with Unix OSes:
Writing the data in the file in chunks of n iterations (there seem to be better ways to proceed...)
Parallelizing the code with OpenMP (how to actually implement a buffer so that the threads are synchronized appropriately, and do not overlap?)
Using mmap (the file size could be huge, on the order of GBs, is this approach robust enough?)
However, I don't know how to best implement them and combine them appropriately.
Of course, writing into a file at each iteration is inefficient and will most likely slow down your computation (as a rule of thumb; it depends on your actual case).
You have to use a producer -> consumer design pattern. They will be linked by a queue, like a conveyor belt.
The producer will try to produce as fast as it can, only slowing if the consumer can't handle it.
The consumer will try to "consume" as fast as it can.
By splitting the two, you can increase performance more easily because each process is simpler and suffers less interference from the other.
If the producer is faster, you need to improve the consumer, in your case by writing into the file in the most efficient way, most likely chunk by chunk (as you said).
If the consumer is faster, you need to improve the producer, most likely by parallelizing it as you said.
There is no need to optimize both. Only optimize the slowest (the bottleneck).
Practically, you use threads and a synchronized queue between them. For implementation hints, have a look here, especially §18.12 "The Producer-Consumer Pattern".
About flow management, you'll have to add a little more complexity by selecting a "max queue size" and making the producer(s) wait if the queue does not have enough space. Beware of deadlocks then; code it carefully (see the Wikipedia link I gave about that).
Note: it's a good idea to use Boost threads, because raw threads are not very portable. (Well, they are since C++0x, but C++0x availability is not yet good.)
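A minimal sketch of that producer/consumer arrangement with a bounded queue, written here with C++11 threads for brevity (boost::thread and Boost condition variables are near drop-in equivalents); the Snapshot payload and the queue limit are invented for illustration:

#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <vector>

struct Snapshot { std::vector<double> coords; };   // the 6N coordinates of one iteration

std::queue<Snapshot> buffer;
std::mutex m;
std::condition_variable not_full, not_empty;
const std::size_t MAX_QUEUE = 16;                  // back-pressure: producer waits here
bool done = false;

void producer(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        Snapshot s;                                // ... fill s by running one iteration ...
        std::unique_lock<std::mutex> lock(m);
        not_full.wait(lock, [] { return buffer.size() < MAX_QUEUE; });
        buffer.push(std::move(s));
        not_empty.notify_one();
    }
    std::lock_guard<std::mutex> lock(m);
    done = true;
    not_empty.notify_one();
}

void consumer(std::ofstream& out) {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        not_empty.wait(lock, [] { return !buffer.empty() || done; });
        if (buffer.empty() && done) return;
        Snapshot s = std::move(buffer.front());
        buffer.pop();
        not_full.notify_one();
        lock.unlock();                             // write without holding the lock
        out.write(reinterpret_cast<const char*>(s.coords.data()),
                  s.coords.size() * sizeof(double));
    }
}

The two functions would then be run on separate threads and joined at the end of the simulation.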
It's better to split operation into two independent processes: data-producing and file-writing. Data-producing would use some buffer for iteration-wise data passing, and file-writing would use a queue to store write requests. Then, data-producing would just post a write request and go on, while file-writing would cope with the writing in the background.
Essentially, if the data is produced much faster than it can possibly be stored, you'll quickly end up holding most of it in the buffer. In that case your current approach seems quite reasonable as is, since little can be done programmatically to improve the situation.
If you don't want to play with doing stuff in different threads, you could try using aio_write(), which allows asynchronous writes. Essentially you give the OS the buffer to write, the function returns immediately, and the OS finishes the write while your program continues; you can check later to see whether the write has completed.
This solution still does suffer from the producer/consumer problem mentioned in other answers, if your algorithm is producing data faster than it can be written, eventually you will run out of memory to store the results between the algorithm and the write, so you'd have to try it and see how it works out.
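For reference, a hedged sketch of what the aio_write() route looks like on a POSIX system (fd, buffer, len and offset stand in for your own values); the buffer must stay untouched until the write has completed:

#include <aio.h>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <sys/types.h>

bool start_async_write(int fd, const char* buffer, size_t len, off_t offset,
                       aiocb& cb)
{
    std::memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = const_cast<char*>(buffer);   // must remain valid until completion
    cb.aio_nbytes = len;
    cb.aio_offset = offset;
    return aio_write(&cb) == 0;                  // queues the write and returns immediately
}

bool write_finished(const aiocb& cb)
{
    return aio_error(&cb) != EINPROGRESS;        // 0 when done, EINPROGRESS while pending
}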
"Using mmap (the file size could be huge, on the order of GBs, is this
approach robust enough?)"
mmap is the OS's method of loading programs, shared libraries and the page/swap file - it's as robust as any other file I/O and generally higher performance.
BUT on most OSes it's bad/difficult/impossible to expand the size of a mapped file while it's being used. So if you know the size of the data, or you are only reading, it's great. For a log/dump that you are continually adding to, it's less suitable - unless you know some maximum size.

Multithreaded image processing in C++

I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur). This is done on a per-pixel basis.
Such image manipulations are very CPU-intensive. I would like to use multithreading to speed things up. How would I do this? I was thinking of creating one thread per row of pixels.
I have several requirements:
Executable size must be minimized. In other words, I can't use massive libraries. What's the most light-weight, portable threading library for C/C++?
I was thinking of having a function forEachRow(fp* ) which runs a thread for each row, or even a forEachPixel(fp* ) where fp operates on a single pixel in its own thread. Which is best?
Should I use normal functions or functors or functionoids or some lambda functions or ... something else?
Some operations use optimizations which require information from the previous pixel processed. This makes forEachRow favorable. Would using forEachPixel be better even considering this?
Would I need to lock my read-only and write-only arrays?
The input is only read from, but many operations require input from more than one pixel in the array.
The output is only written to once per pixel.
Speed is also important (of course), but minimizing executable size takes precedence.
Thanks.
More information on this topic for the curious: C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks
Don't embark on threading lightly! The race conditions can be a major pain in the arse to figure out. Especially if you don't have a lot of experience with threads! (You've been warned: Here be dragons! Big hairy non-deterministic impossible-to-reliably-reproduce dragons!)
Do you know what deadlock is? How about Livelock?
That said...
As ckarmann and others have already suggested: Use a work-queue model. One thread per CPU core. Break the work up into N chunks. Make the chunks reasonably large, like many rows. As each thread becomes free, it snags the next work chunk off the queue.
In the simplest IDEAL version, you have N cores, N threads, and N subparts of the problem with each thread knowing from the start exactly what it's going to do.
But that doesn't usually happen in practice due to the overhead of starting/stopping threads. You really want the threads to already be spawned and waiting for action. (E.g. Through a semaphore.)
The work-queue model itself is quite powerful. It lets you parallelize things like quick-sort, which normally doesn't parallelize across N threads/cores gracefully.
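A compact sketch of that work-queue shape, using C++11 threads and an atomic counter for brevity (the same structure maps directly onto pthreads plus a mutex or semaphore); process_rows() is a placeholder for the per-chunk image work:

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

void process_rows(int firstRow, int lastRow);          // your per-chunk pixel work

void run_parallel(int totalRows, int chunkRows, unsigned nThreads)
{
    std::atomic<int> nextRow(0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nThreads; ++t) {
        pool.emplace_back([&] {
            for (;;) {
                int first = nextRow.fetch_add(chunkRows);   // snag the next chunk
                if (first >= totalRows) break;              // queue exhausted
                int last = std::min(first + chunkRows, totalRows);
                process_rows(first, last);
            }
        });
    }
    for (auto& th : pool) th.join();
}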
More threads than cores? You're just adding overhead. Each thread has overhead. Even at #threads=#cores, you will never achieve a perfect Nx speedup factor.
One thread per row would be very inefficient! One thread per pixel? I don't even want to think about it. (That per-pixel approach makes a lot more sense when playing with vectorized processor units like they had on the old Crays. But not with threads!)
Libraries? What's your platform? Under Unix/Linux/g++ I'd suggest pthreads & semaphores. (Pthreads is also available under Windows with a Microsoft compatibility layer. But, ugh, I don't really trust it! Cygwin might be a better choice there.)
Under Unix/Linux, man:
* pthread_create, pthread_detach.
* pthread_mutexattr_init, pthread_mutexattr_settype, pthread_mutex_init,
* pthread_mutexattr_destroy, pthread_mutex_destroy, pthread_mutex_lock,
* pthread_mutex_trylock, pthread_mutex_unlock, pthread_mutex_timedlock.
* sem_init, sem_destroy, sem_post, sem_wait, sem_trywait, sem_timedwait.
Some folks like pthreads' condition variables. But I always preferred POSIX 1003.1b semaphores. They handle the situation where you want to signal another thread BEFORE it starts waiting somewhat better. Or where another thread is signaled multiple times.
Oh, and do yourself a favor: Wrap your thread/mutex/semaphore pthread calls into a couple of C++ classes. That will simplify matters a lot!
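One possible shape for those wrapper classes (a sketch only, not a complete library): a Mutex plus a scoped lock so an unlock can never be forgotten on an early return.

#include <pthread.h>

class Mutex {
public:
    Mutex()  { pthread_mutex_init(&m_, NULL); }
    ~Mutex() { pthread_mutex_destroy(&m_); }
    void lock()   { pthread_mutex_lock(&m_); }
    void unlock() { pthread_mutex_unlock(&m_); }
private:
    pthread_mutex_t m_;
};

class ScopedLock {
public:
    explicit ScopedLock(Mutex& m) : m_(m) { m_.lock(); }
    ~ScopedLock() { m_.unlock(); }
private:
    Mutex& m_;
};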
Would I need to lock my read-only and write-only arrays?
It depends on your precise hardware & software. Usually read-only arrays can be freely shared between threads. But there are cases where that is not so.
Writing is much the same. Usually, as long as only one thread is writing to each particular memory spot, you are ok. But there are cases where that is not so!
Writing is more troublesome than reading as you can get into these weird fencepost situations. Memory is often written as words not bytes. When one thread writes part of the word, and another writes a different part, depending on the exact timing of which thread does what when (e.g. nondeterministic), you can get some very unpredictable results!
I'd play it safe: Give each thread its own copy of the read and write areas. After they are done, copy the data back. All under mutex, of course.
Unless you are talking about gigabytes of data, memory blits are very fast. That couple of microseconds of performance time just isn't worth the debugging nightmare.
If you were to share one common data area between threads using mutexes, the collision/waiting mutex inefficiencies would pile up and devastate your efficiency!
Look, clean data boundaries are the essence of good multi-threaded code. When your boundaries aren't clear, that's when you get into trouble.
Similarly, it's essential to keep everything on the boundary mutexed! And to keep the mutexed areas short!
Try to avoid locking more than one mutex at the same time. If you do lock more than one mutex, always lock them in the same order!
Where possible use ERROR-CHECKING or RECURSIVE mutexes. FAST mutexes are just asking for trouble, with very little actual (measured) speed gain.
If you get into a deadlock situation, run it in gdb, hit ctrl-c, visit each thread and backtrace. You can find the problem quite quickly that way. (Livelock is much harder!)
One final suggestion: Build it single-threaded, then start optimizing. On a single-core system, you may find yourself gaining more speed from things like foo[i++]=bar ==> *(foo++)=bar than from threading.
Addendum: What I said about keeping mutexed areas short up above? Consider two threads: (Given a global shared mutex object of a Mutex class.)
/*ThreadA:*/ while(1){ mutex.lock(); printf("a\n"); usleep(100000); mutex.unlock(); }
/*ThreadB:*/ while(1){ mutex.lock(); printf("b\n"); usleep(100000); mutex.unlock(); }
What will happen?
Under my version of Linux, one thread will run continuously and the other will starve. Very very rarely they will change places when a context swap occurs between mutex.unlock() and mutex.lock().
Addendum: In your case, this is unlikely to be an issue. But with other problems one may not know in advance how long a particular work-chunk will take to complete. Breaking a problem down into 100 parts (instead of 4 parts) and using a work-queue to split it up across 4 cores smooths out such discrepancies.
If one work-chunk takes 5 times longer to complete than another, well, it all evens out in the end. Though with too many chunks, the overhead of acquiring new work-chunks creates noticeable delays. It's a problem-specific balancing act.
If your compiler supports OpenMP (I know VC++ 8.0 and 9.0 do, as does gcc), it can make things like this much easier to do.
You don't just want to make a lot of threads - there's a point of diminishing returns where adding new threads slows things down as you start getting more and more context switches. At some point, using too many threads can actually make the parallel version slower than just using a linear algorithm. The optimal number of threads is a function of the number of cpus/cores available, and the percentage of time each thread spends blocked on things like I/O. Take a look at this article by Herb Sutter for some discussion on parallel performance gains.
OpenMP lets you easily adapt the number of threads created to the number of CPUs available. Using it (especially in data-processing cases) often involves simply putting in a few #pragma omps in existing code, and letting the compiler handle creating threads and synchronization.
In general - as long as data isn't changing, you won't have to lock read-only data. If you can be sure that each pixel slot will only be written once and you can guarantee that all the writing has been completed before you start reading from the result, you won't have to lock that either.
For OpenMP, there's no need to do anything special as far as functors / function objects. Write it whichever way makes the most sense to you. Here's an image-processing example from Intel (converts rgb to grayscale):
#pragma omp parallel for
for (i = 0; i < numPixels; i++)
{
    pGrayScaleBitmap[i] = (unsigned BYTE)
        (pRGBBitmap[i].red   * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue  * 0.114);
}
This automatically splits up into as many threads as you have CPUs, and assigns a section of the array to each thread.
I would recommend boost::thread and boost::gil (generic image library). Because there are quite a lot of templates involved, I'm not sure whether the code size will still be acceptable for you. But it's part of Boost, so it is probably worth a look.
As a bit of a left-field idea...
What systems are you running this on? Have you thought of using the GPU in your PCs?
Nvidia have the CUDA APIs for this sort of thing
I don't think you want to have one thread per row. There can be a lot of rows, and you will spend a lot of memory/CPU resources just launching/destroying threads and having the CPU switch from one to the other. Moreover, if you have P processors with C cores each, you probably won't gain much with more than C*P threads.
I would advise you to use a fixed number of worker threads, for example N threads, and use the main thread of your application to distribute the rows to each thread, or have them simply get instructions from a "job queue". When a thread has finished with a row, it can check this queue for another row to do.
As for libraries, you can use boost::thread, which is quite portable and not too heavyweight.
Can I ask which platform you're writing this for? I'm guessing that because executable size is an issue you're not targeting a desktop machine. In which case, does the platform have multiple cores or hyperthreading? If not, then adding threads to your application could have the opposite effect and slow it down...
To optimize simple image transformations, you are far better off using SIMD vector math than trying to multi-thread your program.
If your compiler doesn't support OpenMP, another option is to use a library approach; both Intel's Threading Building Blocks and Microsoft's Concurrency Runtime are available (VS 2010).
There is also a set of interfaces called the Parallel Patterns Library which is supported by both libraries, and these include a templated parallel_for library call.
so instead of:
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{ ...}
you would write:
parallel_for(0,numPixels,1,ToGrayScale());
where ToGrayScale is a functor or pointer to a function. (Note: if your compiler supports lambda expressions, which it likely doesn't, you can inline the functor as a lambda expression.)
parallel_for(0, numPixels, 1, [&](int i)
{
    pGrayScaleBitmap[i] = (unsigned BYTE)
        (pRGBBitmap[i].red   * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue  * 0.114);
});
-Rick
Check the Creating an Image-Processing Network walkthrough on MSDN, which explains how to use Parallel Patterns Library to compose a concurrent image processing pipeline.
I'd also suggest Boost.GIL, which generates highly efficient code. For a simple multi-threaded example, check gil_threaded by Victor Bogado. An image processing network using Dataflow.Signals and Boost.GIL explains an interesting dataflow model too.
One thread per pixel row is insane; it's best to have around n-1 to 2n threads (for n CPUs), and make each one loop fetching one job unit (maybe one row, or another kind of partition).
On Unix-likes, use pthreads; it's simple and lightweight.
Maybe write your own tiny library which implements a few standard threading functions using #ifdef's for every platform? There really isn't much to it, and that would reduce the executable size way more than any library you could use.
Update: And for work distribution - split your image into pieces and give each thread a piece. So that when it's done with the piece, it's done. This way you avoid implementing job queues that will further increase your executable's size.
Regardless of the threading model you choose (boost, pthread, native threads, etc.), I think you should consider a thread pool as opposed to a thread per row. Threads in a thread pool are very cheap to "start" since they are already created as far as the OS is concerned; it's just a matter of giving them something to do.
Basically, you could have say 4 threads in your pool. Then in a serial fashion, for each pixel, tell the next thread in the thread pool to process the pixel. This way you are effectively processing no more than 4 pixels at a time. You could make the size of the pool based either on user preference or on the number of CPUs the system reports.
This is by far the simplest way IMHO to add threading to a SIMD task.
I think a map/reduce framework would be the ideal thing to use in this situation. You can use Hadoop streaming to use your existing C++ application.
Just implement the map and reduce jobs.
As you said, you can use row-level manipulations as a map task and combine the row-level manipulations into the final image in the reduce task.
Hope this is useful.
It is very possible that the bottleneck is not the CPU but memory bandwidth, so multi-threading WON'T help a lot. Try to minimize memory access and work on limited memory blocks, so that more data can be cached. I had a similar problem a while ago and I decided to optimize my code to use SSE instructions. The speed increase was almost 4x per single thread!
You also could use libraries like IPP or the Cassandra Vision C++ API that are mostly much more optimized than your own code.
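As an illustration of the SSE route (a sketch under the assumption that the image has been split into planar float channels whose length is a multiple of 4), the grayscale conversion from the earlier answers can process four pixels per instruction:

#include <xmmintrin.h>

void to_grayscale_sse(const float* r, const float* g, const float* b,
                      float* gray, int n)
{
    const __m128 wr = _mm_set1_ps(0.299f);
    const __m128 wg = _mm_set1_ps(0.587f);
    const __m128 wb = _mm_set1_ps(0.114f);
    for (int i = 0; i < n; i += 4) {
        __m128 sum = _mm_mul_ps(_mm_loadu_ps(r + i), wr);
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(g + i), wg));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(b + i), wb));
        _mm_storeu_ps(gray + i, sum);
    }
}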
There's another option of using assembly for optimization. Now, one exciting project for dynamic code generation is SoftWire (which dates back a while - here is the original project's site). It was developed by Nick Capens and grew into the now commercially available SwiftShader. But the spin-off of the original SoftWire is still available on gna.org.
This could serve as an introduction to his solution.
Personally, I don't believe you can gain significant performance by utilizing multiple threads for your problem.

Explicit code parallelism in C++

Out-of-order execution in CPUs means that a CPU can reorder instructions to gain better performance, and it means the CPU has to do some very nifty bookkeeping and such. There are other processor approaches too, such as hyper-threading.
Some fancy compilers understand the (un)interrelatedness of instructions to a limited extent, and will automatically interleave instruction flows (probably over a longer window than the CPU sees) to better utilise the processor. Deliberate compile-time interleaving of floating and integer instructions is another example of this.
Now I have highly-parallel task. And I typically have an ageing single-core x86 processor without hyper-threading.
Is there a straightforward way to get the body of my 'for' loop for this highly parallel task to be interleaved so that two (or more) iterations are being done together? (This is slightly different from 'loop unwinding' as I understand it.)
My task is a 'virtual machine' running through a set of instructions, which I'll really simplify for illustration as:
void run(int num) {
    for (int n = 0; n < num; n++) {
        vm_t data(n);
        for (int i = 0; i < data.len(); i++) {
            data.insn(i).parse();
            data.insn(i).eval();
        }
    }
}
So the execution trail might look like this:
data(1) insn(0) parse
data(1) insn(0) eval
data(1) insn(1) parse
...
data(2) insn(1) eval
data(2) insn(2) parse
data(2) insn(2) eval
Now, what I'd like is to be able to do two (or more) iterations explicitly in parallel:
data(1) insn(0) parse
data(2) insn(0) parse \ processor can do OOO as these two flow in
data(1) insn(0) eval /
data(2) insn(0) eval \ OOO opportunity here too
data(1) insn(1) parse /
data(2) insn(1) parse
I know, from profiling, (e.g. using Callgrind with --simulate-cache=yes), that parsing is about random memory accesses (cache missing) and eval is about doing ops in registers and then writing results back. Each step is several thousand instructions long. So if I can intermingle the two steps for two iterations at once, the processor will hopefully have something to do whilst the cache misses of the parse step are occurring...
Is there some c++ template madness to get this kind of explicit parallelism generated?
Of course I can do the interleaving - and even staggering - myself in code, but it makes for much less readable code. And if I really want unreadable, I can go so far as assembler! But surely there is some pattern for this kind of thing?
Given optimizing compilers and pipelined processors, I would suggest you just write clear, readable code.
Your best plan may be to look into OpenMP. It basically allows you to insert "pragmas" into your code which tell the compiler how it can split the work between processors.
Hyperthreading is a much higher-level system than instruction reordering. It makes the processor look like two processors to the operating system, so you'd need to use an actual threading library to take advantage of that. The same thing naturally applies to multicore processors.
If you don't want to use low-level threading libraries and instead want to use a task-based parallel system (and it sounds like that's what you're after) I'd suggest looking at OpenMP or Intel's Threading Building Blocks.
TBB is a library, so it can be used with any modern C++ compiler. OpenMP is a set of compiler extensions, so you need a compiler that supports it. GCC/G++ does from version 4.2 onward. Recent versions of the Intel and Microsoft compilers also support it. I don't know about any others, though.
EDIT: One other note. Using a system like TBB or OpenMP will scale the processing as much as possible - that is, if you have 100 objects to work on, they'll get split about 50/50 in a two-core system, 25/25/25/25 in a four-core system, etc.
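If the target machine did have more than one core, the outer loop of the run() function from the question is the natural place for such a pragma, assuming each vm_t is independent as the question describes; a hedged sketch:

void run(int num) {
    // each iteration owns its own vm_t, so iterations can run on different threads
    #pragma omp parallel for
    for (int n = 0; n < num; n++) {
        vm_t data(n);
        for (int i = 0; i < data.len(); i++) {
            data.insn(i).parse();
            data.insn(i).eval();
        }
    }
}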
Modern processors like the Core 2 have an enormous instruction reorder buffer on the order of nearly 100 instructions; even if the compiler is rather dumb the CPU can still make up for it.
The main issue would be if the code used a lot of registers, in which case the register pressure could force the code to be executed in sequence even if theoretically it could be done in parallel.
There is no support for parallel execution in the current C++ standard. This will change for the next version of the standard, due out next year or so.
However, I don't see what you are trying to accomplish. Are you referring to one single-core processor, or multiple processors or cores? If you have only one core, you should do whatever gets the fewest cache misses, which means whatever approach uses the smallest memory working set. This would probably be either doing all the parsing followed by all the evaluation, or doing the parsing and evaluation alternately.
If you have two cores, and want to use them efficiently, you're going to have to either use a particularly smart compiler or language extensions. Is there one particular operating system you're developing for, or should this be for multiple systems?
It sounds like you ran into the same problem chip designers face: Executing a single instruction takes a lot of effort, but it involves a bunch of different steps that can be strung together in an execution pipeline. (It is easier to execute things in parallel when you can build them out of separate blocks of hardware.)
The most obvious way is to split each task into different threads. You might want to create a single thread to execute each instruction to completion, or create one thread for each of your two execution steps and pass data between them. In either case, you'll have to be very careful with how you share data between threads and make sure to handle the case where one instruction affects the result of the following instruction. Even though you only have one core and only one thread can be running at any given time, your operating system should be able to schedule compute-intense threads while other threads are waiting for their cache misses.
(A few hours of your time would probably pay for a single very fast computer, but if you're trying to deploy it widely on cheap hardware it might make sense to consider the problem the way you're looking at it. Regardless, it's an interesting problem to consider.)
Take a look at cilk. It's an extension to ANSI C that has some nice constructs for writing parallelized code in C. However, since it's an extension of C, it has very limited compiler support, and can be tricky to work with.
This answer was written assuming the question did not contain the part "And I typically have an ageing single-core x86 processor without hyper-threading." I hope it might help other people who want to parallelize highly parallel tasks, but target dual/multi-core CPUs.
As already posted in another answer, OpenMP is a portable way to do this. However, my experience is that OpenMP overhead is quite high and it is very easy to beat it by rolling a DIY (Do It Yourself) implementation. Hopefully OpenMP will improve over time, but as it is now, I would not recommend using it for anything other than prototyping.
Given the nature of your task, what you want to do is most likely data-based parallelism, which in my experience is quite easy - the programming style can be very similar to single-core code, because you know what the other threads are doing, which makes maintaining thread safety a lot easier. The approach which worked for me: avoid dependencies and call only thread-safe functions from the loop.
To create a DIY OpenMP-style parallel loop you need to:
as a preparation create a serial for loop template and change your code to use functors to implement the loop bodies. This can be tedious, as you need to pass all references across the functor object
create a virtual JobItem interface for the functor, and inherit your functors from this interface
create a thread function which is able to process individual JobItem objects
create a thread pool of threads running this thread function
experiment with various synchronization primitives to see which works best for you. While a semaphore is very easy to use, its overhead is quite significant, and if your loop body is very short you do not want to pay this overhead for each loop iteration. What worked great for me was a combination of a manual-reset event + an atomic (interlocked) counter as a much faster alternative.
experiment with various JobItem scheduling strategies. If you have a long enough loop, it is better if each thread picks up multiple successive JobItems at a time. This reduces the synchronization overhead and at the same time makes the threads more cache friendly. You may also want to do this in some dynamic way, reducing the length of the scheduled sequence as you exhaust your tasks, or letting individual threads steal items from other threads' schedules.
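A compressed sketch of the first two steps (names invented), using the questioner's loop body as the functor; the pool and counter machinery then has the same shape as the row work-queue sketched in the image-processing thread above:

struct JobItem {
    virtual void run(int i) = 0;            // one loop iteration goes here
    virtual ~JobItem() {}
};

struct ParseEvalJob : JobItem {             // wraps the original loop body
    vm_t& data;
    explicit ParseEvalJob(vm_t& d) : data(d) {}
    virtual void run(int i) {
        data.insn(i).parse();
        data.insn(i).eval();
    }
};

// serial loop template: a drop-in replacement for the original inner loop,
// and the single place that later dispatches JobItems to the thread pool
void for_each(JobItem& job, int count) {
    for (int i = 0; i < count; ++i) job.run(i);
}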