Multithreading concept questions

I just had to write a program in which I have to do matrix multiplication using threads, where there's a thread for every multiplication.
Now I'm wondering a few things.
Are there really any advantages to using threads for multiplying a 3x2 matrix by a 2x3 matrix? For something that small, isn't sequential code still efficient? If I'm wrong, what are the advantages or disadvantages for such a small problem? The added complication seems too great for something so small.
On the other hand, would a 10000x10000 matrix benefit from threads? I would assume so, since locality comes into play, but I'm still wrapping my head around when multithreading is more efficient and when it is not.
Thanks!

Generally, you never want multiple threads to update values in the same cache line; that would kill performance. You also want to utilize SIMD units within threads. Both are typically achieved by processing data in blocks (look for the terms register blocking and cache blocking). Also, ideally, you want to create only as many threads as the hardware concurrency allows (to prevent expensive context switching). For data parallelism (such as matrix multiplication), this is easier. For task parallelism, thread pools are typically employed.
For small matrices like 3x2, multithreading would definitely be much slower than sequential processing. For larger matrices, you need to measure to find the threshold where multithreading becomes faster. That threshold depends on too many parameters to give a generic answer.
Also, I don't understand what you mean by
there's a thread for every multiplication
Do you want to create a single thread for every multiplication of two scalars? That would create a zillion threads for large matrices, which would be terribly slow.
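To make the "one thread per block of data, not per scalar multiplication" idea concrete, here is a minimal sketch. The Matrix type and multiply_threaded function are names invented for this illustration, and the split into bands of output rows is only one possible blocking; it does not attempt register or cache blocking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Row-major dense matrix; the type and its names are illustrative only.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    double at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Multiply a * b, giving each thread one contiguous band of output rows.
Matrix multiply_threaded(const Matrix& a, const Matrix& b, unsigned num_threads) {
    assert(a.cols == b.rows);
    Matrix out(a.rows, b.cols);
    num_threads = std::max(1u, num_threads);
    // Rows per thread, rounded up so every row is covered.
    const std::size_t band = (a.rows + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < a.rows; begin += band) {
        const std::size_t end = std::min(a.rows, begin + band);
        workers.emplace_back([&a, &b, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                for (std::size_t k = 0; k < a.cols; ++k)      // i-k-j order walks b and out row by row
                    for (std::size_t j = 0; j < b.cols; ++j)
                        out.at(i, j) += a.at(i, k) * b.at(k, j);
        });
    }
    for (auto& t : workers) t.join();   // each output row was written by exactly one thread
    return out;
}
```

A caller would typically pass std::thread::hardware_concurrency() as num_threads; that function may return 0, which the std::max above turns into 1. For a 3x2 by 2x3 product, the cost of creating the threads will almost certainly exceed the cost of the arithmetic.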

Related

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches, and it works fine. To speed it up I'm implementing simple multithreading: I distribute the search into main branches and scatter them among the threads. The threads don't have to interact with one another, and when a solve is found I add it to a common std::vector using a mutex, this way:
if (CubeTest.IsSolved())
{ // Solve algorithm found
std::lock_guard<std::mutex> guard(SearchMutex); // Thread safe code
Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic store (heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always the same search but changing the number of threads used, and I found things that I didn't expect.
I know that if you double the number of threads (and the processor has enough capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. When I run my code, things are as expected up to the sixth thread, but if I use one additional thread the performance gets worse. Using more threads increases the performance very little, to the point that using all available threads (12) is barely better than using only 6:
Threads vs processing time chart for Xeon X5650:
(I repeated the test several times and show the average times over all the tests.)
I repeated the tests on another computer with an Intel i7-4600U (2 cores / 4 threads) and found this:
Threads vs processing time chart for i7-4600U:
I understand that with fewer cores the performance gain from using more threads is smaller.
I also think that when you start to use the second thread on the same core, the performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is whether these performance gains from multithreading are what I can expect in the real world, or whether these numbers are telling me that I'm doing things wrong and should learn more about multithreaded programming.

What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement that one can hope for is a reduction of runtime by a factor of the number of cores [1]. In most cases this is unachievable because of the need for threads to synchronise with one another.
In worst case, not only is there no improvement due to lack of parallelism, but also the overhead of synchronisation as well as cache contention can make the runtime much worse than the single threaded program.
Peak memory use often increases linearly by number of threads because each thread needs to operate on data of their own.
Total CPU time usage, and therefore energy use also increases due to extra time spent on synchronisation. This is relevant to systems that operate on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
[1] Whether you get all of the performance out of "logical" cores, i.e. "hyper threading" or "clustered multi threading", also depends on many factors. Often, one executes the same function in all threads, in which case they tend to use the same parts of the CPU, so sharing the core between multiple threads doesn't necessarily yield a benefit.

A CPU with hyperthreading presents each physical core as two logical cores, but the two hardware threads share that core's execution units and caches. The core does issue instructions from both threads, so this is not time-slicing of the kind the operating system's scheduler performs; still, it adds no execution resources.
So what's the point of hyperthreading at all?
When one thread stalls, for example on a cache miss, the other can use execution units that would otherwise sit idle. So the performance gains come mostly from filling those gaps. If both threads compete for the same units, as happens when every thread runs the same compute-bound function, the benefit is small.
Conclusion: The performance gain you can expect from concurrency depends mainly on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive. So the less locking you can get away with the better. When you have multiple threads filling the same result set, then it can sometimes be better to let each thread build their own result set and then merge those sets later when all threads are finished.
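As a rough illustration of the "each thread builds its own result set, merge at the end" suggestion, here is a minimal sketch. The is_solution predicate and the find_solutions function are hypothetical stand-ins for the real search; only the locking pattern is the point.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for "this candidate is a solve".
inline bool is_solution(int candidate) { return candidate % 7 == 0; }

// Search [0, limit) with num_threads workers. Each worker appends to its own
// vector, so no mutex is needed during the search; results are merged after join().
std::vector<int> find_solutions(int limit, unsigned num_threads) {
    assert(limit >= 0);
    if (num_threads == 0) num_threads = 1;
    std::vector<std::vector<int>> partial(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, t, limit, num_threads] {
            // Strided split: worker t checks t, t + n, t + 2n, ...
            for (int c = static_cast<int>(t); c < limit; c += static_cast<int>(num_threads))
                if (is_solution(c)) partial[t].push_back(c);
        });
    }
    for (auto& w : workers) w.join();

    // Single-threaded merge; no other thread touches `partial` any more.
    std::vector<int> merged;
    for (const auto& p : partial) merged.insert(merged.end(), p.begin(), p.end());
    return merged;
}
```

The merged results come out grouped by worker rather than in search order, so sort them afterwards if order matters. If solves are rare, as in the question, a mutex around push_back is unlikely to be the bottleneck either way; measuring both variants is the only way to know.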

Sharing a data set across threads vs. splitting up the data per thread

I have written a small program that generates images of the Mandelbrot set, and I have been using it as an opportunity to teach myself multithreading.
I currently have four threads that each handle calculating a quarter of the data. When they finish, the data is aggregated to then be drawn to a bitmap.
I'm currently pre-calculating all the complex numbers for each pixel in the main thread and putting them into a vector. Then, I split the vector into four smaller vectors to pass into each thread to modify.
Is there a best practice here? Should I be splitting up my data set so that the threads can work without interfering with each other, or should I just use one data set and use mutexes/locking? I suppose benchmarking would probably be my best bet.
Thanks, let me know if you'd want to see my code.

The best practice is to make threads as independent of each other as possible. I'm not familiar with the particular problem you're trying to solve, but if it allows dividing work equally among threads, splitting up the data set will be the most efficient way. When splitting data, keep false sharing in mind, and minimize cross-thread data movement.
Choosing other parallelisation strategies makes sense in cases where, e.g.:
Eliminating cross-thread dependencies requires a change to the algorithm that will cause too much extra work.
The amount of work per thread isn't balanced, and you need some dynamic work assignment to ensure all threads are busy until work is completed.
The algorithm is composed of different stages such that task parallelism is more efficient than data parallelism (namely, each stage is handled by a different thread, and data is pipelined between threads. This makes sense if there are too many dependencies within each stage).
Bear in mind that a mutex/lock means wasted time waiting, as well as possibly non-trivial synchronisation overhead if the mutex is a kernel object. However, correctness comes first: if other options are too difficult to get right, you'll lose more than you'll gain. Finally, always compare your parallel implementation to a sequential one. Due to data movements and dependencies, the sequential implementation often runs faster than the parallel one.
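For the Mandelbrot case specifically, a lock-free split could look like the sketch below. It is only an illustration under assumptions: the function names, the plotted region and the interleaved row assignment are choices made here, not taken from the question's code. All threads write into one shared buffer, but each pixel is written by exactly one thread, so no mutex is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

// Escape-time iteration count for one point of the complex plane.
int mandel_iterations(std::complex<double> c, int max_iter) {
    std::complex<double> z(0.0, 0.0);
    int n = 0;
    while (n < max_iter && std::norm(z) <= 4.0) { z = z * z + c; ++n; }
    return n;
}

// Fill a width*height buffer; worker t handles rows t, t + n, t + 2n, ...
// Interleaving rows spreads the expensive rows (those crossing the set) over all workers.
std::vector<int> render(int width, int height, int max_iter, unsigned num_threads) {
    assert(width > 0 && height > 0);
    num_threads = std::max(1u, num_threads);
    std::vector<int> image(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&image, t, width, height, max_iter, num_threads] {
            for (int y = static_cast<int>(t); y < height; y += static_cast<int>(num_threads))
                for (int x = 0; x < width; ++x) {
                    // Map the pixel to the region [-2, 1] x [-1.5, 1.5].
                    std::complex<double> c(-2.0 + 3.0 * x / width, -1.5 + 3.0 * y / height);
                    image[static_cast<std::size_t>(y) * width + x] = mandel_iterations(c, max_iter);
                }
        });
    }
    for (auto& w : workers) w.join();
    return image;
}
```

Computing the complex coordinate inside the worker also removes the pre-calculation pass and the copying into four smaller vectors. With interleaved rows, neighbouring threads can touch the same cache line where one row ends and the next begins; for rows much longer than a cache line this is usually minor, but contiguous bands avoid it entirely at the cost of worse load balance.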

Is it thread-safe to access a Mat with multiple threads in OpenCV?

I want to speed up an algorithm (complete local binary pattern with circular neighbours) in which I iterate through all pixels and calculate some values from each pixel's neighbours (so I need access to neighbouring pixels).
Currently I do this by iterating over all pixels with one thread/process. I want to parallelize this task by dividing the input image into multiple ROIs and calculating each ROI separately (with multiple threads).
The problem here is that the ROIs overlap (because to calculate a pixel, I sometimes need to look at neighbours far away), so it's possible that multiple threads access pixel data (READING) at the same time. Is it a problem if two or more threads read the same Mat at the same indices at the same time?
Is it also a problem if I write to the same Mat in parallel but at different indices?

As long as no writes happen simultaneously to the reads, it is safe to have multiple concurrent reads.
That holds for any sane system.
Consider the alternative:
If there was a race condition, it would mean that the memory storing your object gets modified during the read operation. If no memory (storing the object) gets written to during the read, there's no possible interaction between the threads.
Lastly, if you look at the doc,
https://docs.opencv.org/3.1.0/d3/d63/classcv_1_1Mat.html
You'll see two mentions of thread-safety:
Thus, it is safe to operate on the same matrices asynchronously in
different threads.
They mention it around ref-counting, performed during matrix assignment. So, at the very least, assigning from the same matrix to two others can be done safely in multiple threads. This pretty much guarantees that simple read access is also thread-safe.

Generally, parallel reading is not a problem, as a cv::Mat is just a nice wrapper around an array, just like std::vector (yes, there are differences, but I don't see how they would affect the matter here, so I'm going to ignore them). However, parallelization doesn't automatically give you a performance boost. There are quite a few things to consider here:
Creating a thread is resource-heavy and can have a large negative impact if the task is relatively short (in terms of computation time), so thread pooling has to be considered.
If you write high-performance code (no matter whether multi- or single-threaded) you should have a grasp of how your hardware works. In this case: memory and CPU. There is a very good talk from Timur Doumler at CppCon 2016 about that topic. This should help you avoid cache misses.
Also worth mentioning is compiler optimization. Turn it on. I know this sounds super obvious, but there are a lot of people on SO who ask questions about performance and yet don't know what compiler optimization is.
Finally, there is the OpenCV Transparent API (TAPI) which basically utilizes the GPU instead of the CPU. Almost all built-in algorithms of OpenCV support the TAPI, you just have to pass a cv::UMat instead of a cv::Mat. Those two types are convertible to each other. However, the conversion is time intensive because a UMat is basically an array on the GPU memory (VRAM), which means it has to be copied each time you convert it. Also accessing the VRAM takes longer than accessing the RAM (for the CPU that is).
Though, you have to keep in mind that you cannot access VRAM data with the CPU without copying it to the RAM. This means you cannot iterate over your pixels if you use cv::UMat. It is only possible if you write your own OpenCL or Cuda code so your algorithm can run on the GPU.
In most consumer grade PCs, for sliding window algorithms (basically anything that iterates over the pixels and performs a calculation around each pixel), using the GPU is usually by far the fastest method (but also requires the most effort to implement). Of course this only holds if the data buffer (your image) is large enough to make it worth copying to and from the VRAM.
For parallel writing: it's generally safe as long as you don't have overlapping areas. However, cache misses and false sharing (as pointed out by NathanOliver) are problems to be considered.
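The read-overlapping, write-disjoint pattern discussed above can be sketched without OpenCV. In the example below a plain std::vector stands in for the pixel buffer of a cv::Mat, and a 3x3 neighbourhood maximum stands in for the local binary pattern; both are assumptions made for the sake of a self-contained illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// 3x3 neighbourhood maximum on a row-major grayscale image, one band of rows per thread.
std::vector<std::uint8_t> local_max_3x3(const std::vector<std::uint8_t>& in,
                                        int width, int height, unsigned num_threads) {
    assert(width >= 0 && height >= 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    const int n = static_cast<int>(std::max(1u, num_threads));
    std::vector<std::uint8_t> out(in.size());
    const int band = (height + n - 1) / n;   // rows per thread, rounded up

    std::vector<std::thread> workers;
    for (int y0 = 0; y0 < height; y0 += band) {
        const int y1 = std::min(height, y0 + band);
        workers.emplace_back([&in, &out, width, height, y0, y1] {
            for (int y = y0; y < y1; ++y)
                for (int x = 0; x < width; ++x) {
                    std::uint8_t best = 0;
                    // Reads may reach into rows owned by a neighbouring band.
                    // That is safe because `in` is never written while the threads run.
                    for (int dy = -1; dy <= 1; ++dy)
                        for (int dx = -1; dx <= 1; ++dx) {
                            const int ny = std::min(height - 1, std::max(0, y + dy));
                            const int nx = std::min(width - 1, std::max(0, x + dx));
                            best = std::max(best, in[static_cast<std::size_t>(ny) * width + nx]);
                        }
                    // Each output pixel is written by exactly one thread.
                    out[static_cast<std::size_t>(y) * width + x] = best;
                }
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

The key design choice is a separate output buffer: writing results back into the input would turn the overlapping reads into a data race.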

Improving image processing speed

I am using C++ and OpenCV to process some images taken from a Webcam in realtime and I am looking to get the best speed I can from my system.
Other than changing the processing algorithm (assume, for now, that you can't change it), is there anything that I should be doing to maximize the speed of processing?
I am thinking maybe Multithreading could help here but I'm ashamed to say I don't really know the ins and outs (although obviously I have used multithreading before but not in C++).
Assuming I have an x-core processor, does splitting the processing into x threads actually speed things up?...or would the management overhead of these threads negate it assuming that I am looking for a throughput of 20fps (I assume that will affect the answer you give as it should give you an indication of how much processing will be done per thread)
Would multithreading help here?
Are there any tips for increasing the speed of OpenCV specifically, or any pitfalls that I might be falling into that reduce speed.
Thanks.

The easiest way, I think, would be to pipeline the frame operations.
You could work with a thread pool, handing each frame's memory buffer in sequence to the first available thread, and releasing the thread back to the pool when the algorithm step on that frame has completed.
This would leave your current (debugged :) algorithm practically unchanged, but it will require substantially more memory for buffering intermediate results.
Of course, without details about your task, it's hard to say if this is appropriate...

There is one important thing about increasing speed in OpenCV that is related to neither the processor nor the algorithm: avoiding extra copying when dealing with matrices. I will give you an example taken from the documentation:
"...by constructing a header for a part of another matrix. It can be a
single row, single column, several rows, several columns, rectangular
region in the matrix (called a minor in algebra) or a diagonal. Such
operations are also O(1), because the new header will reference the
same data. You can actually modify a part of the matrix using this
feature, e.g."
// add 5-th row, multiplied by 3 to the 3rd row
M.row(3) = M.row(3) + M.row(5)*3;
// now copy 7-th column to the 1-st column
// M.col(1) = M.col(7); // this will not work
Mat M1 = M.col(1);
M.col(7).copyTo(M1);
Maybe you already knew about this, but I think it is important to highlight headers in OpenCV as an important and efficient coding tool.

Assuming I have an x-core processor, does splitting the processing into x threads actually speed things up?
Yes, although it very heavily depends on the particular algorithm being used, as well as your skill in writing threaded code to handle things like synchronization. You didn't really provide enough detail to make a better assessment than that.
Some algorithms are extremely easy to parallelize, like ones that have the form:
for (i=0; i < DATA_SIZE; i++)
{
output[i] = f(input[i]);
}
for some function f. These are known as embarrassingly parallel; you can simply split the data into N blocks and have N threads each process one block independently. Libraries like OpenMP make this kind of threading extremely simple.

Unless the particular algorithm you are using is already optimized for a multithreaded/parallel platform, throwing it at an x-core processor will do nothing for you. The algorithm has to be inherently threadable to benefit from multiple threads; if it wasn't designed with that in mind, it will have to be altered. On the other hand, many image processing algorithms are "embarrassingly parallel", at least in concept. Can you share more details about the algorithm you have in mind?

If your threads can operate on different data, it would seem reasonable to thread it off, perhaps queueing each frame object to a thread pool. You may have to add sequence numbers to the frame objects to ensure that the processed frames emerging from the pool are delivered in the same order they went in.
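A minimal sketch of the sequence-number idea follows. The Frame struct, the process function and the doubling "work" are placeholders invented for this example, and the workers are plain std::thread objects rather than a real pool. Frames finish in whatever order the threads happen to complete them; the sequence number is what restores the original order.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Placeholder frame: a sequence number plus some payload.
struct Frame { std::size_t seq; int value; };

// Placeholder for the real per-frame processing.
Frame process(Frame f) { f.value *= 2; return f; }

std::vector<Frame> process_frames(const std::vector<Frame>& frames, unsigned num_threads) {
    std::vector<Frame> done;             // filled in completion order
    std::mutex done_mutex;
    std::atomic<std::size_t> next{0};    // index of the next unclaimed frame

    auto worker = [&] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= frames.size()) return;
            Frame r = process(frames[i]);                 // heavy work happens outside the lock
            std::lock_guard<std::mutex> lock(done_mutex);
            done.push_back(r);
        }
    };

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < std::max(1u, num_threads); ++t) workers.emplace_back(worker);
    for (auto& w : workers) w.join();

    // Restore capture order using the sequence numbers.
    std::sort(done.begin(), done.end(),
              [](const Frame& a, const Frame& b) { return a.seq < b.seq; });
    return done;
}
```

This version sorts once at the end, which suits a batch of frames. A live 20 fps stream would instead need a small reorder buffer that releases frame k only after frames 0 to k-1 have been released.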

As example code for multi-threaded image processing with OpenCV, you might want to check out my code:
https://github.com/vmlaker/sherlock-cpp
It's what I came up with wanting to take advantage of x-core CPU to improve performance of object detection. The detect program is basically a parallel algorithm that distributes tasks among multiple threads, a separate pipelined thread for every task:
Allocation of frame memory and video capture.
Object detection (one thread per each Haar classifier.)
Augmenting output with detection result and displaying the frame.
Memory deallocation.
With memory for every captured frame shared between all threads, I got great performance and CPU utilization.

Multithreaded image processing in C++

I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur). This is done on a per-pixel basis.
Such image manipulations are very stressful on the CPU. I would like to use multithreading to speed things up. How would I do this? I was thinking of creating one thread per row of pixels.
I have several requirements:
Executable size must be minimized. In other words, I can't use massive libraries. What's the most light-weight, portable threading library for C/C++?
Executable size must be minimized. I was thinking of having a function forEachRow(fp* ) which runs a thread for each row, or even a forEachPixel(fp* ) where fp operates on a single pixel in its own thread. Which is best?
Should I use normal functions or functors or functionoids or some lambda functions or ... something else?
Some operations use optimizations which require information from the previous pixel processed. This makes forEachRow favorable. Would using forEachPixel be better even considering this?
Would I need to lock my read-only and write-only arrays?
The input is only read from, but many operations require input from more than one pixel in the array.
The output is only written once per pixel.
Speed is also important (of course), but optimizing executable size takes precedence.
Thanks.
More information on this topic for the curious: C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks

Don't embark on threading lightly! The race conditions can be a major pain in the arse to figure out. Especially if you don't have a lot of experience with threads! (You've been warned: Here be dragons! Big hairy non-deterministic impossible-to-reliably-reproduce dragons!)
Do you know what deadlock is? How about Livelock?
That said...
As ckarmann and others have already suggested: Use a work-queue model. One thread per CPU core. Break the work up into N chunks. Make the chunks reasonably large, like many rows. As each thread becomes free, it snags the next work chunk off the queue.
In the simplest IDEAL version, you have N cores, N threads, and N subparts of the problem with each thread knowing from the start exactly what it's going to do.
But that doesn't usually happen in practice due to the overhead of starting/stopping threads. You really want the threads to already be spawned and waiting for action. (E.g. Through a semaphore.)
The work-queue model itself is quite powerful. It lets you parallelize things like quick-sort, which normally doesn't parallelize across N threads/cores gracefully.
More threads than cores? You're just wasting overhead. Each thread has overhead. Even at #threads=#cores, you will never achieve a perfect Nx speedup factor.
One thread per row would be very inefficient! One thread per pixel? I don't even want to think about it. (That per-pixel approach makes a lot more sense when playing with vectorized processor units like they had on the old Crays. But not with threads!)
Libraries? What's your platform? Under Unix/Linux/g++ I'd suggest pthreads & semaphores. (Pthreads is also available under windows with a microsoft compatibility layer. But, uhgg. I don't really trust it! Cygwin might be a better choice there.)
Under Unix/Linux, man:
* pthread_create, pthread_detach.
* pthread_mutexattr_init, pthread_mutexattr_settype, pthread_mutex_init,
* pthread_mutexattr_destroy, pthread_mutex_destroy, pthread_mutex_lock,
* pthread_mutex_trylock, pthread_mutex_unlock, pthread_mutex_timedlock.
* sem_init, sem_destroy, sem_post, sem_wait, sem_trywait, sem_timedwait.
Some folks like pthreads' condition variables. But I always preferred POSIX 1003.1b semaphores. They handle the situation where you want to signal another thread BEFORE it starts waiting somewhat better. Or where another thread is signaled multiple times.
Oh, and do yourself a favor: Wrap your thread/mutex/semaphore pthread calls into a couple of C++ classes. That will simplify matters a lot!
Would I need to lock my read-only and write-only arrays?
It depends on your precise hardware & software. Usually read-only arrays can be freely shared between threads. But there are cases where that is not so.
Writing is much the same. Usually, as long as only one thread is writing to each particular memory spot, you are ok. But there are cases where that is not so!
Writing is more troublesome than reading as you can get into these weird fencepost situations. Memory is often written as words not bytes. When one thread writes part of the word, and another writes a different part, depending on the exact timing of which thread does what when (e.g. nondeterministic), you can get some very unpredictable results!
I'd play it safe: Give each thread its own copy of the read and write areas. After they are done, copy the data back. All under mutex, of course.
Unless you are talking about gigabytes of data, memory blits are very fast. That couple of microseconds of performance time just isn't worth the debugging nightmare.
If you were to share one common data area between threads using mutexes, the collision/waiting mutex inefficiencies would pile up and devastate your efficiency!
Look, clean data boundaries are the essence of good multi-threaded code. When your boundaries aren't clear, that's when you get into trouble.
Similarly, it's essential to keep everything on the boundary mutexed! And to keep the mutexed areas short!
Try to avoid locking more than one mutex at the same time. If you do lock more than one mutex, always lock them in the same order!
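One way to get "always the same order" when two threads name the same pair of mutexes in opposite order is to order the locks by address. The sketch below uses the standard C++ std::mutex rather than raw pthreads, and the Account type and function names are made up for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Illustrative shared object guarded by its own mutex.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Moves `amount` between two accounts while holding both mutexes.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    // Fixed global order: always lock the account with the lower address first,
    // no matter which one is `from` and which one is `to`.
    Account* first = &from;
    Account* second = &to;
    if (std::less<Account*>()(second, first)) std::swap(first, second);
    std::lock_guard<std::mutex> g1(first->m);
    std::lock_guard<std::mutex> g2(second->m);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions. If transfer() locked `from`
// before `to` unconditionally, this pattern could deadlock.
void stress(Account& a, Account& b, int rounds) {
    std::thread t1([&a, &b, rounds] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&a, &b, rounds] { for (int i = 0; i < rounds; ++i) transfer(b, a, 2); });
    t1.join();
    t2.join();
}
```

The standard library also offers std::lock (and std::scoped_lock since C++17), which acquire several mutexes with a deadlock-avoidance algorithm and make the manual ordering unnecessary.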
Where possible use ERROR-CHECKING or RECURSIVE mutexes. FAST mutexes are just asking for trouble, with very little actual (measured) speed gain.
If you get into a deadlock situation, run it in gdb, hit ctrl-c, visit each thread and backtrace. You can find the problem quite quickly that way. (Livelock is much harder!)
One final suggestion: Build it single-threaded, then start optimizing. On a single-core system, you may find yourself gaining more speed from things like foo[i++]=bar ==> *(foo++)=bar than from threading.
Addendum: What I said about keeping mutexed areas short up above? Consider two threads: (Given a global shared mutex object of a Mutex class.)
/*ThreadA:*/ while(1){ mutex.lock(); printf("a\n"); usleep(100000); mutex.unlock(); }
/*ThreadB:*/ while(1){ mutex.lock(); printf("b\n"); usleep(100000); mutex.unlock(); }
What will happen?
Under my version of Linux, one thread will run continuously and the other will starve. Very very rarely they will change places when a context swap occurs between mutex.unlock() and mutex.lock().
Addendum: In your case, this is unlikely to be an issue. But with other problems one may not know in advance how long a particular work-chunk will take to complete. Breaking a problem down into 100 parts (instead of 4 parts) and using a work-queue to split it up across 4 cores smooths out such discrepancies.
If one work-chunk takes 5 times longer to complete than another, well, it all evens out in the end. Though with too many chunks, the overhead of acquiring new work-chunks creates noticeable delays. It's a problem-specific balancing act.
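Here is a minimal sketch of the work-queue model described in this answer, written with the C++11 standard library instead of pthreads and semaphores. The WorkQueue class, the Chunk alias and the "square every element" work are all invented for the example. The mutex is held only while a chunk is taken from the queue, never while the chunk is processed.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Chunk = std::pair<std::size_t, std::size_t>;   // [begin, end) range of rows

// Tiny mutex-protected queue of work chunks.
class WorkQueue {
public:
    void push(Chunk c) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(c);
    }
    // Returns false when no work is left.
    bool try_pop(Chunk& c) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        c = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::queue<Chunk> q_;
};

// Squares every element of `rows`, handing out chunk_rows elements at a time.
void process_rows(std::vector<int>& rows, std::size_t chunk_rows, unsigned num_threads) {
    chunk_rows = std::max<std::size_t>(1, chunk_rows);
    WorkQueue queue;
    // All work is queued before the workers start, so an empty queue means "done".
    for (std::size_t b = 0; b < rows.size(); b += chunk_rows)
        queue.push(Chunk(b, std::min(rows.size(), b + chunk_rows)));

    auto worker = [&queue, &rows] {
        Chunk c;
        while (queue.try_pop(c))   // a free worker snags the next chunk
            for (std::size_t i = c.first; i < c.second; ++i) rows[i] *= rows[i];
    };
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < std::max(1u, num_threads); ++t) workers.emplace_back(worker);
    for (auto& w : workers) w.join();
}
```

The chunk_rows parameter is the balancing knob mentioned above: smaller chunks even out uneven work, larger chunks reduce the number of trips to the queue.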

If your compiler supports OpenMP (I know VC++ 8.0 and 9.0 do, as does gcc), it can make things like this much easier to do.
You don't just want to make a lot of threads - there's a point of diminishing returns where adding new threads slows things down as you start getting more and more context switches. At some point, using too many threads can actually make the parallel version slower than just using a linear algorithm. The optimal number of threads is a function of the number of cpus/cores available, and the percentage of time each thread spends blocked on things like I/O. Take a look at this article by Herb Sutter for some discussion on parallel performance gains.
OpenMP lets you easily adapt the number of threads created to the number of CPUs available. Using it (especially in data-processing cases) often involves simply putting a few #pragma omp directives into existing code, and letting the compiler handle creating threads and synchronization.
In general - as long as data isn't changing, you won't have to lock read-only data. If you can be sure that each pixel slot will only be written once and you can guarantee that all the writing has been completed before you start reading from the result, you won't have to lock that either.
For OpenMP, there's no need to do anything special as far as functors / function objects. Write it whichever way makes the most sense to you. Here's an image-processing example from Intel (converts rgb to grayscale):
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{
pGrayScaleBitmap[i] = (unsigned BYTE)
(pRGBBitmap[i].red * 0.299 +
pRGBBitmap[i].green * 0.587 +
pRGBBitmap[i].blue * 0.114);
}
This automatically splits up into as many threads as you have CPUs, and assigns a section of the array to each thread.
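For anyone who wants to try the snippet, a self-contained variant might look like the sketch below. The RGB struct and the to_grayscale function are assumptions made to get something compilable; the original fragment's BYTE and bitmap types are not defined in the answer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed pixel layout; the original snippet does not show its types.
struct RGB { std::uint8_t red, green, blue; };

std::vector<std::uint8_t> to_grayscale(const std::vector<RGB>& rgb) {
    std::vector<std::uint8_t> gray(rgb.size());
    // Signed loop index: OpenMP versions before 3.0 do not accept unsigned ones.
    const long n = static_cast<long>(rgb.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        // Each iteration reads rgb[i] and writes gray[i] only, so no locking is needed.
        gray[i] = static_cast<std::uint8_t>(rgb[i].red * 0.299 +
                                            rgb[i].green * 0.587 +
                                            rgb[i].blue * 0.114);
    }
    return gray;
}
```

OpenMP has to be enabled when compiling (for example -fopenmp with gcc, /openmp with Visual C++). Without the flag the pragma is ignored and the loop simply runs on one thread, which makes it easy to compare the two builds.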

I would recommend boost::thread and boost::gil (generic image library). Because quite a lot of templates are involved, I'm not sure whether the code size will still be acceptable for you. But it's part of Boost, so it is probably worth a look.

As a bit of a left-field idea...
What systems are you running this on? Have you thought of using the GPU in your PCs?
Nvidia have the CUDA APIs for this sort of thing

I don't think you want to have one thread per row. There can be a lot of rows, and you will spend a lot of memory and CPU resources just launching and destroying the threads, and on the CPU switching from one to the other. Moreover, if you have P processors with C cores each, you probably won't gain much from more than C*P threads.
I would advise you to use a defined number of client threads, for example N threads, and use the main thread of your application to distribute the rows to each thread, or they can simply get instruction from a "job queue". When a thread has finished with a row, it can check in this queue for another row to do.
As for libraries, you can use boost::thread, which is quite portable and not too heavyweight.

Can I ask which platform you're writing this for? I'm guessing that because executable size is an issue you're not targeting a desktop machine. In that case, does the platform have multiple cores, or is it hyperthreaded? If not, then adding threads to your application could have the opposite effect and slow it down...

To optimize simple image transformations, you are far better off using SIMD vector math than trying to multi-thread your program.

If your compiler doesn't support OpenMP, another option is to use a library approach: both Intel's Threading Building Blocks and the Microsoft Concurrency Runtime are available (VS 2010).
There is also a set of interfaces called the Parallel Patterns Library, which is supported by both libraries and includes a templated parallel_for library call.
so instead of:
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{ ...}
you would write:
parallel_for(0,numPixels,1,ToGrayScale());
where ToGrayScale is a functor or pointer to function. (Note if your compiler supports lambda expressions which it likely doesn't you can inline the functor as a lambda expression).
parallel_for(0,numPixels,1,[&](int i)
{
pGrayScaleBitmap[i] = (unsigned BYTE)
(pRGBBitmap[i].red * 0.299 +
pRGBBitmap[i].green * 0.587 +
pRGBBitmap[i].blue * 0.114);
});
-Rick
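If neither library is an option (for instance because of the executable-size requirement), the same calling style can be approximated with the standard library alone. The simple_parallel_for helper below is a hypothetical stand-in written for this illustration; it is not the PPL or TBB parallel_for and has none of their scheduling sophistication.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not the PPL/TBB API: calls body(i) for every i in [first, last),
// splitting the range into one contiguous block per hardware thread.
template <typename Body>
void simple_parallel_for(std::size_t first, std::size_t last, Body body) {
    if (first >= last) return;
    const std::size_t n = last - first;
    // hardware_concurrency() may return 0, hence the max with 1.
    const std::size_t threads =
        std::min<std::size_t>(n, std::max(1u, std::thread::hardware_concurrency()));
    const std::size_t block = (n + threads - 1) / threads;   // rounded up

    std::vector<std::thread> workers;
    for (std::size_t b = first; b < last; b += block) {
        const std::size_t e = std::min(last, b + block);
        workers.emplace_back([b, e, &body] {
            for (std::size_t i = b; i < e; ++i) body(i);   // body must be safe to call concurrently
        });
    }
    for (auto& w : workers) w.join();
}
```

It would be called much like the lambda version above, e.g. simple_parallel_for(0, numPixels, [&](std::size_t i) { ... });. Note that it creates and joins threads on every call, so it only pays off when each call covers a substantial amount of work.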

Check the Creating an Image-Processing Network walkthrough on MSDN, which explains how to use Parallel Patterns Library to compose a concurrent image processing pipeline.
I'd also suggest Boost.GIL, which generates highly efficient code. For a simple multi-threaded example, check gil_threaded by Victor Bogado. The article "An image processing network using Dataflow.Signals and Boost.GIL" explains an interesting dataflow model too.

One thread per pixel row is insane. It is best to have around n-1 to 2n threads (for n CPUs) and make each one loop, fetching one job unit at a time (which may be one row, or some other kind of partition).
On Unix-like systems, use pthreads; it's simple and lightweight.

Maybe write your own tiny library which implements a few standard threading functions using #ifdef's for every platform? There really isn't much to it, and that would reduce the executable size way more than any library you could use.
Update: And for work distribution - split your image into pieces and give each thread a piece. So that when it's done with the piece, it's done. This way you avoid implementing job queues that will further increase your executable's size.

Regardless of the threading model you choose (boost, pthread, native threads, etc.), I think you should consider a thread pool as opposed to a thread per row. Threads in a thread pool are very cheap to "start" since they are already created as far as the OS is concerned; it's just a matter of giving them something to do.
Basically, you could have say 4 threads in your pool. Then in a serial fashion, for each pixel, tell the next thread in the thread pool to process the pixel. This way you are effectively processing no more than 4 pixels at a time. You could make the size of the pool based either on user preference or on the number of CPUs the system reports.
This is by far the simplest way IMHO to add threading to a SIMD task.
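For reference, a very small thread pool can be written with the C++11 standard library alone. The ThreadPool class below is a sketch made up for this illustration (no futures, no exception handling, and a pool of size 0 would never run anything), but it shows the "threads already exist, just hand them work" idea.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size thread pool; illustrative only.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    // Waits for all queued tasks to finish, then joins the workers.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;   // only reachable once stopping_ is set
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // run the task without holding the lock
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

Each submit costs a lock, a queue operation and a wake-up, which is far more than the work for a single pixel in most filters. Submitting one task per row, or per block of rows, is usually a much better granularity than one task per pixel.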

I think map/reduce framework will be the ideal thing to use in this situation. You can use Hadoop streaming to use your existing C++ application.
Just implement the map and reduce jobs.
As you said, you can use row-level manipulations as the map task and combine the row-level results into the final image in the reduce task.
Hope this is useful.

It is very possible that the bottleneck is not the CPU but memory bandwidth, in which case multi-threading WON'T help a lot. Try to minimize memory access and work on limited memory blocks, so that more of the data stays in cache. I had a similar problem a while ago and decided to optimize my code to use SSE instructions. The speed increase was almost 4x on a single thread!
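To give an idea of what the SSE route looks like, here is a small sketch that applies a gain and an offset to a float buffer four values at a time. The scale_offset function and the HAVE_SSE macro are names invented for this example, and the operation itself is just a placeholder for a real per-pixel transform. On targets without SSE the code falls back to the plain scalar loop.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

// Computes out[i] = in[i] * gain + offset for i in [0, n).
void scale_offset(const float* in, float* out, std::size_t n, float gain, float offset) {
    std::size_t i = 0;
#ifdef HAVE_SSE
    const __m128 g = _mm_set1_ps(gain);      // gain broadcast to all 4 lanes
    const __m128 o = _mm_set1_ps(offset);    // offset broadcast to all 4 lanes
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);     // unaligned load of 4 floats
        v = _mm_add_ps(_mm_mul_ps(v, g), o);
        _mm_storeu_ps(out + i, v);           // unaligned store of 4 floats
    }
#endif
    // Scalar tail, or the whole range when SSE is not available.
    for (; i < n; ++i) out[i] = in[i] * gain + offset;
}
```

Modern compilers will often auto-vectorize a loop this simple when optimization is turned on, so it is worth checking the generated code before writing intrinsics by hand.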

You could also use libraries like IPP or the Cassandra Vision C++ API, which are mostly much better optimized than your own code.

There's another option of using assembly for optimization. Now, one exciting project for dynamic code generation is softwire (which dates back awhile - here is the original project's site). It has been developed by Nick Capens and grew into now commercially available swiftshader. But the spin-off of the original softwire is still available on gna.org.
This could serve as an introduction to his solution.
Personally, I don't believe you can gain significant performance by utilizing multiple threads for your problem.
