OpenMP and cores/threads - c++

My CPU is a Core i3 330M with 2 cores and 4 threads. When I run cat /proc/cpuinfo in my terminal, it reports 4 CPUs. When I call the OpenMP function omp_get_num_procs(), I also get 4.
Now I have a standard C++ vector class, I mean a fixed-size double array class that does not use expression templates. I have carefully parallelized all the methods of my class and I get the "expected" speedup.
The question is: can I guess the expected speedup in such a simple case? For instance, if I add two vectors without parallelized for-loops I get some time (using the shell time command). Now if I use OpenMP, should I get a time divided by 2 or 4, according to the number of cores/threads? I emphasize that I am only asking for this particular simple problem, where there is no interdependence in the data and everything is linear (vector addition).
Here is some code:
Vector Vector::operator+(const Vector& rhs) const
{
    assert(m_size == rhs.m_size);
    Vector result(m_size);
    // Each iteration is independent, so a static schedule splits the range evenly
    #pragma omp parallel for schedule(static)
    for (unsigned int i = 0; i < m_size; i++)
        result.m_data[i] = m_data[i] + rhs.m_data[i];
    return result;
}
I have already read this post: OpenMP thread mapping to physical cores.
I hope that somebody can tell me more about how OpenMP gets the work done in this simple case. I should say that I am a beginner in parallel computing.

EDIT : Now that some code has been added.
In that particular example, there is very little computation and lots of memory access. So the performance will depend heavily on:
The size of the vector.
How you are timing it. (do you have an outer-loop for timing purposes)
Whether the data is already in cache.
For larger vector sizes, you will likely find that the performance is limited by your memory bandwidth. In which case, parallelism is not going to help much. For smaller sizes, the overhead of threading will dominate. If you're getting the "expected" speedup, you're probably somewhere in between where the result is optimal.
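To see which of these regimes a given machine is in, one option is to time the same addition over a range of sizes. The following is only a minimal sketch: the function name time_vector_add and its parameters are made up for illustration, and the loop mirrors the operator+ from the question rather than reproducing the asker's Vector class.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `repeats` additions of two vectors of length n
// and returns the average time per addition in seconds.
// Build with OpenMP enabled (e.g. -fopenmp with GCC/Clang) to parallelize the
// inner loop; without it the pragma is ignored and the loop runs serially.
double time_vector_add(std::size_t n, int repeats, std::vector<double>& result)
{
    std::vector<double> a(n, 1.0), b(n, 2.0);
    result.assign(n, 0.0);
    const long long count = static_cast<long long>(n);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        #pragma omp parallel for schedule(static)
        for (long long i = 0; i < count; ++i)
            result[i] = a[i] + b[i];
    }
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count() / repeats;
}
```

Calling it for sizes from a few thousand elements (fits in cache) up to tens of millions (does not) and comparing the serial and OpenMP builds should show where threading overhead and memory bandwidth take over.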
I refuse to give hard numbers because in general, "guessing" performance, especially in multi-threaded applications is a lost cause unless you have prior testing knowledge or intimate knowledge of both the program and the system that it's running on.
Just as a simple example taken from my answer here: How to get 100% CPU usage from a C program
On a Core i7 920 @ 3.5 GHz (4 cores, 8 threads):
If I run with 4 threads, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 39.3498 seconds
If I run with 4 threads and explicitly (using Task Manager) pin the threads on 4 distinct physical cores, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 30.4429 seconds
So this shows how unpredictable it is for even a very simple and embarrassingly parallel application. Applications involving heavy memory usage and synchronization get a lot uglier...

To add to Mystical's answer: your problem is purely memory-bandwidth bound. Have a look at the STREAM benchmark. Run it on your computer in the single-threaded and multi-threaded cases, and look at the Triad results. This is your case (well, almost, since your output vector is at the same time one of your input vectors). Calculate how much data you move around and you will know exactly what performance to expect.
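The "calculate how much data you move" step can be sketched as follows. This is a rough estimate only: the function names are invented for illustration, the count assumes two reads and one write per element, and the bandwidth figure has to come from your own STREAM run.

```cpp
#include <cstddef>

// Bytes moved by result[i] = a[i] + b[i] over n doubles: two reads and one
// write per element. (Write-allocate traffic can add a further read per
// element on some systems; that is ignored here.)
double bytes_moved(std::size_t n)
{
    return 3.0 * static_cast<double>(n) * sizeof(double);
}

// Lower bound on the run time for a given sustainable memory bandwidth in
// bytes per second, as measured with a benchmark such as STREAM.
double min_seconds(std::size_t n, double bandwidth_bytes_per_s)
{
    return bytes_moved(n) / bandwidth_bytes_per_s;
}
```

For example, with one million elements and a hypothetical measured bandwidth of 12 GB/s, min_seconds(1000000, 12e9) gives 24 MB / 12 GB/s = 2 ms, no matter how many threads are used.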
Does multi-threading work for this problem? Yes. It is rare that a single CPU core can saturate the entire memory bandwidth of the system. Modern computers balance the available memory bandwidth with the number of cores available. From my experience you will need around half of the cores to saturate the memory bandwidth with a simple memcopy operation. It might take a few more if you do some calculations on the way.
Note that on NUMA systems you will need to bind the threads to cpu cores and use local memory allocation to get optimal results. This is because on such systems every CPU has its own local memory, to which the access is the fastest. You can still access the entire system memory like on usual SMPs, but this incurs communication cost - CPUs have to explicitly exchange data. Binding threads to CPUs and using local allocation is extremely important. Failing to do this kills the scalability. Check libnuma if you want to do this on Linux.


Threading: Most efficient way for many repeated parallel sweeps over a small array?

I'm optimizing a solver (systems of linear equations) whose most critical part consists of
Many (1000+) short (~10-1000 Microseconds) massively parallel (128 threads on 64 CPU cores) sweeps over small (CPU cache size) arrays, pseudocode:
for i in 1 .. 1000+:            (serial outer loop)
    for j in 0 .. N-1:          (parallel inner loop)
        array_out[j] = some_function( array_in[j] )
    swap( array_in, array_out );
Unfortunately, the standard parallelization constructs provided by OMP or TBB that I have tried so far (serial outer loop plus parallel inner loop, e.g. via tbb::parallel_for) don't seem to handle this extremely fine-grained parallelism very well, because the thread libraries' setup cost for the inner loop seems to dominate the time spent within the relatively short inner loop. (Note that very fine-grained inner loops are crucial for the total performance of the algorithm, because this way all data can be kept in L2/L3 CPU cache.)
EDIT to address answers,questions & comments so far:
The example is just pseudocode to illustrate the idea. The actual implementation takes care of false sharing by padding array lines to the CPU cache-line size.
some_func(array_in, j) is a simple stencil that accesses the current point j and a small neighborhood around it, e.g. some_func( array, j ) = array[j-1]+array[j]+array[j+1];
The answer given by Jérôme Richard touches on a very interesting point about barriers (here is IMO the root of the problem). It is suggested to "replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key". Interesting, but very general. How exactly would this be accomplished in this case?
Does "point-to-point-neighbor synchronization" involve an atomic primitive for every entry of the array ?
The general solution to this problem is to create the threads and distribute the work only once, and then use fast synchronization point in the threads. In this case, the outer loop is moved in the threaded function. This is possible with the TBB library by providing a range (tbb::blocked_range<size_t> ) and a function to tbb::parallel_for (see an example here).
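As an illustration of moving the outer loop into the threaded region, here is a minimal sketch using OpenMP rather than TBB (a swapped-in runtime, chosen only because it keeps the example short). The function name run_sweeps is hypothetical, the three-point stencil is the one from the question's edit, and the code assumes both arrays have the same size and identical boundary values.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Runs n_sweeps three-point stencil sweeps. The thread team is created once;
// only the work-sharing loop and its barrier are repeated per sweep.
// On return, `in` holds the final values.
void run_sweeps(std::vector<double>& in, std::vector<double>& out, int n_sweeps)
{
    const long n = static_cast<long>(in.size());
    double* src = in.data();
    double* dst = out.data();

    // firstprivate gives every thread its own copy of the two pointers
    #pragma omp parallel firstprivate(src, dst)
    {
        for (int s = 0; s < n_sweeps; ++s) {
            #pragma omp for schedule(static)
            for (long j = 1; j < n - 1; ++j)
                dst[j] = src[j - 1] + src[j] + src[j + 1];
            // The implicit barrier at the end of "omp for" guarantees that the
            // whole sweep is finished before any thread starts the next one.
            std::swap(src, dst);  // each thread swaps its private pointers
        }
    }

    // After an odd number of sweeps the result lives in `out`
    if (n_sweeps % 2 != 0)
        in.swap(out);
}
```

This removes the per-sweep thread setup cost but still pays for one barrier per sweep.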
Barriers are known to scale poorly on many-core architectures, especially in such codes. The only way to make the program scale is to reduce the synchronization between threads so as to make it more local. For example, for stencils, the solution is to replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key to solving such problems. In fact, note that the fundamental laws of physics prevent barriers from scaling, because clocks cannot be fully synchronized in general relativity and computers (unfortunately) obey the laws of physics.
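One possible way to do point-to-point neighbor synchronization for a three-point stencil is sketched below. It does not need an atomic per array entry: one atomic progress counter per thread is enough. Everything here is an assumption made for illustration (the name run_sweeps_p2p, plain std::thread instead of a task runtime, spin-waiting with yield), and it requires that every thread owns at least one interior point and that both arrays share the same boundary values.

```cpp
#include <atomic>
#include <cstddef>
#include <initializer_list>
#include <thread>
#include <utility>
#include <vector>

// Double-buffered stencil sweeps where each thread waits only for its left
// and right neighbours. done[t] is the number of sweeps thread t has finished.
void run_sweeps_p2p(std::vector<double>& in, std::vector<double>& out,
                    int n_sweeps, int n_threads)
{
    const std::size_t n = in.size();
    std::vector<std::atomic<int>> done(n_threads);
    for (auto& d : done) d.store(0);

    auto worker = [&](int t) {
        // Contiguous chunk [lo, hi) of the interior points 1 .. n-2
        const std::size_t interior = n - 2;
        const std::size_t lo = 1 + interior * t / n_threads;
        const std::size_t hi = 1 + interior * (t + 1) / n_threads;
        double* src = in.data();
        double* dst = out.data();

        for (int s = 0; s < n_sweeps; ++s) {
            // Sweep s may start once both neighbours have finished sweep s-1:
            // their boundary values are then written, and they no longer read
            // the buffer this thread is about to overwrite.
            for (int nb : {t - 1, t + 1}) {
                if (nb < 0 || nb >= n_threads) continue;
                while (done[nb].load(std::memory_order_acquire) < s)
                    std::this_thread::yield();
            }
            for (std::size_t j = lo; j < hi; ++j)
                dst[j] = src[j - 1] + src[j] + src[j + 1];
            done[t].store(s + 1, std::memory_order_release);
            std::swap(src, dst);
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t)
        threads.emplace_back(worker, t);
    for (auto& th : threads)
        th.join();

    // After an odd number of sweeps the result lives in `out`
    if (n_sweeps % 2 != 0)
        in.swap(out);
}
```

With this scheme a thread can run at most one sweep ahead of its neighbours, and a slow thread only delays the threads next to it instead of the whole team.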
Many-core systems are now nearly always NUMA systems. Given your configuration, you are most likely using an AMD Threadripper processor, which has multiple NUMA nodes. This means you should care about locality and the NUMA allocation policy. The default policy is generally first touch. This means that if your initialization is sequential, or threads are mapped differently during initialization and computation, then cores have to fetch data from remote NUMA nodes, which is slow. In the worst case, all cores access the same NUMA node and saturate it, resulting in a possibly slower execution than the sequential version. You should generally make the initialization parallel for better performance. Getting high performance on such an architecture is far from easy. You need to carefully control the allocation policy (numactl can help with that), the initialization (parallel), the thread binding (with taskset, hwloc and/or environment variables) and the memory access pattern (by reading articles/books about how NUMA machines work and applying dedicated algorithms).
By the way, there is probably a false-sharing effect happening in your code, because cache lines of array_out are almost certainly shared between threads. This should not have a critical impact, but it does contribute to poor scalability (especially on your 64-core processor). The general solution to this problem is to align the array in memory on a cache line and take care that the parallel splitting is done on a cache-line boundary. Alternatively, you can allocate each thread's part of the array in that thread. This is generally a better approach, as it ensures data is locally allocated and locally filled, and it makes data sharing/communication between NUMA nodes and even cores more explicit (i.e. better control), though it can make the code more complex (there is no free lunch).
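Splitting on a cache-line boundary can be sketched as a small index calculation. The helper name chunk_begin is hypothetical, a 64-byte cache line is assumed, and the array itself is assumed to start on a 64-byte boundary (for example via an aligned allocation).

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;                         // assumed line size in bytes
constexpr std::size_t kPerLine = kCacheLine / sizeof(double);  // doubles per cache line

// First index of thread t's chunk when n doubles are split among n_threads
// so that every chunk starts on a multiple of kPerLine elements.
// The chunk of thread t is [chunk_begin(n, p, t), chunk_begin(n, p, t + 1)).
std::size_t chunk_begin(std::size_t n, std::size_t n_threads, std::size_t t)
{
    const std::size_t lines = (n + kPerLine - 1) / kPerLine;  // cache lines covering the array
    const std::size_t first_line = lines * t / n_threads;     // distribute whole lines
    return std::min(first_line * kPerLine, n);
}
```

With n = 100 and 4 threads this gives chunk starts 0, 24, 48 and 72, so no cache line is written by two threads.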
Sharing data across threads is slow. This is because each CPU core has at least one layer of personal cache. The minute you share data between cores/threads, the personal caches need to be synchronised which is slow.
Threads running in parallel across different cores work best if they do not share data.
In your case, if data is read only you might be best off giving each thread its own copy of the data. For read write data, you have to accept the synchronisation slowdown.

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches, and it works fine. To speed it up I'm implementing simple multithreading: I split the search into its main branches and distribute them among the threads. The threads don't have to interact with each other, and when a solve is found I add it to a common std::vector using a mutex this way:
if (CubeTest.IsSolved())
{ // Solve algorithm found
    std::lock_guard<std::mutex> guard(SearchMutex); // Thread safe code
    Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic store (heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always with the same search but changing the number of threads used, and I found things that I didn't expect.
I know that if you double the number of threads (if the processor has enough capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I execute my code, things are as expected up to the sixth thread, but if I use one additional thread the performance gets worse. Using more threads increases the performance very little, to the point that using all available threads (12) is barely better than using only 6:
Threads vs processing time chart for Xeon X5650:
(I repeat the test several times and I show the average times of all the tests).
I repeat the tests in other computer with an Intel i7-4600U (2 cores / 4 threads) and I found this:
Threads vs processing time chart for i7-4600U:
I understand that with fewer cores the performance gain from using more threads is smaller.
I also think that when you start to use the second thread on the same core, the performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is whether these performance gains from multithreading are what I can expect in the real world, or whether these numbers are telling me that I'm doing things wrong and should learn more about multithreaded programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement that one can hope for is a reduction of runtime by a factor equal to the number of cores [1]. In most cases this is unachievable because of the need for threads to synchronise with one another.
In worst case, not only is there no improvement due to lack of parallelism, but also the overhead of synchronisation as well as cache contention can make the runtime much worse than the single threaded program.
Peak memory use often increases linearly by number of threads because each thread needs to operate on data of their own.
Total CPU time usage, and therefore energy use also increases due to extra time spent on synchronisation. This is relevant to systems that operate on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
[1] Whether you get all of the performance out of "logical" cores, i.e. "hyper threading" or "clustered multi threading", also depends on many factors. Often, one executes the same function in all threads, in which case they tend to use the same parts of the CPU, in which case sharing the core between multiple threads doesn't necessarily yield a benefit.
A CPU which uses hyperthreading claims to be able to execute two threads simultaneously on one core. But actually it doesn't. It just pretends to be able to do that. Internally it performs preemptive multitasking: Execute a bit of thread A, then switch to thread B, execute a bit of B, back to A and so on.
So what's the point of hyperthreading at all?
The thread switches inside the CPU are faster than thread switches managed by the thread scheduler of the operating system. So the performance gains are mostly through avoiding overhead of thread switches. But it does not allow the CPU core to perform more operations than it did before.
Conclusion: The performance gain you can expect from concurrency depends on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive. So the less locking you can get away with the better. When you have multiple threads filling the same result set, then it can sometimes be better to let each thread build their own result set and then merge those sets later when all threads are finished.
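The "one result set per thread, merge at the end" idea could look roughly like this. It is only a sketch: search_range is a hypothetical stand-in for the real tree search (it merely collects multiples of 7), and parallel_search is an invented name.

```cpp
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real search: collects multiples of 7 in [lo, hi)
void search_range(int lo, int hi, std::vector<int>& found)
{
    for (int x = lo; x < hi; ++x)
        if (x % 7 == 0)
            found.push_back(x);
}

// Each thread fills a private vector without any locking; the partial
// results are concatenated once all threads have been joined.
std::vector<int> parallel_search(int limit, int n_threads)
{
    std::vector<std::vector<int>> per_thread(n_threads);
    std::vector<std::thread> threads;

    for (int t = 0; t < n_threads; ++t) {
        const int lo = static_cast<int>(static_cast<long long>(limit) * t / n_threads);
        const int hi = static_cast<int>(static_cast<long long>(limit) * (t + 1) / n_threads);
        threads.emplace_back([&per_thread, t, lo, hi] {
            std::vector<int> local;            // private while searching
            search_range(lo, hi, local);
            per_thread[t] = std::move(local);  // single write to this thread's slot
        });
    }
    for (auto& th : threads)
        th.join();

    std::vector<int> all;
    for (const auto& part : per_thread)
        all.insert(all.end(), part.begin(), part.end());
    return all;
}
```

No mutex is needed because no two threads ever touch the same vector, and join() makes the partial results visible to the merging thread.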

Is this behavior showing that I have a memory problem?

I have a LP problem with ~4 million variables and ~4 million constraints. I use gurobi to solve it. My PC has 4 cores and 8 GB memory.
According to the log file, it takes ~100 seconds to find the optimal solution. Then the CPU is released, but almost all of the memory is still in use. It hangs there, doing nothing for hours, until it continues to run the script (e.g. the print command) after the solve.
results = opt.solve(model, tee=True)
print("model solved")
I used the barrier method with crossover disabled, which worked best. I also tried different numbers of threads; it turned out that using 4 is the best in terms of the hanging time (but it is still hours).
This hanging significantly increases the total run time, which is not desired.
I plan to upgrade the memory, but want to get answers from the community that it indeed is a memory issue. Is this a memory problem?
Likely the problem does not fit in memory and virtual memory (i.e. disk) is used. This is called thrashing when it is really bad. It can bring your machine to its knees. Depending on the number of nonzeros in the problem, the presolve statistics and the number of threads you are using, you need at least 16 GB (and may be more like 32 GB).
Also: try to reduce the number of threads Gurobi is using. It may be better to use 1 thread (after benchmarking which LP algorithm works best: primal or dual simplex or a barrier method). By default a concurrent LP method is used: use different LP solvers in parallel, significantly increasing the memory footprint.

What is the expected speed-up of using parallelization in C++ (not OpenMP, but <thread>)

What is the expected theoretical speed-up of using parallelization in C++?
For example, say I have 2 cores, and 4 logical processors. If I use a fully optimized parallel program to execute some tasks for me using 4 threads working at maximum capacity, how much of a speed-up over the serial code can I expect? Twice as fast? Four times as fast?
Please provide a reference for your answer.
And please do not close this question as being too broad or not containing a code sample. Providing a code sample would defeat the purpose of the question, since I am in search of a general, theoretical answer that might be used in a sales pitch for parallel computing. I am NOT wondering about the particular efficiency of some particular piece of code.
There is no limit imposed by using <thread>. It creates OS threads so can scale linearly with how many cores you have.
Now for the question of real cores vs. logical processors (Hyperthreading, SMT) you might find interesting. There is also lots of other benchmarks out there.
SMT is generally good when it can hide memory latency. So the speedup of SMT you can gain is purely dependent on your application (is it compute heavy, is it memory heavy?) and the only way to find is benchmark.
There is no specific number.
More practically, there is nothing in std::thread that has to impede linear scaling. And that translates to the real world. Using dozens of CPU cores is trivial with std::thread.

in a worst case how much QPI latency can slow-down arbitrary application?

I'm developing low-latency HFT trading application.
I'm using a single-CPU machine, because it's much easier to configure and maintain (no need to tune NUMA). Also, assuming we have enough resources, it should definitely not be slower than a dual-CPU setup, and it will likely be a little faster, because there is no QPI/NUMA latency.
HFT requires a lot of resources, and I now realize that I want many more cores. Also, colocating two 1U single-CPU machines is much more expensive than colocating one 1U dual-CPU machine, so even assuming I can "split" my program in two, it still makes sense to use a 1U dual-CPU machine.
So how much should I fear QPI/NUMA latency? If I move my application from a single-CPU machine to a dual-CPU machine, how much slower can it be? At most I can afford a delay of several microseconds, but not more. Can QPI/NUMA introduce a significant delay if not tuned correctly, and how significant would this delay be?
Is it possible to write an application which runs much slower (more than several microseconds slower) on a dual-CPU setup than on a single-CPU setup, i.e. runs much slower on a faster computer? (Of course assuming we have the same processors, memory, network card and everything else.)
This is not trivially answerable, since it depends on so many factors. Is the code written for NUMA?
Is the code doing mostly reads, mostly writes or about equal? How much data is shared between threads that run on separate CPUs? How often is such data written to, forcing cache-refresh?
How do tasks get scheduled, and how and when does the OS decide to move threads from one CPU socket to the next?
Does the code and data fit in cache?
Those are just a few factors that will change the results dramatically between a "works really well" and "gives really poor performance".
As with EVERYTHING performance-related, details can make a huge difference, and reading answers like this one on the internet will not give you a reliable answer that applies to YOUR situation. Benchmark your application, check performance counters and tweak based on that. [Given the price of a machine with the specs you describe in the comments above, I'd expect the supplier would allow some sort of test, demo, "try before you buy", etc.]
Assuming you have a worst case scenario, a memory access will be straddling two cache-lines (unaligned access of a 8-byte value, for example), which is split between your worst placed CPUs, and the MMU needs reloading, each of those page-table entries also being in the worst possible CPUs, and since the memory for that pair of memory locations is in different locations, needing new TLB entries for each of the two 4-byte reads to load your 64-bit value. (Each TLB entry is a separate location).
This means 2 x 4 x n, where n is something like 50-100 ns. So one memory access could, at least in theory, take 800 ns, i.e. 0.8 microseconds. It's unlikely that you will get MUCH worse than this for a single operation. The overhead is a lot less than, for example, swapping to disk, which can add milliseconds to your execution time.
It is not very hard to write code that updates the same cache line on multiple CPUs and thus causes a dramatic reduction in performance. I remember, a long time back when I first had an Athlon SMP system, running a simple benchmark where the author did this for a Dhrystone benchmark:
int numberOfRuns[MAX_CPUS];
Now, numberOfRuns is the outer loop-counter, and updating that for each loop, on either CPU, would cause "false sharing" (so each time the counter was updated, the other CPU had to flush that cache-line).
Running this on a 2-core SMP system gave 30% of the single-CPU performance. So 3 times SLOWER than on one CPU, rather than faster as you'd expect. (This was some 12 or so years ago, so my memory may be a little "off" on the exact details, but the essence of this story is still true: a badly written application can run slower on multiple cores than on a single core.)
I'd expect at least that bad performance on a modern system where you have false sharing of commonly used variables.
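A common way to avoid this kind of false sharing is to give each counter its own cache line. The sketch below assumes a 64-byte cache line; the PaddedCounter type and the value chosen for MAX_CPUS are illustrative and not taken from the original benchmark.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes
constexpr int MAX_CPUS = 8;             // hypothetical value for illustration

// alignas pads the struct to a full cache line, so two neighbouring
// counters can never end up in the same line.
struct alignas(kCacheLine) PaddedCounter {
    int value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one cache line");

// Drop-in replacement for "int numberOfRuns[MAX_CPUS];"
PaddedCounter numberOfRuns[MAX_CPUS];
```

Each CPU then increments numberOfRuns[cpu].value, and an update on one CPU no longer invalidates the line that holds another CPU's counter.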
In comparison, well-written code should run near N times faster, if there is little or no sharing between CPU cores. I have a highly CPU-bound, multithreaded, calculator for weird numbers, which gives near n-times performance gain both on my single-socket system at home and my two-socket system at work.
$ time ./weird -t 1 -e 100000
real 0m22.641s
user 0m22.660s
sys 0m0.003s
$ time ./weird -t 6 -e 100000
real 0m5.096s
user 0m25.333s
sys 0m0.005s
So about 11% overhead. That is sharing one variable [current number] which is atomically updated between threads (using C++ standard atomics). Unfortunately, I don't have a good example of "badly written code" to contrast this against.
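For illustration, sharing a single atomically updated "current number" between threads could look something like the sketch below. This is not the actual weird-number program: is_interesting is a hypothetical placeholder for the expensive per-number test, and count_interesting is an invented name.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical placeholder for the expensive, CPU-bound test on one number
bool is_interesting(int n)
{
    return n % 3 == 0;
}

// Threads repeatedly claim the next number from one shared atomic counter,
// so the only shared write in the hot loop is a single fetch_add.
int count_interesting(int limit, int n_threads)
{
    std::atomic<int> current{0};  // the shared "current number"
    std::atomic<int> found{0};

    auto worker = [&] {
        int local = 0;  // per-thread tally, merged once at the end
        for (;;) {
            const int n = current.fetch_add(1, std::memory_order_relaxed);
            if (n >= limit)
                break;
            if (is_interesting(n))
                ++local;
        }
        found.fetch_add(local, std::memory_order_relaxed);
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();

    return found.load();
}
```

When the work per number is large compared with one atomic increment, contention on the counter stays small, which is consistent with the modest overhead measured above.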