OpenMP - sections - c++

I wrote an application using OpenMP. I created two sections and put an object into each of them. Each object calls a method that runs for roughly 22-23 seconds. The two sections are independent.
When I set num_threads(1), the application takes 46 seconds to run. That's ok, because 2×23=46.
When I set num_threads(2), the application takes 35 seconds to run, but I was expecting ~25 seconds.
As I said, the sections are independent. cm1 and cm2 don't use any external variables. So, could anyone tell me why my app is 10 seconds slower than I expected? Is there some low-level synchronization going on?
t1 = clock();
#pragma omp parallel num_threads(2)
{
    #pragma omp sections
    {
        #pragma omp section
        {
            Cam cm1;
            cm1.solveUsingCost();
        }
        #pragma omp section
        {
            Cam cm2;
            cm2.solveUsingTime();
        }
    }
}
t2 = clock();

How many CPUs or cores do you have? If, for example, you have only 2 physical cores, one of them also has to run all the other programs plus the OS, so one of your two threads will be slowed down.
Another possibility is that the L3 cache of your CPU is just large enough to hold the data of one calculation completely, but when you run two calculations in parallel, twice as much memory is used and some of it has to be moved out of L3 to RAM (note that most multi-core CPUs share the L3 cache between cores). That slows the calculations down a lot and could explain the results you describe.
These are only guesses, though; there can be many other reasons why you don't get a factor-of-2 speedup when running your calculation in parallel.
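If you want to double-check what your machine reports, here is a minimal sketch; note that both calls return the number of logical processors (hardware threads), not physical cores:

#include <omp.h>
#include <thread>
#include <cstdio>

int main()
{
    // Both values count logical processors (hardware threads), so a 2-core
    // CPU with Hyper-Threading reports 4 here.
    printf("omp_get_num_procs():                 %d\n", omp_get_num_procs());
    printf("std::thread::hardware_concurrency(): %u\n",
           std::thread::hardware_concurrency());
    return 0;
}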
Update:
Of course, what I forgot until you mentioned your CPU being an i5: i5 and i7 processors have a feature called "Turbo Boost" that raises their clock speed, in your case from 3.3 to 3.6 GHz. However, this is only done when most cores are idle (for thermal reasons, I think) and a single core is boosted. Two cores will therefore not have double the throughput of one core, because they run at a lower clock speed.

Judging from your replies to the previous answer and the comments, my guess is that your two functions, solveUsingCost() and solveUsingTime(), are quite memory-intensive or at least limited by memory bandwidth.
What exactly are you computing? And how? What is, roughly, the ratio of arithmetic operations to memory accesses? What is your memory access pattern, e.g. do you run through a large array several times?

Related

Is there an idiomatic way to solve small isolated independent tasks with CUDA?

I wrote my first CUDA program, which I am trying to speed up. I wonder whether this is even possible, since the problem is not really suited to SIMD processing (single instruction, multiple data). It is more of a "single function, multiple data" problem: I have many similar tasks that need to be solved independently.
My current approach is:
__device__ bool solve_one_task_on_device(TaskData* current){
    // Do something completely independent of the other threads (no SIMD possible).
    // In my case each task contains an array of 100 elements; the function
    // loops over this array and often backtracks, depending on its current
    // array value, until a solution is found.
}

__global__ void solve_all_tasks_in_parallel(TaskData* p, int count){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        solve_one_task_on_device(&p[i]);
    }
}

int main() {
    ...
    // tasks_data is an array of e.g. 4096 tasks (or more) copied to device memory
    solve_all_tasks_in_parallel<<<64, 64>>>(tasks_data, tasks_data_count);
    ...
}
What would be the idiomatic way to do this, if it should be done at all?
I have already read some threads about the topic. I have 512 CUDA cores. What I am less sure about is whether each CUDA core can really solve a task independently, so that instructions are not synchronized. How many task-parallel threads can I start, and what would be the recommended way to do the parallelization?
It is an experiment, I want to see if CUDA is useful for such problems at all.
Currently, running my function in parallel on the CPU is much faster. I guess I cannot get similar performance on my GPU unless I have a SIMD problem, correct?
My hardware is:
CPU: Xeon E5-4650L (16 Cores)
GPU: NVIDIA Quadro P620
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 6.1
I have 512 CUDA cores.
Remember "CUDA cores" is just NVIDIA marketing speech. You don't have 512 cores. A Quadro P620 has 4 cores; and on each of them, multiple warps of 32 threads each can execute the same instruction (for all 32). So, 4 warps, each executing an instruction, on each of 4 cores, makes 512. In practice, you usually get less than 512 instructions executing per clock cycle.
What I am less sure about is whether each CUDA core can really solve a task independently, so that instructions are not synchronized.
So, no. But if you arrange your tasks so that the logic is the same for many of them and only the data is different, then there's a good chance you'll be able to execute these tasks in parallel effectively. Of course there are many other considerations, like data layout, access patterns, etc., to keep in mind.
On different cores, and even on the same core with different warps, you can run very different tasks, completely independently past the point in the kernel where you choose your task and code paths diverge.

How to optimize code for Simultaneous Multithreading?

Currently, I am learning parallel processing on the CPU, which is a well-covered topic with plenty of tutorials and books.
However, I could not find a single tutorial or resource that talks about programming techniques for hyper threaded CPU. Not a single code sample.
I know that to utilize hyper-threading, the code must be written so that different parts of the CPU can be used at the same time (the simplest example is doing integer and floating-point calculations simultaneously), so it's not plug-and-play.
Which book or resource should I look at if I want to learn more about this topic? Thank you.
EDIT: when I said hyper threading, I meant Simultaneous Multithreading in general, not Intel's hyper threading specifically.
Edit 2: for example, if I have an 8-core i7 CPU, I can write a sorting algorithm that runs 8 times faster when it uses all 8 cores instead of 1. But it will run the same on a 4-core CPU and a 4c/8t CPU, so in my case SMT does nothing.
Meanwhile, Cinebench will run much better on a 4c-8t CPU than on a 4c-4t CPU.
SMT is generally most effective when one thread is loading something from memory. Depending on where the data comes from (L1, L2 or L3 cache, or RAM), the read/write latency can span many CPU cycles that would be wasted doing nothing if only one thread were executed per core.
So, if you want to maximize the impact of SMT, try to interleave the memory accesses of two threads so that one of them can execute instructions while the other reads data. Theoretically, you can also use a thread just for cache warming, i.e. loading data from RAM or main storage into the cache for subsequent use by other threads.
How to apply this successfully can vary from one system to another, because the access latencies of cache, RAM and main storage, as well as their sizes, may differ by a lot.
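As a toy illustration of the idea (not a benchmark; it simply hopes the OS schedules the two std::thread workers on SMT siblings of the same core, which you would normally enforce with affinity settings):

#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    std::vector<double> data(1 << 24, 1.0);   // ~128 MB, far larger than any cache
    double mem_sum = 0.0;
    double alu_val = 1.0;

    // Thread 1: memory-bound, spends most of its time waiting on loads from RAM.
    std::thread memory_bound([&] {
        for (double x : data) mem_sum += x;
    });

    // Thread 2: compute-bound, works entirely in registers.
    std::thread compute_bound([&] {
        for (long long i = 0; i < 400000000LL; ++i)
            alu_val = alu_val * 1.0000001 + 1e-9;
    });

    memory_bound.join();
    compute_bound.join();
    printf("%f %f\n", mem_sum, alu_val);      // keep the results alive
    return 0;
}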

Enabling OpenMP support in Visual Studio 2017 slows down code

I am trying to use OpenMP to speed up my code for neural network computation. As I am using Visual Studio 2017, I need to enable OpenMP support in the property sheets. However, after doing that, part of the code slows down by around 5 times even though I did not include any #pragma omp in the code.
I have isolated the sections and found out that this particular function is causing the problem:
void foo(Eigen::Matrix<float,3,Eigen::Dynamic> inputPts)
{
    std::vector<Eigen::MatrixXf> activation;
    activation.reserve(layerNo);
    activation.push_back(inputPts);
    int inputNo = inputPts.cols();

    for (int i = 0; i < layerNo - 2; i++)
        activation.push_back(((weights[i]*activation[i]).colwise() + bias[i]).array().tanh());
    activation.push_back((weights[layerNo - 2]*activation[layerNo - 2]).colwise() + bias[layerNo - 2]);

    val = activation[layerNo - 1]/scalingFactor;

    std::vector<Eigen::MatrixXf> delta;
    delta.reserve(layerNo);
    Eigen::Matrix<float, 1, Eigen::Dynamic> seed;
    seed.setOnes(1, inputNo);
    delta.push_back(seed);

    for (int i = layerNo - 2; i >= 1; i--)
    {
        Eigen::Matrix<float,Eigen::Dynamic,Eigen::Dynamic>
            d_temp     = weights[i].transpose()*delta[layerNo - 2 - i],
            d_temp2    = 1 - activation[i].array().square(),
            deltaLayer = d_temp.cwiseProduct(d_temp2);
        delta.push_back(deltaLayer);
    }
    grad = weights[0].transpose()*delta[layerNo - 2];
}
The two for-loops are the ones that slow down significantly (from ~3 ms to ~20 ms). Strangely, although this function is called many times in the program, only some of the calls are affected.
I have included the header <omp.h>. I am not sure whether it is due to the Eigen library, which is used everywhere. I tried defining EIGEN_DONT_PARALLELIZE and calling Eigen::initParallel() as suggested on the official site, but it does not help.
The weird thing is that I did not include any parallel pragma at all, so there should not be any overhead from handling OpenMP functions. Why is it still slowing down?
Eigen's matrix-matrix products are multi-threaded by default if OpenMP is enabled. The problem is likely the combination of:
Your CPU is hyper-threaded, e.g., you have 4 physical cores able to run 8 threads.
OpenMP does not provide a way to query the number of physical cores, so Eigen will launch 8 threads.
Eigen's matrix-matrix product kernel is fully optimized and exploits nearly 100% of the CPU capacity. Consequently, there is no room for running two such threads on a single core, and the performance drops significantly (cache pollution).
The solution is thus to limit the number of OpenMP threads to the number of physical cores, for instance by setting the OMP_NUM_THREADS environment variable. You can also disable Eigen's multithreading by defining the macro EIGEN_DONT_PARALLELIZE at compile time.
More info in the doc.
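For instance, a minimal sketch (the hard-coded 4 here is just an assumption standing in for your actual number of physical cores):

#include <omp.h>
#include <Eigen/Dense>

int main()
{
    omp_set_num_threads(4);   // cap all OpenMP parallel regions
    Eigen::setNbThreads(4);   // or cap only Eigen's internal threading
    // ... run the network evaluation as before ...
    return 0;
}

Running the program with OMP_NUM_THREADS=4 set in the environment achieves the same effect without recompiling.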
More details on how hyper-threading can decrease performance:
With hyper-threading you have two threads running in an interleaved fashion on a single core. They alternate every instruction. If your threads are not using more than half of the resources of the CPU (in terms of computation), then that's a win because you will exploit more computing units. But if a single thread is already using 100% of the computing units (as in the case of a well-optimized matrix-matrix product), then you lose performance because of 1) the natural overhead of managing two threads and 2) the L1 cache now being shared by two different tasks. Matrix-matrix kernels are designed with the precise L1 capacity in mind. With two threads, your L1 cache becomes nearly ineffective. This means that instead of fetching data from the very fast L1 cache most of the time, you end up accessing the much slower L2 cache, and thus you get a huge performance drop. Unlike on Linux and Windows, on OSX I don't observe such a performance drop, most likely because the system is able to unschedule the second thread if the CPU is already too busy.

CPU speed and threads in C++

I have the following C++ program:
void testSpeed(int start, int end)
{
    int temp = 0;
    for (int i = start; i < end; i++)
    {
        temp++;
    }
}

int main() {
    using namespace boost;
    timer aTimer;

    // start two new threads that call the "testSpeed" function
    boost::thread my_thread1(&testSpeed, 0, 500000000);
    boost::thread my_thread2(&testSpeed, 500000000, 1000000000);

    // wait for both threads to finish
    my_thread1.join();
    my_thread2.join();

    double elapsedSec = aTimer.elapsed();
    double IOPS = 1/elapsedSec;
}
So the idea is to test the CPU speed in terms of integer operations per second (IOPS).
There are 1 billion iterations (operations), so on a 1 GHz CPU we should get around a billion integer operations per second, I believe.
My assumption is that more threads = more integer operations per second. But the more threads I try, the fewer operations per second I see (I have more cores than threads).
What may be causing such behavior? Is it the thread overhead? Maybe I should try a much longer experiment to see whether the threads actually help?
Thank you!
UPDATE:
So I changed the loop to run 18 billion times and declared temp as volatile. I also added another testSpeed method with a different name, so now the single-threaded run executes both methods one after the other, while the two-threaded run gives each thread one method; there shouldn't be any synchronization issues, etc. And... still no change in behavior! The single-threaded run is faster according to the timer. Ahhh! I found the culprit: apparently the timer is bluffing. The two threads take half the time to finish, but the timer tells me the single-threaded run was two seconds faster. I'm now trying to understand why... Thanks everyone!
I am almost certain that the compiler optimizes away your loops. Since you do not subtract the overhead of creating and synchronizing the threads, you actually measure only that overhead. So the more threads you have, the more overhead you create and the more time it takes.
Overall, you can refer to the documentation of your CPU to find out its frequency and how many cycles any given instruction takes. Testing it yourself with an approach like this is nearly impossible and, well, useless, because of overheads like context switches, transferring execution from one CPU/core to another, scheduler swap-outs and branch misprediction. In real life you will also encounter cache misses and a lot of memory bus latency, since there are no programs that fit into ~15 registers. So you are better off testing a real program with a good profiler. For example, recent CPUs can report CPU stalls, cache misses, branch mispredictions and a lot more. You can also use a good profiler to decide when and how to parallelize your program.
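If you still want to measure the loop itself, one way to at least make the work observable (a sketch; note that a smart compiler may still reduce this particular loop to end - start) is:

#include <cstdio>

long long testSpeed(int start, int end)
{
    long long temp = 0;
    for (int i = start; i < end; i++)
        temp++;
    return temp;                     // return the result so the loop is not a dead store
}

int main()
{
    volatile long long sink = testSpeed(0, 500000000);  // force the result to be materialized
    printf("%lld\n", sink);
    return 0;
}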
As the number of threads increases beyond a certain point, the number of cache misses increases (the cache is shared among the threads), but at the same time memory access latency is masked by the large number of threads (while one thread is waiting for data to be fetched from memory, other threads are running). Hence there is a trade-off. Here is an interesting paper on this subject.
According to this paper, on a multi-core machine, when the number of threads is very low (of the order of the number of cores), performance increases as you add threads, because the cores become fully utilized.
After that, a further increase in the number of threads lets the effect of cache misses dominate, leading to a degradation in performance.
If the number of threads becomes very large, such that the amount of cache per thread becomes almost zero, all memory accesses are made from main memory. But at the same time, the large number of threads very effectively masks the increased memory access latency. This time the second effect dominates, leading to an increase in performance.
Thus the valley in the middle is the region with the worst performance.

OpenMP and cores/threads

My CPU is a Core i3 330M with 2 cores and 4 threads. When I execute the command cat /proc/cpuinfo in my terminal, it looks as if I have 4 CPUs. When I use the OpenMP function omp_get_num_procs() I also get 4.
Now I have a standard C++ vector class, I mean a fixed-size double array class that does not use expression templates. I have carefully parallelized all the methods of my class and I get the "expected" speedup.
The question is: can I guess the expected speedup in such a simple case? For instance, if I add two vectors without parallelized for-loops, I get some time (using the shell time command). Now, if I use OpenMP, should I get a time divided by 2 or 4, according to the number of cores/threads? I emphasize that I am only asking about this particular simple problem, where there is no interdependence in the data and everything is linear (vector addition).
Here is some code:
Vector Vector::operator+(const Vector& rhs) const
{
    assert(m_size == rhs.m_size);
    Vector result(m_size);
    #pragma omp parallel for schedule(static)
    for (unsigned int i = 0; i < m_size; i++)
        result.m_data[i] = m_data[i] + rhs.m_data[i];
    return result;
}
I have already read this post: OpenMP thread mapping to physical cores.
I hope that somebody can tell me more about how OpenMP gets the work done in this simple case. I should say that I am a beginner in parallel computing.
Thanks!
EDIT: Now that some code has been added.
In that particular example, there is very little computation and lots of memory access. So the performance will depend heavily on:
The size of the vector.
How you are timing it (do you have an outer loop for timing purposes?).
Whether the data is already in cache.
For larger vector sizes, you will likely find that performance is limited by your memory bandwidth, in which case parallelism is not going to help much. For smaller sizes, the overhead of threading will dominate. If you're getting the "expected" speedup, you're probably somewhere in between, where the result is optimal.
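One way to check which regime you are in is a rough timing harness like the sketch below (it reuses your Vector class and adds an outer repetition loop with the OpenMP wall clock, so per-call overhead averages out):

#include <omp.h>
#include <cstdio>

void benchmarkAdd(const Vector& a, const Vector& b, int repetitions)
{
    double t0 = omp_get_wtime();
    for (int r = 0; r < repetitions; r++)
    {
        Vector c = a + b;   // the parallelized operator+ from the question
        (void)c;            // keep the result nominally used
    }
    double t1 = omp_get_wtime();
    printf("average time per addition: %g s\n", (t1 - t0) / repetitions);
}

Try it for several vector sizes; the crossover between "threading overhead dominates" and "memory bandwidth dominates" should become visible.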
I refuse to give hard numbers because, in general, "guessing" performance, especially in multi-threaded applications, is a lost cause unless you have prior testing knowledge or intimate knowledge of both the program and the system it's running on.
Just as a simple example taken from my answer here: How to get 100% CPU usage from a C program
On a Core i7 920 @ 3.5 GHz (4 cores, 8 threads):
If I run with 4 threads, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 39.3498 seconds
If I run with 4 threads and explicitly (using Task Manager) pin the threads on 4 distinct physical cores, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 30.4429 seconds
So this shows how unpredictable it is even for a very simple and embarrassingly parallel application. Applications involving heavy memory usage and synchronization get a lot uglier...
To add to Mystical's answer: your problem is purely memory-bandwidth bound. Have a look at the STREAM benchmark. Run it on your computer in the single- and multi-threaded cases and look at the Triad results - this is your case (well, almost, since your output vector is at the same time one of your input vectors). Calculate how much data you move around and you will know exactly what performance to expect.
Does multi-threading work for this problem? Yes. It is rare that a single CPU core can saturate the entire memory bandwidth of the system. Modern computers balance the available memory bandwidth against the number of cores available. In my experience, you need around half of the cores to saturate the memory bandwidth with a simple memcpy operation. It might take a few more if you do some calculations along the way.
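If you just want a ballpark figure without installing anything, a rough Triad-style kernel (not a substitute for the real STREAM benchmark) could look like the sketch below; it assumes each element costs two 8-byte loads and one 8-byte store:

#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const long long n = 50000000;   // three arrays of doubles, ~1.2 GB in total
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    const double s = 3.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];     // STREAM Triad: 2 loads + 1 store per element
    double t1 = omp_get_wtime();

    double bytes = 3.0 * sizeof(double) * n;
    printf("Triad bandwidth: %.2f GB/s\n", bytes / (t1 - t0) / 1e9);
    return 0;
}

Run it once with OMP_NUM_THREADS=1 and once with all cores to see how much of the total bandwidth a single core already uses.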
Note that on NUMA systems you will need to bind the threads to CPU cores and use local memory allocation to get optimal results. On such systems every CPU has its own local memory, to which access is fastest. You can still access the entire system memory as on a usual SMP, but that incurs a communication cost: the CPUs have to explicitly exchange data. Binding threads to CPUs and using local allocation is extremely important; failing to do so kills scalability. Check out libnuma if you want to do this on Linux.
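On Linux, besides numactl/libnuma, a common pattern is "first touch" placement combined with thread binding. A minimal sketch (run with, for example, OMP_PROC_BIND=close and OMP_PLACES=cores so the threads do not migrate):

#include <omp.h>

double* numaFriendlyAlloc(long long n)
{
    double* x = new double[n];      // allocation alone does not commit pages to a NUMA node
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; i++)
        x[i] = 0.0;                 // first touch: each page lands on the touching thread's node
    return x;
}

If later loops over x use the same schedule(static) partitioning, each thread mostly reads and writes memory on its own NUMA node.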