Enabling OpenMP support in Visual Studio 2017 slows down code - visual-studio-2017

I am trying to use OpenMP to speed up my code for neural-network computation. As I am using Visual Studio 2017, I need to enable OpenMP support in the property sheets. However, after I did that, some parts of the code slowed down by around a factor of 5, even though I did not include any #pragma omp directives in the code.
I have isolated the sections and found out that this particular function is causing the problem:
void foo(Eigen::Matrix<float, 3, Eigen::Dynamic> inputPts)
{
    // layerNo, weights, bias, val, scalingFactor and grad are class members
    std::vector<Eigen::MatrixXf> activation;
    activation.reserve(layerNo);
    activation.push_back(inputPts);

    int inputNo = inputPts.cols();
    for (int i = 0; i < layerNo - 2; i++)
        activation.push_back(((weights[i] * activation[i]).colwise() + bias[i]).array().tanh());
    activation.push_back((weights[layerNo - 2] * activation[layerNo - 2]).colwise() + bias[layerNo - 2]);

    val = activation[layerNo - 1] / scalingFactor;

    std::vector<Eigen::MatrixXf> delta;
    delta.reserve(layerNo);
    Eigen::Matrix<float, 1, Eigen::Dynamic> seed;
    seed.setOnes(1, inputNo);
    delta.push_back(seed);

    for (int i = layerNo - 2; i >= 1; i--)
    {
        Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic>
            d_temp = weights[i].transpose() * delta[layerNo - 2 - i],
            d_temp2 = 1 - activation[i].array().square(),
            deltaLayer = d_temp.cwiseProduct(d_temp2);
        delta.push_back(deltaLayer);
    }
    grad = weights[0].transpose() * delta[layerNo - 2];
}
The two for-loops are the ones that slow down significantly (from ~3 ms to ~20 ms). Strangely, although this function is called many times in the program, only some of the calls are affected.
I have included the header file <omp.h>. I am not sure whether it is due to the Eigen library, which is used everywhere. I tried defining EIGEN_DONT_PARALLELIZE and calling Eigen::initParallel() as suggested on the official site, but it does not help.
The weird thing is that I did not include any parallel pragma at all, so there should not be any overhead for handling OpenMP functions. Why is it still slowing down?

Eigen's matrix-matrix products are multi-threaded by default if OpenMP is enabled. The problem is likely the combination of:
Your CPU is hyper-threaded, e.g., you have 4 physical cores able to run 8 threads.
OpenMP provides no way to query the number of physical cores, and thus Eigen will launch 8 threads.
Eigen's matrix-matrix product kernel is fully optimized and exploits nearly 100% of the CPU capacity. Consequently, there is no room for running two such threads on a single core, and the performance drops significantly (cache pollution).
The solution is thus to limit the number of OpenMP threads to the number of physical cores, for instance by setting the OMP_NUM_THREADS environment variable. You can also disable Eigen's multithreading by defining the macro EIGEN_DONT_PARALLELIZE at compile time.
More info in the doc.
More details on how hyper-threading can decrease performance:
With hyper-threading you have two threads running in an interleaved fashion on a single core, alternating every instruction. If each of your threads uses less than half of the CPU's resources (in terms of computation), that's a win because you exploit more computing units. But if a single thread is already using 100% of the computing units (as in the case of a well-optimized matrix-matrix product), then you lose performance because of 1) the natural overhead of managing two threads and 2) the L1 cache now being shared by two different tasks. Matrix-matrix kernels are designed with a precise L1 capacity in mind. With two threads, your L1 cache becomes nearly ineffective: instead of hitting the very fast L1 cache most of the time, you end up accessing the much slower L2 cache, and thus you get a huge performance drop. Unlike on Linux and Windows, on OSX I don't observe such a performance drop, most likely because the system is able to unschedule the second thread if the CPU is already too busy.

Related

Incorrect measurement of the code execution time inside OpenMP thread

So I need to measure the execution time of some code inside a for loop. Originally, I needed to measure several different activities, so I wrote a timer class to help me with that. After that I tried to speed things up by parallelizing the for loop using OpenMP. The problem is that when running my code in parallel, my time measurements become really different: the values increase by up to a factor of 10. To rule out a flaw inside the timer class, I started to measure the execution time of the whole loop iteration, so structurally my code looks something like this:
#pragma omp parallel for num_threads(20)
for (size_t j = 0; j < entries.size(); ++j)
{
    auto t1 = std::chrono::steady_clock::now();
    // do stuff
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "Execution time is "
              << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count()
              << std::endl;
}
Here are some examples of difference between measurements in parallel and measurements in single thread:
Single-threaded      Multi-threaded
11.363545868021      94.154685442
4.8963048650184      16.618173163
4.939025568          25.4751074
18.447368772         110.709813843
Even though these are only a couple of examples, this behaviour seems to prevail in all loop iterations. I also tried boost's chrono library and thread_clock, but got the same result. Am I misunderstanding something? What may be the cause of this? Maybe I get the cumulative time of all threads?
Inside the for loop, during each iteration, I read a different file. Based on this file I create and solve a multitude of mixed-integer optimisation models. I solve them with a MIP solver, which I set to run in one thread. The instance of the solver is created on each iteration. The only variables shared between iterations are constant strings representing paths to some directories.
My machine has 32 threads (16 cores, 2 threads per core).
Also here are the timings of the whole application in single-threaded mode:
real 23m17.763s
user 21m46.284s
sys 1m28.187s
and in multi-threaded mode:
real 12m47.657s
user 156m20.479s
sys 2m34.311s
A few points here.
What you're measuring corresponds (roughly) to what time returns as the user time--that is total CPU time consumed by all threads. But when we look at the real time reported by time, we see that your multithreaded code is running close to twice as fast as the single threaded code. So it is scaling to some degree--but not very well.
Reading a file in the parallel region may well be part of this. Even at best, the fastest NVMe SSDs can only support reads from a few (around three or four) threads concurrently before you're using the drive's entire available bandwidth (and if you're doing I/O efficiently, that may well be closer to two). If you're using an actual spinning hard drive, it's usually pretty easy for a single thread to saturate the drive's bandwidth. A PCIe 5 SSD should keep up with more threads, but I doubt even it has the bandwidth to feed 20 threads.
Depending on what parts of the standard library you're using, it's pretty easy to have some "invisible" shared variables. For one common example, code that uses Monte Carlo methods will frequently call rand(). Even though it looks like a normal function call, rand() will typically use a seed variable that's shared between threads, and every call to rand() not only reads but also writes that shared variable, so the calls to rand() all end up serialized.
You mention your MIP solver running in a single thread but say there's a separate instance per iteration, leaving it unclear whether the MIP solving code is really one thread shared between the 20 other threads, or whether you have one MIP solver instance running in each of the 20 threads. I'd guess the latter, but if it's really the former, then its being a bottleneck wouldn't be surprising at all.
Without code to look at, it's impossible to get really specific though.

Is there an idiomatic way to solve small isolated independent task with CUDA?

I wrote my first CUDA program, which I am trying to speed up. I ask myself whether this is possible, since the problem is not really appropriate for SIMD processing (single instruction, multiple data). It is more a "single function, multiple data" problem: I have many similar tasks to be solved independently.
My current approach is:
__device__ bool solve_one_task_on_device(TaskData* current)
{
    // Do something completely independent of other threads (no SIMD possible).
    // In my case each task contains an array of 100 elements;
    // the function loops over this array and often backtracks
    // depending on its current array value until a solution is found.
}

__global__ void solve_all_tasks_in_parallel(TaskData* p, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        solve_one_task_on_device(&p[i]);
    }
}

int main()
{
    ...
    // tasks_data is an array of e.g. 4096 tasks (or more) copied to device memory
    solve_all_tasks_in_parallel<<<64, 64>>>(tasks_data, tasks_data_count);
    ...
}
What would be the idiomatic way to do it? If it should be done at all.
I already read some threads about the topic. I have 512 CUDA cores. What I am less sure about is whether each CUDA core can really solve a task independently, so that instructions are not synchronized. How many task-parallel threads can I start, and what would be the recommended way to do the parallelization?
It is an experiment, I want to see if CUDA is useful for such problems at all.
Currently, running my function in parallel on the CPU is much faster. I guess I cannot get similar performance on my GPU unless I have a SIMD problem, correct?
My hardware is:
CPU: Xeon E5-4650L (16 Cores)
GPU: NVIDIA Quadro P620
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 6.1
I have 512 CUDA cores.
Remember that "CUDA cores" is just NVIDIA marketing speak. You don't have 512 cores. A Quadro P620 has 4 cores; on each of them, multiple warps of 32 threads each can execute the same instruction (for all 32). So, 4 warps, each executing an instruction, on each of 4 cores, makes 512. In practice, you usually get fewer than 512 instructions executing per clock cycle.
What I am less sure is if each CUDA Core can really independently solve a task so that instructions are not synchronized.
So, no. But if you arrange your tasks so that the logic is the same for many of them and only the data is different, then there's a good chance you'll be able to execute these tasks in parallel, effectively. Of course there are many other considerations, like data layout, access patterns etc to keep in mind.
On different cores, and even on the same core with different warps, you can run very different tasks, completely independently past the point in the kernel where you choose your task and code paths diverge.

Recursive parallel function not using all cores

I recently implemented a recursive negamax algorithm, which I parallelized using OpenMP.
The interesting part is this:
#pragma omp parallel for
for (int i = 0; i < (int) pos.size(); i++)
{
int val = -negamax(pos[i].first, -player, depth - 1).first;
#pragma omp critical
if (val >= best)
{
best = val;
move = pos[i].second;
}
}
On my Intel Core i7 (4 physical cores and hyper threading), I observed something very strange: while running the algorithm, it was not using all 8 available threads (logical cores), but only 4.
Can anyone explain why this is so? I understand the reasons the algorithm doesn't scale well, but why doesn't it use all the available cores?
EDIT: I changed "thread" to "core" as it better expresses my question.
First, check whether the iteration count, pos.size(), is large enough; obviously it must be at least the number of threads for all of them to get work.
Recursive parallelism is an interesting pattern, but it may not work very well with OpenMP, unless you're using OpenMP 3.0's task, Cilk, or TBB. There are several things that need to be considered:
(1) In order to use recursive parallelism, you usually need to explicitly call omp_set_nested(1). AFAIK, most implementations of OpenMP do not recursively spawn parallel for regions, because doing so may end up creating thousands of physical threads, simply overwhelming your operating system.
Before OpenMP 3.0's task, OpenMP had a roughly 1-to-1 mapping from a logical parallel task to a physical thread, so it won't work well for such recursive parallelism. Try it out, but don't be surprised if even thousands of threads are created!
(2) If you really want to use recursive parallelism with a traditional OpenMP, you need to implement code that controls the number of active threads:
if (get_total_thread_num() > TOO_MANY_THREADS) {
    // Do not use OpenMP
    ...
} else {
    #pragma omp parallel for
    ...
}
(3) You may consider OpenMP 3.0's task. In your code, there could be a huge number of concurrent tasks due to the recursion. To work efficiently on a parallel machine, there must be an efficient mapping of these logical concurrent tasks to physical threads (or logical processors, cores). Raw recursive parallelism in OpenMP will create actual physical threads; OpenMP 3.0's task does not.
You may refer to my previous answer related to a recursive parallelism: C OpenMP parallel quickSort.
(4) Intel's Cilk Plus and TBB support fully nested and recursive parallelism. In my small test program, their performance was far better than OpenMP 3.0's. But that was 3 years ago; you should check the latest OpenMP implementations.
I don't have detailed knowledge of negamax and minimax. But my gut says that using a recursive pattern and a lock is unlikely to give a speedup. A simple Google search gives me: http://supertech.csail.mit.edu/papers/dimacs94.pdf
"But negamax is not an efficient serial search algorithm, and thus, it
makes little sense to parallelize it."
The optimal parallelism level involves considerations beyond simply using as many threads as are available. For example, operating systems may schedule all threads of a single process onto a single processor to optimize cache performance (unless the programmer changes this explicitly).
I guess OpenMP makes similar considerations when executing such code, so you cannot always assume the maximum number of threads will be used.
Whaddya mean, all 8 available threads? A CPU like that can probably run hundreds of threads! You may believe that 4 cores with hyper-threading equates to 8 threads, but your OpenMP installation probably doesn't.
Check:
Has the environment variable OMP_NUM_THREADS been created and set? If it is set to 4, there's your answer: your OpenMP environment is configured to start only 4 threads, at most.
If that environment variable hasn't been set, investigate the use, and impact, of the OpenMP routines omp_get_num_threads() and omp_set_num_threads(). If the environment variable has been set then omp_set_num_threads() will override it at run time.
Whether 8 hyper-threads outperform 4 real threads.
Whether oversubscribing, e.g. setting OMP_NUM_THREADS to 16, does anything other than ruin performance.

OpenMP - sections

I wrote an application using OpenMP. I created two sections and placed two objects in them. Each object calls a method which runs for nearly 22-23 seconds. Both sections are independent.
When I set num_threads(1), the application takes 46 seconds to run. That's ok, because 2×23=46.
When I set num_threads(2), the application takes 35 seconds to run, but I was expecting ~25 seconds.
As I said, the sections are independent. cm1 and cm2 don't use any external variables. So, could anyone tell me why my app is 10 seconds slower than I expected? Is there some synchronization at a low level?
t1 = clock();
#pragma omp parallel num_threads(2)
{
    #pragma omp sections
    {
        #pragma omp section
        {
            Cam cm1;
            cm1.solveUsingCost();
        }
        #pragma omp section
        {
            Cam cm2;
            cm2.solveUsingTime();
        }
    }
}
t2 = clock();
How many CPUs or cores do you have? If, for example, you have only 2 physical cores, one of them will also have to run every other program plus the OS, which will slow down one of the threads.
Another possibility is that the L3 cache of your CPU is sufficient to hold the data of one calculation at a time completely. When doing two in parallel, twice as much memory is used, so some data may have to be evicted from the L3 cache to RAM (note that most multicore CPUs share the L3 cache between cores). This would slow down your calculations a lot and could lead to the described results.
However, these are only guesses; there could be many other reasons why you don't see a factor-of-2 speedup from running your calculation in parallel.
Update:
Of course, what I forgot until you mentioned your CPU being an i5: i5 and i7 processors have a "Turbo Boost" ability to increase their clock speed, in your case from 3.3 to 3.6 GHz. However, this only happens when most cores are idle (for thermal reasons, I think) and a single core is boosted. Therefore two cores will not have double the throughput of one core, because they will run at a lower clock speed.
Judging from your replies to the previous answer and comments, my guess is that your two functions, solveUsingCost() and solveUsingTime(), are quite memory-intensive or at least memory bandwidth limited.
What exactly are you computing? And how? What is, roughly, the ratio of arithmetic operations per memory access? How is your memory access patterned, e.g. do you run through a large array several times?

OpenMP and cores/threads

My CPU is a Core i3 330M with 2 cores and 4 threads. When I execute the command cat /proc/cpuinfo in my terminal, it looks like I have 4 CPUs. When I use the OpenMP function omp_get_num_procs() I also get 4.
Now I have a standard C++ vector class, I mean a fixed-size double array class that does not use expression templates. I have carefully parallelized all the methods of my class and I get the "expected" speedup.
The question is: can I guess the expected speedup in such a simple case? For instance, if I add two vectors without parallelized for-loops I get some time (using the shell time command). Now if I use OpenMP, should I get a time divided by 2 or 4, according to the number of cores/threads? I emphasize that I am only asking for this particular simple problem, where there is no interdependence in the data and everything is linear (vector addition).
Here is some code:
Vector Vector::operator+(const Vector& rhs) const
{
    assert(m_size == rhs.m_size);
    Vector result(m_size);
    #pragma omp parallel for schedule(static)
    for (unsigned int i = 0; i < m_size; i++)
        result.m_data[i] = m_data[i] + rhs.m_data[i];
    return result;
}
I have already read this post: OpenMP thread mapping to physical cores.
I hope that somebody will tell me more about how OpenMP gets the work done in this simple case. I should say that I am a beginner in parallel computing.
Thanks!
EDIT : Now that some code has been added.
In that particular example, there is very little computation and lots of memory access. So the performance will depend heavily on:
The size of the vector.
How you are timing it. (do you have an outer-loop for timing purposes)
Whether the data is already in cache.
For larger vector sizes, you will likely find that the performance is limited by your memory bandwidth. In which case, parallelism is not going to help much. For smaller sizes, the overhead of threading will dominate. If you're getting the "expected" speedup, you're probably somewhere in between where the result is optimal.
I refuse to give hard numbers because in general, "guessing" performance, especially in multi-threaded applications is a lost cause unless you have prior testing knowledge or intimate knowledge of both the program and the system that it's running on.
Just as a simple example taken from my answer here: How to get 100% CPU usage from a C program
On a Core i7 920 @ 3.5 GHz (4 cores, 8 threads):
If I run with 4 threads, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 39.3498 seconds
If I run with 4 threads and explicitly (using Task Manager) pin the threads on 4 distinct physical cores, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 30.4429 seconds
So this shows how unpredictable it is for even a very simple and embarrassingly parallel application. Applications involving heavy memory usage and synchronization get a lot uglier...
To add to Mystical's answer: your problem is purely memory-bandwidth bound. Have a look at the STREAM benchmark. Run it on your computer in the single- and multi-threaded cases and look at the Triad results; this is your case (well, almost, since your output vector is at the same time one of your input vectors). Calculate how much data you move around and you will know exactly what performance to expect.
Does multi-threading work for this problem? Yes. It is rare that a single CPU core can saturate the entire memory bandwidth of the system. Modern computers balance the available memory bandwidth against the number of cores available. From my experience, you will need around half of the cores to saturate the memory bandwidth with a simple memcpy operation. It might take a few more if you do some calculations along the way.
Note that on NUMA systems you will need to bind the threads to cpu cores and use local memory allocation to get optimal results. This is because on such systems every CPU has its own local memory, to which the access is the fastest. You can still access the entire system memory like on usual SMPs, but this incurs communication cost - CPUs have to explicitly exchange data. Binding threads to CPUs and using local allocation is extremely important. Failing to do this kills the scalability. Check libnuma if you want to do this on Linux.