OpenMP 'slower' on iMac? (C++) - c++

I have a small C++ program using OpenMP. It works fine on Windows 7, Core i7 with Visual Studio 2010. On an iMac with a Core i7 and g++ v4.2.1, the code runs much more slowly using 4 threads than it does with just one. The same 'slower' behavior is exhibited on 2 other Red Hat machines using g++.
Here is the code:
int iHundredMillion = 100000000;
int iNumWorkers = 4;
std::vector<Worker*> workers;
for(int i=0; i<iNumWorkers; ++i)
{
Worker * pWorker = new Worker();
workers.push_back(pWorker);
}
int iThr;
#pragma omp parallel for private (iThr) // Parallel run
for(int k=0; k<iNumWorkers; ++k)
{
iThr = omp_get_thread_num();
workers[k]->Run( (3)*iHundredMillion, iThr );
}
I'm compiling with g++ like this:
g++ -fopenmp -O2 -o a.out *.cpp
Can anyone tell me what silly mistake I'm making on the *nix platform?

I'm thinking that the g++ compiler is not optimizing as well as the visual studio compiler. Can you try other optimization levels (like -O3) and see if it makes a difference?
Or you could try some other compiler. Intel offers free compilers for linux for non-commercial purposes.
http://software.intel.com/en-us/articles/non-commercial-software-development/

It's impossible to answer given the information provided, but one guess could be that your code is designed so it can't be executed efficiently on multiple threads.
I haven't worked a lot with OMP, but I believe it is allowed to use fewer worker threads than specified. In that case, some implementations could be clever enough to realize that the code can't be efficiently parallelized, and just run it on a single thread, while others naively try to run it on 4 cores and suffer the performance penalty (due to false (or real) sharing, for example).
Some of the information that'd be necessary in order to give you a reasonable answer is:
the actual timings (how long does the code take to run on a single thread? How long with 4 threads using OMP? How long with 4 threads using "regular" threads?)
the data layout: which data is allocated where, and when is it accessed?
what actually happens inside the loop? All we can see at the moment is a multiplication and a function call. As long as we don't know what happens inside the function, you might as well have posted this code: foo(42) and asked why it doesn't return the expected result.
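The false sharing mentioned above can be illustrated with a minimal sketch. Nothing is known about what Worker::Run actually does, so this is only an illustration: PaddedCounter and run_workers are hypothetical names, the 64-byte cache line is an assumption about the hardware, and C++17 is assumed so that std::vector honors the over-alignment.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-worker state. alignas(64) assumes a 64-byte cache line,
// so two workers' counters never sit in the same line (avoids false sharing).
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Runs num_workers independent loops; each worker touches only its own counter.
long run_workers(int num_workers, long iterations) {
    std::vector<PaddedCounter> counters(num_workers);

    #pragma omp parallel for
    for (int k = 0; k < num_workers; ++k) {
        long local = 0;                 // accumulate in a thread-local variable
        for (long i = 0; i < iterations; ++i)
            local += i & 1;             // stand-in for the real work
        counters[k].value = local;      // single write to the shared array
    }

    long total = 0;
    for (const PaddedCounter& c : counters) total += c.value;
    return total;
}
```

Timing run_workers(4, n) with and without -fopenmp, and with and without the alignas, is one way to see whether shared cache lines matter on a given machine.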

Related

How can I multithread this code snippet in C++ with Eigen

I'm trying to implement a faster version of the following code fragment:
Eigen::VectorXd dTX = (( (XPSF.array() - x0).square() + (ZPSF.array() - z0).square() ).sqrt() + txShift)*fs/c + t0*fs;
Eigen::VectorXd Zsq = ZPSF.array().square();
Eigen::MatrixXd idxt(XPSF.size(),nc);
for (int i = 0; i < nc; i++) {
idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq.array()).sqrt()*fs/c + dTX.array();
idxt.col(i) = (abs(XPSF.array()-xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i),-1);
}
The sample array sizes I'm working with right now are:
XPSF: Column Vector of 591*192 coefficients (113,472 total values in the column vector)
ZPSF: Same size as XPSF
xe: RowVector of 192 coefficients
idxt: Matrix of 113,472x192 size
Current runs with gcc and -msse2 and -O3 optimization yield an average time of ~0.08 seconds for the first line of the loop and ~0.03 seconds for the second line of the loop. I know that runtimes are platform dependent, but I believe that this still can be much faster. A commercial software package performs the operations I'm trying to do here in roughly two orders of magnitude less time. Also, I suspect my code is a bit amateurish right now!
I've tried reading over Eigen documentation to understand how vectorization works, where it is implemented and how much of this code might be "implicitly" parallelized by Eigen, but I've struggled to keep track of the details. I'm also a bit new to C++ in general, but I've seen the documentation and other resources regarding std::thread and have tried to combine it with this code, but without much success.
Any advice would be appreciated.
Update:
Update 2
I would upvote Soleil's answer, because it contains helpful information, if I had the reputation score for it. However, I should clarify that I would first like to figure out what optimizations I can do without a GPU. I'm convinced (albeit without OpenMP) that Eigen's inherent multithreading and vectorization won't speed it up any further (unless there are unnecessary temporaries being generated). How could I use something like std::thread to explicitly parallelize this? I'm struggling to combine std::thread and Eigen to this end.
OpenMP
If your CPU has enough cores and threads, a simple and quick first step is usually to invoke OpenMP by adding the pragma:
#pragma omp parallel for
for (int i = 0; i < nc; i++)
and compile with /openmp (cl) or -fopenmp (gcc). With gcc you can alternatively pass -ftree-parallelize-loops=n to let the compiler try to parallelize loops automatically.
The loop iterations are then distributed over the threads your CPU can run concurrently (8 threads with the 7700HQ). Each iteration writes its own column of idxt, so no reduction step is needed.
In general you also can set a clause num_threads(n) where n is the desired number of threads:
#pragma omp parallel num_threads(8)
Where I used 8 since the 7700HQ can handle 8 concurrent threads.
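As a rough sketch of how the pragma and the num_threads clause fit a column loop like the one in the question, here is a plain C++ version without Eigen. The function name fill_columns, the column-major std::vector layout, and the parameter n_threads are assumptions made for illustration; only the square-root term of the original expression is reproduced.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major result: element (r, i) lives at out[i * rows + r].
// x, zsq and xe are hypothetical stand-ins for XPSF, Zsq and xe.
// n_threads must be positive.
std::vector<double> fill_columns(const std::vector<double>& x,
                                 const std::vector<double>& zsq,
                                 const std::vector<double>& xe,
                                 int n_threads) {
    const std::size_t rows = x.size();
    const int nc = static_cast<int>(xe.size());
    std::vector<double> out(rows * xe.size());

    // Each iteration writes a different column, so no synchronization is needed.
    #pragma omp parallel for num_threads(n_threads)
    for (int i = 0; i < nc; ++i) {
        double* col = out.data() + static_cast<std::size_t>(i) * rows;
        for (std::size_t r = 0; r < rows; ++r) {
            const double dx = x[r] - xe[i];
            col[r] = std::sqrt(dx * dx + zsq[r]);
        }
    }
    return out;
}
```

Without -fopenmp (or /openmp) the pragma is ignored and the loop runs on one thread, which makes it easy to compare both builds.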
TBB
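With TBB you can parallelize the loop using tbb::parallel_for. Note that a compiler hint such as the following only unrolls the loop; it does not create threads: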
#pragma unroll
for (int i = 0; i < nc; i++)
threading integrated with eigen
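With Eigen, the number of threads can be set in any of the following ways (an environment variable, the OpenMP API, or Eigen's own API):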
OMP_NUM_THREADS=n ./my_program
omp_set_num_threads(n);
Eigen::setNbThreads(n);
remarks with multithreading with eigen
However, in the FAQ:
"currently Eigen parallelizes only general matrix-matrix products (bench), so it doesn't by itself take much advantage of parallel hardware."
In general, OpenMP does not always bring an improvement, so benchmark the release build. Another way is to make sure that you're using vectorized instructions.
Again, from the FAQ/vectorization:
How can I enable vectorization?
You just need to tell your compiler to enable the corresponding
instruction set, and Eigen will then detect it. If it is enabled by
default, then you don't need to do anything. On GCC and clang you can
simply pass -march=native to let the compiler enables all instruction
set that are supported by your CPU.
On the x86 architecture, SSE is not enabled by default by most
compilers. You need to enable SSE2 (or newer) manually. For example,
with GCC, you would pass the -msse2 command-line option.
On the x86-64 architecture, SSE2 is generally enabled by default, but
you can enable AVX and FMA for better performance
On PowerPC, you have to use the following flags: -maltivec
-mabi=altivec, for AltiVec, or -mvsx for VSX-capable systems.
On 32-bit ARM NEON, the following: -mfpu=neon -mfloat-abi=softfp|hard,
depending if you are on a softfp/hardfp system. Most current
distributions are using a hard floating-point ABI, so go for the
latter, or just leave the default and just pass -mfpu=neon.
On 64-bit ARM, SIMD is enabled by default, you don't have to do
anything extra.
On S390X SIMD (ZVector), you have to use a recent gcc (version >5.2.1)
compiler, and add the following flags: -march=z13 -mzvector.
multithreading with cuda
Given the size of your arrays, you may want to try offloading to a GPU to reach microsecond-scale runtimes; in that case you would typically have as many threads as there are elements in your array.
For a simple start, if you have an NVIDIA card, look at cuBLAS, which also lets you use the tensor cores (fused multiply-add, etc.) of the latest generations, unlike a regular kernel.
Since Eigen is a header-only library, it makes sense that you could use it in a CUDA kernel.
You may also implement everything "by hand" (i.e., without Eigen) with regular kernels. This makes little sense from an engineering standpoint, but it is common practice in education/university projects in order to understand everything.
multithreading with OneAPI and Intel GPU
Since you have a skylake architecture, you also can unroll your loop on your CPU's GPU with OneAPI:
// Unroll loop as specified by the unroll factor.
#pragma unroll unroll_factor
for (int i = 0; i < nc; i++)
(from the sample).

How is cilk reduce done (thread vs SIMD)

I have something like that :
for (b=from; b<to; b++)
{
for (a=from2; a<to2; a++)
{
dest->ac[b] += srcvec->ac[a] * srcmatrix->weight[a+(b+from)*matrix_width];
}
}
that I'd like to parallelize using Cilk. I have written the following code:
for ( b=from; b<to; b++)
{
dest->ac[b] += __sec_reduce_add(srcvec->ac[from2:to2-from2] * (srcmatrix->weight+(b*matrix_width))[from2:to2-from2]);
}
The thing is, I could use a cilk_for on the outer loop, but if the reduce operation is already spawning threads, won't the cilk_for add to the thread overhead and slow the whole thing down?
Also, should I add restrict to the dest and src arguments to further help the compiler, or is it implicit in this case?
(ps: I can't try the code right now because of
internal compiler error: in find_rank, at
c-family/array-notation-common.c:244
on
neu1b->ac[0:layer1_size]=neu1->ac[0:layer1_size];
which I am also trying to solve.)
restrict is not implicit here. Furthermore, Cilk is implemented using the work-stealing concept, so it does not necessarily spawn extra threads for extra work: it pushes tasks onto a work stack. More info about the internal workings can be found in the Cilk FAQ. The Intel compiler might handle things differently than GCC with Cilk. Intel VTune and the Intel vectorization report can help you measure performance differences and indicate whether the code is compiled to SIMD or not. With the Intel compiler you can also indicate SIMD operations as follows:
#pragma simd above your loop
or
array notations:
a[:] = b[:] + c[:] to program vectorized array operations.
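For readers without a Cilk-capable compiler, the same per-row reduction can be sketched with a different technique: an OpenMP simd reduction in standard C++. This is not Cilk array notation, the function name accumulate_rows and its signature are hypothetical, and the row offset b * width follows the array-notation version in the question rather than the (b+from) form of the first loop.

```cpp
#include <cstddef>
#include <vector>

// dest[b] += sum over a in [from2, to2) of src[a] * weight[a + b * width],
// for b in [from, to). Names mirror the question; the signature is hypothetical.
void accumulate_rows(std::vector<double>& dest,
                     const std::vector<double>& src,
                     const std::vector<double>& weight,
                     std::size_t width,
                     std::size_t from, std::size_t to,
                     std::size_t from2, std::size_t to2) {
    for (std::size_t b = from; b < to; ++b) {
        const double* row = weight.data() + b * width;
        double sum = 0.0;
        // Requests a SIMD reduction inside one thread; no threads are spawned.
        #pragma omp simd reduction(+:sum)
        for (std::size_t a = from2; a < to2; ++a)
            sum += src[a] * row[a];
        dest[b] += sum;
    }
}
```

Because the inner reduction stays within one thread, putting a parallel construct on the outer loop over b would not stack thread overhead on top of it.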

Strange ratio in speedup between release and debug builds in game "Life"

I wrote the classic game "Life" with 4-sided neighbors. When I run a debug build, it reports:
Consecutive version: 4.2s
Parallel version: 1.5s
Okay, that's good. But if I run a release build, it reports:
Consecutive version: 0.46s
Parallel version: 1.23s
Why? I run it on a computer with 4 cores, and I run 4 threads in the parallel section. The answer is correct. But somewhere time is being lost and I can't find where. Can anybody help me?
I tried running it in Visual Studio 2008 and 2012. The results are the same. OMP is enabled in the project settings.
To reproduce my problem, find the defined constant PARALLEL and set it to 1 or 0 to enable or disable OMP respectively. The answer will be in out.txt (out.txt - right answer example). The input must be in in.txt (my input - in.txt). There are some Russian symbols; you don't need to understand them, but the first number in in.txt is the number of threads to run in the parallel section (it's 4 in the example).
The main part is in the StartSimulation function. If you run the program, you will see some Russian text with the running time in the console.
The program code is fairly big, so I have put it on a file hosting service - main.cpp (l2 means "lab 2" for me)
Some comments about the StartSimulation function: I cut the 2D surface of cells into small rectangles. This is done by the AdjustKernelsParameters function.
I do not find the ratio so strange. Having multiple threads co-operate is a complex business and has overheads.
Access to shared memory needs to be serialized which normally involves some form of locking mechanism and contention between threads where they have to wait for the lock to be released.
Such shared variables need to be synchronized between the processor cores which can give significant slowdowns. Also the compiler needs to treat these critical areas differently as a "sequence point".
All this reduces the scope for per thread optimization both in the processor hardware and the compiler for each thread when it is working with the shared variable.
It seems that in this case the overheads of parallelization outweigh the optimization possibilities for the single threaded case.
If there were more work for each thread to do independently before it needed to access a shared variable, then these overheads would be less significant.
You are using a guided loop schedule. This is a very bad choice given that you are dealing with a regular problem, where each task can easily do exactly the same amount of work as any other if the domain is simply divided into chunks of equal size.
Replace schedule(guided) with schedule(static). Also use a sum reduction over livingCount instead of locked increments:
#if PARALLEL == 1
#pragma omp parallel for schedule(static) num_threads(kernelsCount) \
reduction(+:livingCount)
#endif
for (int offsetI = 0; offsetI < n; offsetI += kernelPartSizeN)
{
for (int offsetJ = 0; offsetJ < m; offsetJ += kernelPartSizeM)
{
int boundsN = min(kernelPartSizeN, n - offsetI),
boundsM = min(kernelPartSizeM, m - offsetJ);
for (int kernelOffsetI = 0; kernelOffsetI < boundsN; ++kernelOffsetI)
{
for (int kernelOffsetJ = 0; kernelOffsetJ < boundsM; ++kernelOffsetJ)
{
if(BirthCell(offsetI + kernelOffsetI, offsetJ + kernelOffsetJ))
{
++livingCount;
}
}
}
}
}
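Stripped of the tiling, the reduction idea can be shown in a few self-contained lines. This is a sketch under assumptions: count_living and the row-major 0/1 grid layout are hypothetical and are not taken from the asker's main.cpp.

```cpp
#include <vector>

// Counts live cells in a row-major grid of 0/1 values (hypothetical layout).
int count_living(const std::vector<int>& grid, int n, int m) {
    int living = 0;
    // Each thread keeps a private partial count; OpenMP adds them together
    // at the end, so the loop body needs no lock or atomic increment.
    #pragma omp parallel for schedule(static) reduction(+:living)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            living += grid[i * m + j];
    return living;
}
```

With schedule(static) the rows are split into equal contiguous chunks, which suits a regular grid where every row costs the same.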

Make g++ produce a program that can use multiple cores?

I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any command I can use with g++ to make the resulting .exe use multiple cores, i.e. make the first for loop run on the first core and the second for loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers at around 25%.
EDIT:
Here is my code, in case it helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
float *bob = new float[50102133];
float *jim = new float[50102133];
float *joe = new float[50102133];
int i,j,k,l;
//cout << "Starting test...";
for (i=0;i<50102133;i++)
bob[i] = sin(i);
for (j=0;j<50102133;j++)
bob[j] = sin(j*j);
for (k=0;k<50102133;k++)
bob[k] = sin(sqrt(k));
for (l=0;l<50102133;l++)
bob[l] = cos(l*l);
cout << "finished test.";
cout << "the 100120 element is," << bob[1001200];
return 0;
}
The most obvious choice would be to use OpenMP. Assuming your loop is one where it's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma omp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
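As a minimal sketch of that advice applied to a loop shaped like the first one in the question (the directive is spelled #pragma omp parallel for), assuming a hypothetical helper named fill_sin and a std::vector in place of the raw array:

```cpp
#include <cmath>
#include <vector>

// Fills out[i] = sin(i). Iterations are independent of each other,
// so OpenMP can hand different index ranges to different threads.
std::vector<float> fill_sin(int n) {
    std::vector<float> out(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        out[i] = static_cast<float>(std::sin(static_cast<double>(i)));
    return out;
}
```

Building with g++ -O2 -fopenmp enables the directive; without -fopenmp it is ignored and the loop runs serially, which gives a baseline for comparison.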
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>
static const int size = 1024 * 1024 * 128;
int main(){
double total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
std::cout << total << "\n";
}
The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.
Use threads or processes; you may want to look at OpenMP.
C++11 added support for threading, but C++ compilers won't/can't do any threading on their own.
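What the question literally asks for, one loop per core at the same time, can be sketched with OpenMP sections as a stand-in for hand-written threads. The function fill_two is hypothetical, and the sketch assumes the two loops write to different arrays; the loops in the question all write to bob, so running those concurrently as written would be a data race.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Runs two independent loops concurrently, one per section.
// Each section writes only its own array, so there is no data race.
void fill_two(std::vector<float>& a, std::vector<float>& b) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (std::size_t i = 0; i < a.size(); ++i)
                a[i] = static_cast<float>(std::sin(static_cast<double>(i)));
        }
        #pragma omp section
        {
            for (std::size_t j = 0; j < b.size(); ++j)
                b[j] = static_cast<float>(std::cos(static_cast<double>(j)));
        }
    }
}
```

This uses at most two threads, so splitting the iterations of each loop with a parallel for will usually scale further on machines with more cores.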
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.
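To make the distinction concrete, here is a sketch of the kind of loop an auto-vectorizer handles well. The function saxpy is a standard textbook example chosen for illustration, not code from the question; whether a given loop is actually vectorized depends on the compiler, the flags, and the target.

```cpp
#include <cstddef>
#include <vector>

// y[i] = a * x[i] + y[i]. No iteration depends on another, so the compiler
// can process several elements per instruction within a single thread.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    const float* xp = x.data();
    float* yp = y.data();
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```

With GCC, adding -fopt-info-vec to -O3 (or to -O2 -ftree-vectorize) prints which loops were vectorized. CPU usage stays at one core either way, since SIMD works inside a single thread.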

OpenMP C++ parallel performance better dualcore laptop than eight cores cluster

First of all, OpenMP obviously only runs on one of the motherboards in the cluster; in this case each motherboard has two quad-core Xeon E5405s at 2GHz and is running Scientific Linux 5.3 (released in 2009, Red Hat based). My laptop, on the other hand, has a Core 2 Duo T7300 at 2GHz running Windows 7. No hyperthreading on either machine.
The main problem is that I have OOP code that generally runs for around 2min in serial on both systems, but when I implement OpenMP in a nested loop it experiences an expected reduction in time on my laptop (when 2 threads are used) and a significant increase in time on the server (around 5min with two threads, for example).
There are two classes, "cube" and "space". Space contains a three dimensional array (20x20x20) of cubes and the code that I am trying to parallelise is a three way nested loop that calls a member function of cube for each of the cubes. This member function has three arguments (doubles) and does some calculations based on the private variables of each cube.
inline void space::cubes_refresh(const double vsx, const double vsy, const double vsz) {
int loopx, loopy, loopz;
#pragma omp parallel private(loopx, loopy, loopz)
{
#pragma omp for schedule(guided,1) nowait
for(loopx=0 ; loopx<cubes_w ; loopx++) {
for(loopy=0 ; loopy<cubes_h ; loopy++) {
for(loopz=0 ; loopz<cubes_d ; loopz++) {
// Refreshing the values in source
if ( (loopx==source_x)&&(loopy==source_y)&&(loopz==source_z) )
cube_array[loopx][loopy][loopz].refresh(0.0,0.0,vsz);
// refresh everything else
else
cube_array[loopx][loopy][loopz].refresh(0.0,0.0,0.0);
}
}
} // End of loopx loop
} // End of parallel region
}
I don't know where the problem could be, as I have said before, in my laptop I see an expected improvement in performance, but exactly the same code in the server does significantly worse.
These are the flags I use in my laptop (have tried using exactly the same flags, but nothing):
g++ -std=c++98 -fopenmp -O3 -Wl,--enable-auto-import -pedantic main.cpp -o parallel_openmp
And in the server:
g++ -std=c++98 -fopenmp -O3 -W -pedantic main.cpp -o parallel_openmp
I'm running gcc version 4.5.0 and the server is running 4.1.2. I don't know the OpenMP version on the server, as I don't know how to check it; I think it is a version before 3.0, as collapse on loops does not work. Could this be the problem?
gcc did not support OpenMP until 4.2; OpenMP 3.0 was supported starting in gcc 4.4. Your operating system vendor may have backported the changes to 4.1.2.
The only thing I can think of that may be causing the problem is that, for some reason, on the server all the threads accessing the cube member array cause a lot of cache misses, but wouldn't this also happen in the program running on my laptop?
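Since collapse is unavailable before OpenMP 3.0, the triple loop can be flattened by hand so that all cells, rather than only the outer iterations, are divided among threads. The sketch below is an illustration under assumptions: Cube is a hypothetical stand-in for the question's cube class, the cubes are stored in one flat std::vector, and a static schedule replaces guided. Whether this changes the behavior seen on the server is not known.

```cpp
#include <vector>

// Hypothetical stand-in for the question's cube class.
struct Cube {
    double vz = -1.0;
    void refresh(double, double, double vsz) { vz = vsz; }
};

// Flattens the w*h*d triple loop into a single loop. This does by hand
// what collapse(3) does in OpenMP 3.0 and later.
void cubes_refresh(std::vector<Cube>& cubes, int w, int h, int d,
                   int source_x, int source_y, int source_z, double vsz) {
    const int total = w * h * d;
    #pragma omp parallel for schedule(static)
    for (int idx = 0; idx < total; ++idx) {
        // Recover the 3D coordinates from idx = (x * h + y) * d + z.
        const int x = idx / (h * d);
        const int y = (idx / d) % h;
        const int z = idx % d;
        const bool is_source =
            (x == source_x && y == source_y && z == source_z);
        cubes[idx].refresh(0.0, 0.0, is_source ? vsz : 0.0);
    }
}
```

On the server's gcc, printing the _OPENMP macro from a file compiled with -fopenmp shows the supported specification date (200505 corresponds to OpenMP 2.5, 200805 to 3.0).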