I am working on parallel algorithms using OpenMP. Judging from the CPU usage, much of the "sequential" code I write is actually executed in parallel.
For example:
#pragma omp parallel for if (par == "parallel")
for (int64_t u = 1; u <= n; ++u) {
    for (int64_t v = u + 1; v <= n; ++v) {
        ...
    }
}
This is conditionally parallel if a flag is set. With the flag set, I see CPU usage of 1500% on a 16-core machine. With the flag not set, I still see 250% CPU usage.
I suppose this is due to some autoparallelization going on. Correct? Does GCC do this?
Since I need to compare sequential and parallel running times, I would like code not annotated with #pragma omp parallel (etc.) to run on one CPU only. Can I achieve this easily? Is there a GCC flag by which I can switch off autoparallelization and have parallelism only where I explicitly annotate with OpenMP?
Note that the OpenMP if clause exerts run-time rather than compile-time control over the concurrency. If the condition evaluates to false when the program is executed, the parallel region is deactivated by setting the number of threads in its team to 1, but the region is still expanded into several runtime calls and a separate (outlined) function for its body, even though this does not lead to parallel execution. The OpenMP runtime might also keep a pool of worker threads busy-waiting for work.
The only way to guarantee that your OpenMP code compiles into a truly serial executable (provided you do not link against parallel libraries) is to compile with OpenMP support disabled. In your case that means not passing the -fopenmp option to GCC when the code is compiled.
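As a quick sanity check (a minimal sketch, using a plain bool flag instead of the string comparison above), you can print the team size to confirm that a false if clause only shrinks the team to one thread; it does not remove the OpenMP runtime from the picture:

#include <omp.h>
#include <cstdio>

int main() {
    bool par = false;  // hypothetical stand-in for the (par == "parallel") condition

    // With par == false the team has exactly one thread, but the OpenMP
    // runtime is still initialized and may keep worker threads alive.
    #pragma omp parallel if (par)
    {
        #pragma omp single
        std::printf("threads in team: %d\n", omp_get_num_threads());
    }
    return 0;
}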
I'm trying to implement a faster version of the following code fragment:
Eigen::VectorXd dTX = (( (XPSF.array() - x0).square() + (ZPSF.array() - z0).square() ).sqrt() + txShift)*fs/c + t0*fs;
Eigen::VectorXd Zsq = ZPSF.array().square();
Eigen::MatrixXd idxt(XPSF.size(),nc);
for (int i = 0; i < nc; i++) {
    idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq.array()).sqrt()*fs/c + dTX.array();
    idxt.col(i) = (abs(XPSF.array()-xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i),-1);
}
The sample array sizes I'm working with right now are:
XPSF: Column Vector of 591*192 coefficients (113,472 total values in the column vector)
ZPSF: Same size as XPSF
xe: RowVector of 192 coefficients
idxt: Matrix of 113,472x192 size
Current runs with gcc and -msse2 and -O3 optimization yield an average time of ~0.08 seconds for the first line of the loop and ~0.03 seconds for the second line of the loop. I know that runtimes are platform dependent, but I believe this can still be much faster. A commercial package performs the operations I'm trying to do here in roughly two orders of magnitude less time. Also, I suspect my code is a bit amateurish right now!
I've tried reading over the Eigen documentation to understand how vectorization works, where it is implemented, and how much of this code might be "implicitly" parallelized by Eigen, but I've struggled to keep track of the details. I'm also a bit new to C++ in general; I've looked at the documentation and other resources regarding std::thread and have tried to combine it with this code, but without much success.
Any advice would be appreciated.
Update 2
I would upvote Soleil's answer, since it contains helpful information, if I had the reputation score for it. However, I should clarify that I would like to first figure out what optimizations I can do without a GPU. I'm convinced that Eigen's inherent multithreading and vectorization (without OpenMP, that is) won't speed it up any further (unless there are unnecessary temporaries being generated). How could I use something like std::thread to explicitly parallelize this? I'm struggling to combine std::thread and Eigen to this end.
OpenMP
If your CPU has enough cores and threads, a simple and quick first step is usually to invoke OpenMP by adding the pragma:
#pragma omp parallel for
for (int i = 0; i < nc; i++)
and compile with /openmp (cl) or -fopenmp (gcc); alternatively, GCC's -ftree-parallelize-loops=N can auto-parallelize loops without any pragma.
This distributes (maps) the loop iterations over the number of parallel threads your CPU can handle (8 threads with the 7700HQ).
In general you can also add a num_threads(n) clause, where n is the desired number of threads:
#pragma omp parallel num_threads(8)
Where I used 8 since the 7700HQ can handle 8 concurrent threads.
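Applied to the loop from your question, a minimal sketch (reusing your Eigen expressions unchanged, with the 8 threads mentioned above) could look like:

// Each column of idxt is computed independently, so the iterations can be
// distributed across threads without synchronization.
#pragma omp parallel for num_threads(8)
for (int i = 0; i < nc; i++) {
    idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq.array()).sqrt()*fs/c + dTX.array();
    idxt.col(i) = (abs(XPSF.array()-xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i),-1);
}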
TBB
You can also parallelize the loop with TBB (for example with tbb::parallel_for), or simply ask the compiler to unroll it:
#pragma unroll
for (int i = 0; i < nc; i++)
threading integrated with eigen
With Eigen you can control the number of threads in any of the following ways (a combined sketch follows below):
OMP_NUM_THREADS=n ./my_program
omp_set_num_threads(n);
Eigen::setNbThreads(n);
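Combined, a minimal sketch (n = 8 is just the value used above) could be:

#include <Eigen/Dense>
#include <omp.h>

int main() {
    const int n = 8;          // desired number of threads (assumption: 8 as above)
    omp_set_num_threads(n);   // threads used by OpenMP parallel regions
    Eigen::setNbThreads(n);   // threads Eigen may use internally (matrix-matrix products)
    // ... Eigen computations ...
    return 0;
}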
remarks on multithreading with eigen
However, in the FAQ:
"currently Eigen parallelizes only general matrix-matrix products (bench), so it doesn't by itself take much advantage of parallel hardware."
In general, the improvement from OpenMP is not always there, so benchmark the release build. Another way is to make sure that you're using vectorized instructions.
Again, from the FAQ/vectorization:
How can I enable vectorization?
You just need to tell your compiler to enable the corresponding
instruction set, and Eigen will then detect it. If it is enabled by
default, then you don't need to do anything. On GCC and clang you can
simply pass -march=native to let the compiler enable all instruction
sets that are supported by your CPU.
On the x86 architecture, SSE is not enabled by default by most
compilers. You need to enable SSE2 (or newer) manually. For example,
with GCC, you would pass the -msse2 command-line option.
On the x86-64 architecture, SSE2 is generally enabled by default, but
you can enable AVX and FMA for better performance.
On PowerPC, you have to use the following flags: -maltivec
-mabi=altivec, for AltiVec, or -mvsx for VSX-capable systems.
On 32-bit ARM NEON, the following: -mfpu=neon -mfloat-abi=softfp|hard,
depending if you are on a softfp/hardfp system. Most current
distributions are using a hard floating-point ABI, so go for the
latter, or just leave the default and just pass -mfpu=neon.
On 64-bit ARM, SIMD is enabled by default, you don't have to do
anything extra.
On S390X SIMD (ZVector), you have to use a recent gcc (version >5.2.1)
compiler, and add the following flags: -march=z13 -mzvector.
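To verify which instruction sets Eigen actually picked up at compile time, you can print Eigen::SimdInstructionSetsInUse() (a minimal check):

#include <Eigen/Core>
#include <iostream>

int main() {
    // Prints the SIMD instruction sets Eigen was compiled with, e.g. "SSE, SSE2".
    std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
    return 0;
}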
multithreading with cuda
Given the size of your arrays, you want to try offloading the work to a GPU to reach microsecond-level runtimes; in that case you would (typically) have as many threads as the number of elements in your array.
For a simple start, if you have an NVIDIA card, you want to look at cuBLAS, which also lets you use the tensor cores (fused multiply-add, etc.) of the latest generations, unlike regular kernels.
Since Eigen is a header-only library, it makes sense that you could use it in a CUDA kernel.
You may also implement everything "by hand" (i.e., without Eigen) with regular kernels. This makes little sense in engineering terms, but is common practice in education/university projects, in order to understand everything.
multithreading with OneAPI and Intel GPU
Since you have a Skylake architecture, you can also offload your loop to your CPU's integrated GPU with OneAPI and unroll it:
// Unroll loop as specified by the unroll factor.
#pragma unroll unroll_factor
for (int i = 0; i < nc; i++)
(from the sample).
I am new to OpenMP and I have this code for a sparse matrix-vector multiplication; it runs in 40-50 seconds and achieves 4237 MFlops/s in total. Is there any way to make it faster?
I have edited the post with the complete code, and as input I have 2 matrices, one with 50000 elements and the second with 400000.
The main problem is that whenever I try something different, the time gets even worse.
#pragma omp parallel for schedule(static,50)
for (int i = 0; i < (tInput->stNumRows); ++i) {
    y[i] = 0.0;
    for (int j = Arow[i]; j < Arow[i+1]; ++j)
        y[i] += Aval[j] * x[Acol[j]];
}
One thing you can do to improve the performance of the code is to use vectorization (thanks to SIMD instructions). Here is the resulting code:
for (int i = 0; i < (tInput->stNumRows); ++i) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (int j = Arow[i]; j < Arow[i+1]; ++j)
        s += Aval[j] * x[Acol[j]];
    y[i] = s;
}
Note that y[i] is no longer read/written repeatedly inside the inner loop, enabling further compiler optimizations. Please take care to compile the code with -O3 (or /O2 for MSVC) for it to be effectively vectorized. However, this is probably not enough for this code to be vectorized.
Indeed, one issue with this code is the memory indirection x[Acol[j]], which is very hard to vectorize efficiently. Recent x86-64 processors (the ones with AVX2) and very recent ARM processors (the ones with SVE) have SIMD instructions for this (although they are still not great due to the memory access pattern). Without these instructions, no compiler will likely vectorize the code. Thus, you should tell your compiler that it can use these instructions (assuming the target processor is actually recent). For GCC/Clang, one way is to use the non-portable -march=native. Another way is to use -mavx2 combined with -mfma on x86-64 processors (although this does not seem to be as good as -march=native in this case, for rather complex reasons).
Another way to improve the code is to mitigate possible load-balancing issues and unwanted overheads. Indeed, load-balancing issues can appear in your code if the expression Arow[i+1]-Arow[i] differs a lot between i values. In that case, you can use a guided schedule or a dynamic one. However, keep in mind that a non-static schedule may introduce a significant overhead (especially if the inner loop is very small or the gap between values is huge). Finally, you can move the omp parallel directive outside the timing loop body, since thread creation can introduce a significant overhead (depending on the target OpenMP runtime).
Note that the above solutions assume the input matrices are big enough so parallelism is useful. Moreover, if x is huge, the code will likely be bounded by the memory hierarchy and there is not much you can do. Sparse matrix computations are often slow because of such issues.
Here is the final code:
#pragma omp parallel
{
    // Timing loop
    // [...]

    #pragma omp for schedule(guided)
    for (int i = 0; i < (tInput->stNumRows); ++i) {
        double s = 0.0;
        #pragma omp simd reduction(+:s)
        for (int j = Arow[i]; j < Arow[i+1]; ++j)
            s += Aval[j] * x[Acol[j]];
        y[i] = s;
    }

    // [...]
}
EDIT: with your input data, the best solution on my machine (with Clang/IOMP) is not to use multiple threads at all, since 400000 elements can be computed in roughly 0.3 ms and the overhead of sharing the work between threads is larger than that.
I am using Nsight Eclipse Edition 10.2 to debug plain C++ code with gdb 7.11.1.
The code uses an OpenMP pragma to fork a for-loop.
The following is a minimal working example,
where a simple array q is filled with values of another variable p:
#pragma omp parallel for schedule(static)
for (int p = pstart; p < pend; p++) {
    const unsigned i = id[p];
    if (start <= i && i < end)
        q[i - start] = p;
}
In debug mode I would want to use the step-into function (classically F5) to follow how the array q gets filled with p's. However, that steps over the for loop altogether and resumes where the parallel threads join again.
Is there a way to force stepping into a pragma directive/openMP loop?
Is there a way to force stepping into a pragma directive/openMP loop?
That will depend on the debugger, but it's also not entirely clear what it would mean. Since many threads execute the parallel loop, would you expect each of them to stop and then step together? How do you expect to show the different state of each thread? (each will have its own p and i). What happens if the thread control flow diverges?
There are debuggers which can do some of this (such as TotalView on Linux), but it's not trivial to do (and TotalView costs money [which is entirely fair and reasonable :-)]).
What you may need to do is set a breakpoint inside the loop, and then handle it being hit by N threads...
(Which doesn't answer your precise question, but does let you see what's going on in the loop, which is possibly what you really need to do!)
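For example, with the underlying gdb the session could look like this (a sketch; file.cpp:42 is a hypothetical location of the line q[i - start] = p; inside the loop):

(gdb) break file.cpp:42
(gdb) run
(gdb) info threads
(gdb) thread 3
(gdb) print p
(gdb) print i
(gdb) set scheduler-locking step
(gdb) next

Here info threads lists every OpenMP worker that hit the breakpoint, thread N switches between them, and set scheduler-locking step keeps the other threads stopped while you single-step the selected one.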
I have something like this:
for (b = from; b < to; b++)
{
    for (a = from2; a < to2; a++)
    {
        dest->ac[b] += srcvec->ac[a] * srcmatrix->weight[a + (b + from) * matrix_width];
    }
}
that I'd like to parallelize using Cilk. I have written the following code:
for (b = from; b < to; b++)
{
    dest->ac[b] += __sec_reduce_add(srcvec->ac[from2:to2-from2] * (srcmatrix->weight + (b*matrix_width))[from2:to2-from2]);
}
But the thing is, I could use a cilk_for on the outer loop; however, if the reduce operation is already spawning threads, won't the cilk_for add to the thread overhead and slow the whole thing down?
And should I add restrict to the dest and src arguments to further help the compiler, or is it implicit in this case?
(PS: I can't try the code right now because of
internal compiler error: in find_rank, at c-family/array-notation-common.c:244
on
neu1b->ac[0:layer1_size] = neu1->ac[0:layer1_size];
which I'm trying to solve as well.)
restrict is not implicitly applied here. Furthermore, Cilk is implemented using the work-stealing concept: it does not necessarily spawn extra threads for extra work, but pushes tasks onto a work stack. More info about the internal workings can be found in the Cilk FAQ. The Intel compiler might handle things differently than GCC with Cilk. Intel VTune and the Intel vectorization report can help you measure performance differences and indicate whether the code was compiled to SIMD or not. With the Intel compiler you can also indicate SIMD operations as follows:
#pragma simd
above your loop, or use array notation:
a[:] = b[:] + c[:]
to write vectorized array operations.
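Regarding restrict: here is a minimal sketch of how the qualifier could be applied to your arguments (the Layer/Matrix types and the wrapper function are hypothetical; GCC and ICC spell the qualifier __restrict__ in C++):

struct Layer  { double* ac; };      // hypothetical container for the 'ac' arrays
struct Matrix { double* weight; };  // hypothetical container for the weight matrix

// __restrict__ promises the compiler that dest, srcvec and srcmatrix never
// alias each other, which can enable better vectorization of the inner loop.
void accumulate(Layer* __restrict__ dest,
                const Layer* __restrict__ srcvec,
                const Matrix* __restrict__ srcmatrix,
                int from, int to, int from2, int to2, int matrix_width)
{
    for (int b = from; b < to; b++)
        for (int a = from2; a < to2; a++)
            dest->ac[b] += srcvec->ac[a] * srcmatrix->weight[a + (b + from) * matrix_width];
}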
I wrote the classic game "Life" with 4 neighbors per cell. When I run it in debug, it says:
Consecutive version: 4.2s
Parallel version: 1.5s
Okay, that's good. But if I run it in release, it says:
Consecutive version: 0.46s
Parallel version: 1.23s
Why? I run it on a computer with 4 cores and run 4 threads in the parallel section. The answer is correct, but somewhere there is a leak and I don't know where. Can anybody help me?
I tried running it in Visual Studio 2008 and 2012. The results are the same. OpenMP is enabled in the project settings.
To reproduce my problem, you can find the defined constant PARALLEL and set it to 1 or 0 to enable and disable OpenMP respectively. The answer will be in out.txt (out.txt - correct answer example). The input must be in in.txt (my input - in.txt). There are some Russian symbols; you don't need to understand them, but the first number in in.txt is the number of threads to run in the parallel section (it's 4 in the example).
The main part is in the StartSimulation function. If you run the program, you will see some Russian text with the running time in the console.
The program code is fairly big, so I added it via file hosting - main.cpp (l2 means "lab 2" for me).
Some comments about the StartSimulation function: I cut the 2D surface of cells into small rectangles. This is done by the AdjustKernelsParameters function.
I do not find the ratio so strange. Having multiple threads co-operate is a complex business and has overheads.
Access to shared memory needs to be serialized which normally involves some form of locking mechanism and contention between threads where they have to wait for the lock to be released.
Such shared variables need to be synchronized between the processor cores, which can cause significant slowdowns. The compiler also needs to treat these critical areas differently, as a "sequence point".
All this reduces the scope for per-thread optimization, both in the processor hardware and in the compiler, whenever a thread is working with the shared variable.
It seems that in this case the overheads of parallelization outweigh the optimization possibilities for the single threaded case.
If there were more work for each thread to do independently before needing to access a shared variable, then these overheads would be less significant.
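As an illustration of that cost (a minimal sketch, not the asker's code): a reduction lets every thread accumulate into a private copy and combine the copies once at the end, instead of synchronizing on a shared counter at every iteration.

#include <omp.h>

long countLiving(const int* cells, long n) {
    long living = 0;
    // Each thread gets a private copy of 'living'; the copies are summed
    // once when the loop finishes, avoiding per-iteration synchronization.
    #pragma omp parallel for reduction(+:living)
    for (long i = 0; i < n; ++i)
        if (cells[i]) ++living;
    return living;
}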
You are using a guided loop schedule. This is a very bad choice, given that you are dealing with a regular problem where each task can easily do exactly the same amount of work as any other if the domain is simply divided into chunks of equal size.
Replace schedule(guided) with schedule(static). Also employ sum reduction over livingCount instead of using locked increments:
#if PARALLEL == 1
#pragma omp parallel for schedule(static) num_threads(kernelsCount) \
        reduction(+:livingCount)
#endif
for (int offsetI = 0; offsetI < n; offsetI += kernelPartSizeN)
{
    for (int offsetJ = 0; offsetJ < m; offsetJ += kernelPartSizeM)
    {
        int boundsN = min(kernelPartSizeN, n - offsetI),
            boundsM = min(kernelPartSizeM, m - offsetJ);
        for (int kernelOffsetI = 0; kernelOffsetI < boundsN; ++kernelOffsetI)
        {
            for (int kernelOffsetJ = 0; kernelOffsetJ < boundsM; ++kernelOffsetJ)
            {
                if (BirthCell(offsetI + kernelOffsetI, offsetJ + kernelOffsetJ))
                {
                    ++livingCount;
                }
            }
        }
    }
}